What Would You Want In a Large-Scale Monitoring System?

Posted by timothy on Wednesday July 08, 2009 @05:30PM from the detect-curse-words-and-fine-a-quarter-apiece dept.

Krneki writes "I've been developing monitoring solutions for the last five years. I have used Cacti, Nagios, WhatsUP, PRTG, OpManager, MOM, Perl-script solutions, ... Today I changed employers and have been asked to develop a new monitoring solution from scratch (5,000 devices). My objective is to deliver a solution that covers network devices, servers, and applications. The final product must be very easy to understand, as it will also be used by help support to diagnose problems during the night. I need a powerful tool that will cover all I need and yet deliver a nice 2D map of the company IT infrastructure. I like Cacti, but usually I use it only for performance monitoring, since polling can't be set to a 5 or 10 sec interval for huge networks. I'm thinking about Nagios (but the 2D map is hard to understand), or maybe OpManager. What monitoring solution do you use and why?"

Before I get flamed... (Score:4, Interesting)

by jwilki1 ( 463599 ) writes: on Wednesday July 08, 2009 @05:37PM (#28628701)

I am going through this right now and am using or have used all of the above-mentioned solutions. We are leaning towards System Center Operations Manager. http://www.microsoft.com/systemcenter/operationsmanager/en/us/default.aspx If you had told me 6 months ago that it would be the way to go, I would have said over my dead body, but it has come a very long way in terms of usability and ease of setup.

Re:OpenNMS (Score:5, Interesting)

by mu51c10rd ( 187182 ) writes: on Wednesday July 08, 2009 @05:49PM (#28628863)

I use OpenNMS as well. I actually migrated off of Nagios to OpenNMS. Tried out Zenoss and Cacti as well. While any of these are better than OpenView IMHO, I liked OpenNMS's full suite of functionality without having to pay for the 'commercial' version.

Re:Zenoss (Score:2, Interesting)

by NuclearRampage ( 830297 ) writes: on Wednesday July 08, 2009 @05:49PM (#28628879)

A little tough to set up new SNMP devices, I thought, but overall a great product. Even the free version gets you quite far.

Nagios Might Work (Score:3, Interesting)

by hax4bux ( 209237 ) writes: on Wednesday July 08, 2009 @05:50PM (#28628881)

I spent last year converting a shop from OpenView to Nagios. They were in the same neighborhood as you (~5000 devices).
If you do not like the Nagios UI, you could create something else. The native Nagios UI is CGI based and implemented in C. The documentation is good and the sources are well commented.
The hardest decision about Nagios is how to implement the monitoring. I went w/SNMP (polling, not traps) for the most part. Sorting out all the Nagios plugins is something of a chore and many of them seem incomplete and abandoned.
MRTG also integrates w/Nagios, which can be useful.
Good luck.
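
For anyone sorting through the plugins mentioned above, here is a minimal sketch of the Nagios plugin convention: the check prints one status line (optionally with performance data after a pipe) and reports its state through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The check name "LOAD", the thresholds, and the function names are illustrative assumptions, and the measured value is passed in as an argument where a real check would poll it over SNMP.

```python
"""Sketch of a Nagios-style check; names and thresholds are illustrative."""

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a plugin state (higher values are worse)."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(name, value, state, warn, crit):
    """Build the single status line, with perfdata after the pipe."""
    if value is None:
        return f"{name} {LABELS[state]} - no data"
    perfdata = f"{name.lower()}={value};{warn};{crit}"
    return f"{name} {LABELS[state]} - value={value} | {perfdata}"


def main(argv, warn=80, crit=95):
    """Hypothetical entry point: argv[0] stands in for a polled value."""
    measured = float(argv[0]) if argv else None
    state = evaluate(measured, warn, crit)
    print(format_output("LOAD", measured, state, warn, crit))
    return state  # the caller passes this to sys.exit()
```

To use it as an actual plugin, the script would end with `sys.exit(main(sys.argv[1:]))` so that Nagios sees the state as the process exit code.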

ZenOSS all the way (Score:5, Interesting)

by Midnight Warrior ( 32619 ) writes: on Wednesday July 08, 2009 @05:52PM (#28628913) Homepage
We use ZenOSS [zenoss.com] exclusively at work and have enjoyed every minute of it. Pros include:
- 2D map with status of all nodes or submaps, organized by network
- Application monitoring, with more advanced maps available for purchase (Oracle, JBoss, Cisco) for those things you already paid a lot of money for
- Performance monitoring via SNMP or other data sources using RRDtool internally which includes graphs linked to each other during zoom in/out or panning
- Nagios plugins already do some of the heavy lifting
- Built-in support for watching Windows servers (any metric accessible via WMI)
- Access control using at least LDAP and Active Directory
- Secondary data collectors for those networks which are too big for just one central source
- Highly customizable through Python
- It has so, so much more than pathetic commercial solutions like OpenView
Cons:
- You have to keep your eye on the back end database
- It still takes a long, long time to tune it to remove noise events
- If you don't know Python, it can be tough in a few places
- Proper support is not cheap

Roll your own... (Score:2, Interesting)

by hofmny ( 1517499 ) writes: on Wednesday July 08, 2009 @06:01PM (#28629033)

I have looked at Cacti, Nagios, and a few others, but I think rolling your own is easy enough and gives you the best flexibility. You could also use Nagios, or others, for example, and simply pull the results into your own system.

I built and managed a software system for my previous employer called the SMS (Server Management System). It basically tracked 50 of our web servers, database servers, and Endeca (full text search) farms at data centers spread around the country. It was pretty simple.

The system did push and pull operations. First, the system was built in PHP.
In order to push commands to the servers I used the PEAR SSH2 class for communication once it became stable. Another option (and what I did back in 2003) was to use exec and other command line functions in PHP in conjunction with a SETUID script (written in C) -- which gave the command line output from PHP "true" rootly powers. The problem was I had to enter a password for each server I wanted to connect to, and the PHP functions couldn't handle real time input/output, so I designed the system to work by creating an SSH2 key pair on my master monitoring server and putting its public key on each of our external servers for passwordless SSH.

The pull part of the system simply had a PHP script running on a cron per server that would deliver information about the health of the server, its running processes, etc., to the main SMS server every 5 minutes. All load activity for all servers was logged to MySQL as well. The push operations were used to update those scripts, as well as to restart daemons on command, clear cache (such as after we did a database update), etc. It was a pretty robust system and really automated the functions of our company, to where we could perform a FULL Database Update to our 30 web servers simultaneously (using PHP and fork()), clear all cache, etc., in under an hour. We would then monitor the servers using the SMS's main screen, which showed real time server stats (updated every 5 minutes, or you could "force" a push operation to get the status). If we needed to roll back the update, that was a simple mouse click away too.
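
The per-server reporting script described above might look roughly like the following. This is only a sketch in Python (the original was PHP), and the collector URL, the field names, and the function names are all assumptions made for illustration. It gathers the hostname and load averages and wraps them in a JSON POST for the central server.

```python
import json
import os
import socket
import time
import urllib.request

# Hypothetical collector endpoint; not taken from the original system.
COLLECTOR_URL = "http://sms.example.internal/report"


def collect_health():
    """Gather a small health snapshot for this host."""
    try:
        load1, load5, load15 = os.getloadavg()  # available on Unix only
    except (AttributeError, OSError):
        load1 = load5 = load15 = None
    return {
        "host": socket.gethostname(),
        "timestamp": int(time.time()),
        "load": {"1m": load1, "5m": load5, "15m": load15},
    }


def build_request(url, report):
    """Wrap the snapshot in a JSON POST request without sending it."""
    body = json.dumps(report).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_report(url=COLLECTOR_URL, timeout=10):
    """Deliver one snapshot; meant to be run from cron every 5 minutes."""
    request = build_request(url, collect_health())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A crontab entry along the lines of `*/5 * * * * /usr/bin/python3 /opt/sms/report.py` (path hypothetical) would give the 5-minute interval, with the script calling `send_report()` at the bottom.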

I also had a hidden screen that let me run any series of commands as root on any number of servers. Everyone objected to it but I convinced my boss to let me put it in. All of our servers were a mouse click away from being "rm -rf *" 'ed. ROFL. Anyway, I hope my little story about my system helps you out, in either avoiding what I did (LOL) or by giving you ideas.

Re:A more interesting question (Score:4, Interesting)

by abigor ( 540274 ) writes: on Wednesday July 08, 2009 @06:26PM (#28629275)

The big questions are:
Will your solution need to support SNMPv3?
Do the devices you talk to have published OIDs?
Do you need source code to extend it?
If yes to these, OpenNMS is a great bet.

I Hate War Rooms (Score:5, Interesting)

by afabbro ( 33948 ) writes: on Wednesday July 08, 2009 @06:33PM (#28629355) Homepage
I really don't like the "War Room" video wall concept. I suspect such walls are made to look cool rather than to monitor.
What you want in large-scale monitoring is:
- The ability to map complex relationships. I don't want 50 alerts that I can't reach host X, host Y, etc. I want one alert that I can't reach router A. Even better, I want to map things so that I can say "end user application XYZ is not accessible in Kansas due to X being down".
- I want my monitoring solution to understand HA and service degradation. I want programmable rules about what happens when X is down or Y is down.
- I want many options for escalation. If X doesn't acknowledge, try Y after 15 mins, etc.
- I don't ever, ever want a pager to explode or be flooded. A problem should be noticed once and tracked. There should be no pager blizzards.
- Of course, I don't want this thing relying on my mail system for paging because, of course, my mail system could go down. An ability to dial out if the mail system is down would be nice.
- I want agents, hooks, interfaces, third-party add-ons, and every possible way of tying something into the monitoring system. I don't want dumb limitations like "you can only get an exit code from the OS and it acts on that" or something. For big monitoring, it's almost mandatory that some kind of API for agents is exposed.
- I want "I'm working on it, stop paging" blackouts. I want to be reminded to lift them.
- I want it to tie into my change-management system. If I open a ticket and say that server X is down for 2 hours on this date, I don't want to have to remember to black it out.
- I want reports. I don't care about silly little charts and graphs, but a history of everything that has ever gone wrong with device Y would be nice.
- I want more info on my page-receiving device than just "HOST X IS DOWN". I want context so I can decide if I have to drop everything immediately.
Etcetera. These are some of the things that make sane large monitoring systems. I don't think any open source product has all of them, alas.
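
The first item in the list (one alert for router A instead of 50 alerts for the hosts behind it) can be sketched as a walk up a dependency map. This is a minimal illustration, not how any particular product does it; the node names, the `parent` mapping, and both function names are assumptions.

```python
from collections import defaultdict


def topmost_down(node, down, parent):
    """Walk upstream from node and return the highest down node on the path."""
    culprit = node
    seen = {node}
    upstream = parent.get(node)
    while upstream is not None and upstream not in seen:  # guard against cycles
        if upstream in down:
            culprit = upstream
        seen.add(upstream)
        upstream = parent.get(upstream)
    return culprit


def group_alerts(down, parent):
    """Return {root cause: [down nodes hidden behind it]}.

    down:   set of node names currently failing their checks
    parent: dict mapping each node to the node it is reached through
            (None or missing for top-level nodes)
    """
    grouped = defaultdict(list)
    for node in sorted(down):
        root = topmost_down(node, down, parent)
        affected = grouped[root]  # make sure every root cause gets an entry
        if root != node:
            affected.append(node)
    return dict(grouped)


# Hypothetical topology: two hosts behind router-a, one behind router-b.
parent = {
    "router-a": None,
    "router-b": None,
    "host-x": "router-a",
    "host-y": "router-a",
    "host-z": "router-b",
}
down = {"router-a", "host-x", "host-y", "host-z"}
alerts = group_alerts(down, parent)
```

Here `alerts` holds two entries rather than four: `router-a` with `host-x` and `host-y` listed as affected, and `host-z` on its own because `router-b` is still up. Only the keys would page someone; the lists give the context for the page.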

Cacti w/plugins (Score:3, Interesting)

by Linegod ( 9952 ) writes: <pasnak@nOSpAm.warpedsystems.sk.ca> on Wednesday July 08, 2009 @06:47PM (#28629499) Homepage Journal

I use Cacti, with THold [cacti.net] and weathermap [cacti.net] plugins.
But then I'm biased.

Monitoring on scale (Score:3, Interesting)

by C_Kode ( 102755 ) writes: on Wednesday July 08, 2009 @07:23PM (#28629849) Journal

If you're monitoring something of that scale, you should probably look into a professional solution.
I use Zenoss (open source) and like it quite a bit. It takes time to customize for your setup, but unless you have a bland network, that is almost always the case. I will say this: it's much easier to set up than Nagios was a couple of years ago when I was using Nagios, though I've heard there has been some improvement.

Zabbix is easy to maintain and flexible (Score:4, Interesting)

by bigtrike ( 904535 ) writes: on Wednesday July 08, 2009 @08:31PM (#28630563)

Zabbix allows you to build some fairly powerful rulesets and chains of overrides using its web GUI. It's not perfect, but it keeps improving, and the attitude of the developers is friendly, unlike some of the other projects.

Re:Use the Tivoli architecture and rewrite it (Score:3, Interesting)

by Krneki ( 1192201 ) writes: on Wednesday July 08, 2009 @08:39PM (#28630651)

I don't believe in agents. I refuse to install anything not needed on the server. SNMP should be enough for all the information; unfortunately, this is not the case. So I use WMI and NetBIOS querying.

Re:Zenoss (Score:5, Interesting)

by Ranger Rick ( 197 ) * writes: <slashdot@racc o o n f i n k .com> on Wednesday July 08, 2009 @08:46PM (#28630713) Homepage

And this is why we (OpenNMS) don't play the per-node pricing game. It's not any harder to run OpenNMS when managing 1000 nodes than when managing 100; you only need to scale hardware appropriately. Per-node pricing is an artificial limitation.
We also don't play the "you get a special price behind closed doors" game: our support prices are public, fair, and the same for everyone [opennms.com] -- and that's only if you need commercial support -- our prices are $0 if you don't need or want support.
If you do the math, it's $0 for the software, plus $14,995/year for support for any number of nodes, and the software is 100% open-source and fully capable of replacing or exceeding OpenView [opennms.org]. ;)

Re:I Name My Devices After Al Qaeda Members (Score:2, Interesting)

by Hurricane78 ( 562437 ) writes: <deleted @ s l a s h dot.org> on Thursday July 09, 2009 @05:10AM (#28633683)

I get "Insightful" for THIS, but my other much more important comments get nothing or even "Troll"??
Slashdot got seriously weird / fucked up lately...
People, please learn that even if you completely disagree, it still is no troll, but very insightful. Because without it you would not be able to disagree with it and come up with your own point of view in the first place! And sometimes making you disagree is just the point of a good argument!
