What Would You Want In a Large-Scale Monitoring System?
Krneki writes "I've been developing monitoring solutions for the last five years. I have used Cacti, Nagios, WhatsUP, PRTG, OpManager, MOM, Perl-script solutions, ... Today I have changed employer and I have been asked to develop a new monitoring solution from scratch (5,000 devices). My objective is to deliver a solution that will cover network devices, servers, and applications. The final product must be very easy to understand, as it will also be used by the help desk to diagnose problems during the night. I need a powerful tool that will cover all I need and yet deliver a nice 2D map of the company IT infrastructure. I like Cacti, but usually I use it only for performance monitoring, since polling can't be set to a 5 or 10 sec interval for huge networks. I'm thinking about Nagios (but the 2D map is hard to understand), or maybe OpManager. What monitoring solution do you use and why?"
Before I get flamed... (Score:4, Interesting)
I am going through this right now and am using or have used all the above-mentioned solutions. We are leaning towards System Center Operations Manager. http://www.microsoft.com/systemcenter/operationsmanager/en/us/default.aspx If you had told me 6 months ago that it would be the way to go, I would have said over my dead body, but it has come a very long way in terms of usability and ease of setup.
Re:OpenNMS (Score:5, Interesting)
I use OpenNMS as well. I actually migrated off of Nagios to OpenNMS. Tried out Zenoss and Cacti as well. While any of these are better than OpenView IMHO, I liked OpenNMS's full suite of functionality without having to pay for the 'commercial' version.
Re:Zenoss (Score:2, Interesting)
A little tough to set up new SNMP devices, I thought, but overall a great product. Even the free version gets you quite far.
Nagios Might Work (Score:3, Interesting)
I spent last year converting a shop from OpenView to Nagios. They were in the same neighborhood as you (~5000 devices).
If you do not like the Nagios UI, you could create something else. The native Nagios UI is CGI based and implemented in C. The documentation is good and the sources are well commented.
The hardest decision about Nagios is how to implement the monitoring. I went w/SNMP (polling, not traps) for the most part. Sorting out all the Nagios plugins is something of a chore and many of them seem incomplete and abandoned.
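For anyone sorting through plugins or writing their own: a Nagios check is just a program that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). Below is a minimal sketch of that convention in Python. The `poll_device` function is a hypothetical stand-in for a real SNMP query, and the host, label, and thresholds are made-up illustration values, not anything from the setup described above.

```python
import sys

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(value, warn, crit):
    """Map a polled value to a Nagios state using simple upper thresholds."""
    if value is None:
        return UNKNOWN, "UNKNOWN - no value returned"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value={value}"
    if value >= warn:
        return WARNING, f"WARNING - value={value}"
    return OK, f"OK - value={value}"


def poll_device(host):
    """Hypothetical placeholder: swap in a real SNMP GET for this host."""
    return 42.0


def main(argv):
    """Run one check and return the exit code Nagios expects."""
    host = argv[1] if len(argv) > 1 else "192.0.2.10"  # documentation address
    value = poll_device(host)
    state, message = evaluate(value, warn=80.0, crit=95.0)
    # Text after the pipe is optional performance data (label=value).
    print(f"{message} | value={value}")
    return state

# In a real plugin file, finish with: sys.exit(main(sys.argv))
```

Keeping the threshold logic in its own function makes it easy to test without touching the network, which helps when you inherit a pile of half-finished plugins.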
MRTG also integrates w/Nagios, which can be useful.
Good luck.
ZenOSS all the way (Score:5, Interesting)
Cons:
Roll your own... (Score:2, Interesting)
I built and managed a software system for my previous employer called the SMS (Server Management System). It basically tracked 50 of our web servers, database servers, and Endeca (full text search) farms at data centers spread around the country. It was pretty simple.
The system did push and pull operations. First, the system was built in PHP.
In order to push commands to the servers I used the PEAR SSH2 class for communication once it became stable. Another option (and what I did back in 2003) was to use exec and other command-line functions in PHP in conjunction with a SETUID script (written in C) -- which gave the command-line output from PHP "true" rootly powers. The problem was that I had to enter a password for each server I wanted to connect to, and the PHP functions couldn't handle real-time input/output. So I designed the system to work by creating an SSH2 key pair on my master monitoring server and putting its public key on each of our external servers for passwordless SSH.
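The key-based push described above can be sketched roughly like this. This is Python rather than the PHP the original system used, and the user name, key path, and host names are all hypothetical. The one real detail worth copying is OpenSSH's `BatchMode=yes` option, which makes ssh fail instead of stopping to ask for a password.

```python
import os
import subprocess


def build_ssh_command(host, remote_command, user="monitor",
                      key_path="~/.ssh/id_rsa"):
    """Build an ssh argument list that never prompts for a password."""
    return [
        "ssh",
        "-i", os.path.expanduser(key_path),  # private key on the master server
        "-o", "BatchMode=yes",               # fail instead of prompting
        "-o", "ConnectTimeout=10",
        f"{user}@{host}",
        remote_command,
    ]


def push(host, remote_command, **kwargs):
    """Run one command on one host; return (exit_code, stdout)."""
    result = subprocess.run(
        build_ssh_command(host, remote_command, **kwargs),
        capture_output=True,
        text=True,
        timeout=60,
    )
    return result.returncode, result.stdout
```

Something like `push("web01.example.com", "uptime")` would then run without any interactive input, provided the matching public key is already in that account's `authorized_keys` on the target.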
The pull part of the system simply had a PHP script running from cron on each server that would deliver information about the health of the server, its running processes, etc., to the main SMS server every 5 minutes. All load activity for all servers was logged to MySQL as well. The push operations were used to update those scripts, as well as to restart daemons on command, clear cache (such as after we did a database update), etc. It was a pretty robust system and really automated the functions of our company, to the point where we could perform a FULL database update to our 30 web servers simultaneously (using PHP and fork()), clear all cache, etc., in under an hour. We would then monitor the servers using the SMS's main screen, which showed real-time server stats (updated every 5 minutes, or you could "force" a push operation to get the status). If we needed to roll back the update, that was a simple mouse click away too.
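A per-server reporting script of the kind described above might look something like the sketch below. Again this is Python, not the original PHP, and it only collects load averages. The payload layout and the collector URL are assumptions for illustration, not details of the original SMS.

```python
import json
import os
import socket
import time
import urllib.request


def collect_health():
    """Gather a small health snapshot for this host (Unix only)."""
    load1, load5, load15 = os.getloadavg()
    return {
        "host": socket.gethostname(),
        "timestamp": int(time.time()),
        "load": {"1m": load1, "5m": load5, "15m": load15},
    }


def report(url, snapshot):
    """POST the snapshot as JSON to the central server; return HTTP status."""
    request = urllib.request.Request(
        url,
        data=json.dumps(snapshot).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status

# Hypothetical usage: report("https://sms.example.com/health", collect_health())
```

Scheduling it every 5 minutes is one crontab line along the lines of `*/5 * * * * /usr/bin/python3 /opt/sms/agent.py`, where the script path is made up.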
I also had a hidden screen that let me run any series of commands as root on any number of servers. Everyone objected to it but I convinced my boss to let me put it in. All of our servers were a mouse click away from being "rm -rf *" 'ed. ROFL. Anyway, I hope my little story about my system helps you out, in either avoiding what I did (LOL) or by giving you ideas.
Re:A more interesting question (Score:4, Interesting)
The big questions are:
Will your solution need to support SNMPv3?
Do the devices you talk to have published OIDs?
Do you need source code to extend it?
If yes to these, OpenNMS is a great bet.
I Hate War Rooms (Score:5, Interesting)
I really don't like the "War Room" video wall concept. I suspect such walls are made to look cool rather than to monitor.
What you want in large-scale monitoring is:
Etcetera. These are some of the things that make sane large monitoring systems. I don't think any open source product has all of them, alas.
Cacti w/plugins (Score:3, Interesting)
I use Cacti, with THold [cacti.net] and weathermap [cacti.net] plugins.
But then I'm biased.
Monitoring on scale (Score:3, Interesting)
If you're monitoring something of that scale, you should probably look into a professional solution.
I use Zenoss (open source) and like it quite a bit. It takes time to customize for your setup, but unless you have a bland network, that is almost always the case. I will say this: it's much easier to set up than Nagios was a couple of years ago when I was using Nagios, though I've heard there has been some improvement.
Zabbix is easy to maintain and flexible (Score:4, Interesting)
Zabbix allows you to build some fairly powerful rulesets and chains of overrides using its web GUI. It's not perfect, but it keeps improving, and the attitude of the developers is friendly, unlike some of the other projects.
Re:Use the Tivoli architecture and rewrite it (Score:3, Interesting)
Re:Zenoss (Score:5, Interesting)
And this is why we (OpenNMS) don't play the per-node pricing game. It's not any harder to run OpenNMS when managing 1000 nodes than when managing 100; you only need to scale hardware appropriately. Per-node pricing is an artificial limitation.
We also don't play the "you get a special price behind closed doors" game. Our support prices are public, fair, and the same for everyone [opennms.com] -- and that's only if you need commercial support -- our prices are $0 if you don't need or want support.
If you do the math, it's $0 for the software, plus $14,995/year for support for any number of nodes, and the software is 100% open-source and fully capable of replacing or exceeding OpenView [opennms.org]. ;)
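To make "do the math" concrete, here is a quick sketch of what a flat support fee works out to per node. The $14,995 figure is the one quoted above; the node counts are just examples, with 5,000 taken from the original question.

```python
SUPPORT_PER_YEAR = 14_995  # flat annual support price quoted in the comment


def cost_per_node(nodes):
    """Effective yearly support cost per node under flat pricing."""
    return SUPPORT_PER_YEAR / nodes


for nodes in (100, 1000, 5000):
    print(f"{nodes:>5} nodes: ${cost_per_node(nodes):,.2f} per node per year")
```

At the submitter's 5,000 devices that comes to roughly three dollars per node per year, and the per-node figure keeps dropping as the network grows.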
Re:I Name My Devices After Al Qaeda Members (Score:2, Interesting)
I get "Insightful" for THIS, but my other much more important comments get nothing or even "Troll"??
Slashdot has gotten seriously weird / fucked up lately...
People, please learn that even if you completely disagree, it still is no troll, but very insightful. Because without it you would not be able to disagree with it and come up with your own point of view in the first place! And sometimes making you disagree is just the point of a good argument!