
What Would You Want In a Large-Scale Monitoring System?

Krneki writes "I've been developing monitoring solutions for the last five years. I have used Cacti, Nagios, WhatsUp, PRTG, OpManager, MOM, Perl-script solutions, ... I have just changed employers and have been asked to develop a new monitoring solution from scratch (5,000 devices). My objective is to deliver a solution that covers network devices, servers, and applications. The final product must be very easy to understand, as it will also be used by the help desk to diagnose problems during the night. I need a powerful tool that covers everything I need and still delivers a clear 2D map of the company's IT infrastructure. I like Cacti, but I usually use it only for performance monitoring, since polling can't be set to a 5- or 10-second interval on huge networks. I'm thinking about Nagios (though its 2D map is hard to understand), or maybe OpManager. What monitoring solution do you use, and why?"
  • by drsmithy ( 35869 ) <drsmithy@nOSPAm.gmail.com> on Wednesday July 08, 2009 @05:35PM (#28628691)
    What limitations exist in current solutions that justify developing a new one from scratch?
  • by Anonymous Coward on Wednesday July 08, 2009 @05:40PM (#28628753)

    MRTG does it right... most of the others do it wrong.
    When rolling up a day's worth of data (averaging), you lose the peak information on most monitoring systems.
    So the 380 Mbps peak you had an hour ago shows up fine on today's graph.
    But tomorrow, when you look at "yesterday's" graph... the peak is down to 100 Mbps.
    And next week, when you look at "last week's" graph... there's a little 50 Mbps peak.

    Dammit... I want to keep information on my peaks for capacity planning!
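
    To keep that peak, the trick is in the RRD's consolidation functions: store a MAX archive alongside the AVERAGE archives so roll-ups don't flatten spikes. Below is a minimal sketch, assuming the python-rrdtool bindings are available; the data-source name, step, and archive sizes are illustrative, not anyone's production settings.

        import rrdtool  # python-rrdtool bindings, assumed installed

        # Hypothetical interface-traffic RRD: keep MAX consolidation alongside
        # AVERAGE so last week's 380 Mbps spike still shows on older graphs.
        rrdtool.create(
            "traffic.rrd",
            "--step", "300",                 # one sample every 5 minutes (illustrative)
            "DS:inoctets:COUNTER:600:0:U",   # interface input-octets counter
            "RRA:AVERAGE:0.5:1:2016",        # one week of raw 5-minute averages
            "RRA:AVERAGE:0.5:12:1488",       # ~2 months of hourly averages
            "RRA:MAX:0.5:12:1488",           # hourly MAX -- peaks survive the roll-up
            "RRA:MAX:0.5:288:366",           # a year of daily MAX for capacity planning
        )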

  • by Meshach ( 578918 ) on Wednesday July 08, 2009 @05:45PM (#28628817)

    What limitations exist in current solutions that justify developing a new one from scratch?

    Exactly! Too often people just jump in and redo everything without actually investigating what needs to be fixed. George Santayana's line, "Those who cannot learn from history are doomed to repeat it," seems very apropos here.

  • The mistake (Score:4, Insightful)

    by vlm ( 69642 ) on Wednesday July 08, 2009 @05:55PM (#28628949)

    The mistake is trying to monitor thousands of devices on a 2D map. It'll look pretty to the suits, but it will be useless for the users. Nothing but endless slow clicky clicky clicky.

    Give them a text screen of what's currently down... that'll work.

  • Re:OpenNMS (Score:5, Insightful)

    by Cato ( 8296 ) on Wednesday July 08, 2009 @05:57PM (#28628991)

    I've only tried OpenNMS. It looks very powerful, but it wasn't at all hard to get installed and configured on Ubuntu - it figures out the type of node it has discovered, shows useful data through SNMP, can also do uptime monitoring, and is generally very scalable and configurable if needed.

  • by duffbeer703 ( 177751 ) on Wednesday July 08, 2009 @06:37PM (#28629399)

    Focus on usability and rapid deployment rather than wide-ranging featuresets that sit on the shelf for a decade. Nearly all products in this space really, really suck.

  • OpenNMS (Score:3, Insightful)

    by ckaminski ( 82854 ) <slashdot-nospam.darthcoder@com> on Wednesday July 08, 2009 @06:59PM (#28629629) Homepage
    It was a step above Nagios in terms of reliability (I didn't have to restart the server four times a day just to keep it running), and it did much better at autodiscovery.

    The fact that it is also NRPE-compatible was a plus - I could use all the Nagios plugins and check scripts I'd written (a minimal example of such a check is sketched below this comment).

    I was also planning on using it to launch a more aggressive webmin-style management solution - since OpenNMS builds this great database of information about my devices and hosts, I could use it to do actual management - changing data and settings.

    Cons: it's a Java/Tomcat tool, if that really counts as a con. It's not like you need to run JBoss or WebSphere to use it (though I suppose you could).
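
    For readers unfamiliar with what "Nagios plugins and check scripts" look like, here is a minimal, hypothetical example of the kind of script NRPE runs. Only the conventions are standard (exit codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN, and one line of output with optional perfdata after a "|"); the check itself, the thresholds, and the filesystem are made-up examples.

        #!/usr/bin/env python3
        # Hypothetical disk-usage check in the Nagios plugin style.
        import shutil
        import sys

        WARN_PCT, CRIT_PCT = 80, 90                    # illustrative thresholds
        usage = shutil.disk_usage("/")                 # check the root filesystem
        used_pct = usage.used * 100 / usage.total
        perfdata = f"used={used_pct:.1f}%;{WARN_PCT};{CRIT_PCT}"

        if used_pct >= CRIT_PCT:
            print(f"CRITICAL - root filesystem {used_pct:.1f}% full | {perfdata}")
            sys.exit(2)
        elif used_pct >= WARN_PCT:
            print(f"WARNING - root filesystem {used_pct:.1f}% full | {perfdata}")
            sys.exit(1)
        print(f"OK - root filesystem {used_pct:.1f}% full | {perfdata}")
        sys.exit(0)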
  • by timeOday ( 582209 ) on Wednesday July 08, 2009 @07:07PM (#28629701)
    If only you had read more than the first sentence of TFSummary: "I like Cacti, but I usually use it only for performance monitoring, since polling can't be set to a 5- or 10-second interval on huge networks. I'm thinking about Nagios (though its 2D map is hard to understand), or maybe OpManager."

    Obviously by "from scratch" he means his company has nothing in place he has to build on; he is free to build a system on whatever tools he likes.

  • I just did a quick survey and evaluation of the open-source monitoring market for my company, and found shortcomings and frustrations in a few areas where none of the evaluated systems seems to get it 100% right.

    Transparent Planned Design
    Many solutions out there seem to have been developed in what can only be described as an "organic" process. I.e., a few scripts were used from the start, were hooked up with some other scripts, were slammed into a web interface, got some more features, then something central was ripped out and replaced to allow yet more features, and so on and so forth. (Read: Nagios.) While this is of course often the best way to get something working for a particular need, and on a tight budget, it makes adoption really hard unless you happen to have exactly the same need.

    Event management
    Does anyone know a solution that can both receive syslog messages and decode traps against a given MIB, and then do some simple logic, like squashing repeats, displaying events on a web page with archival options, and dispatching to mail/SMS based on configurable rules? (A rough sketch of that pipeline follows at the end of this comment.) Except for ZenOSS (and ZenOSS has other problems), I haven't found a single sensible system that does this.

    Modularity/Seamless Integration
    Since many of the monitoring systems out there don't seem to have a clear design, it's often very hard to add missing features. E.g., is project X missing an event manager, or is the built-in one not satisfactory? No problem, I'll just, ehh, where does this wire come from? Is this really a socket? Did anyone actually connect that? Black-box solutions are fine as long as they serve all my needs and have clear interfaces to combine with other solutions that serve related needs, but sadly no solution we evaluated does everything we need it to, and we end up struggling with manual routines to compensate.

    Complexity
    There are a few really neat systems that do almost everything one can ask for (short of flying cars). Unfortunately, the ones we've tried have always turned out to be very complex, and they also do a lot of things we didn't want. Since they're often not very modular, it's hard to get them to stop doing the things we don't want, or to change the things we need implemented slightly differently. Also, the huge codebase that comes along with trying to scratch everyone's itch seems to get its share of bugs, and troubleshooting in large, more or less opaque systems is not a fun task.

    The Perfect Monitoring System
    After evaluating all the options we could find, we've come to the conclusion that none of the systems we've looked at or tested really fits our needs (although ZenOSS came close, we encountered just too many bugs and oddities to keep investing time in it). Furthermore, we could not find a combination of systems that integrate well and together fit our needs, which I personally see as a bigger problem.

    What I would really want to see in the world of open source monitoring is an ecosystem of monitoring apps with an overarching design/architecture. Design a framework where the different entities and steps in the monitoring pipeline are clearly defined and interfaced with each other, but which still allows for differing implementations and integration with unforeseen needs. For example, at our shop we continuously analyze roughly 700 Mbit/s of streaming video for availability and quality. No one designing a monitoring system could foresee that as an appliance, but in The Perfect Monitoring System it should be clear to the average-skilled hacker how to integrate it.
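
    To make the event-management wish list above concrete, here is a rough, hypothetical sketch of the core logic being asked for: take already-parsed events (from syslog or decoded traps), squash repeats within a time window, and dispatch to mail or SMS based on simple configurable rules. This is not any existing project's code; real syslog/trap reception, MIB decoding, archival, and the web view are all left out.

        import time
        from dataclasses import dataclass, field

        @dataclass
        class Event:
            host: str
            severity: str        # e.g. "warning", "critical"
            message: str
            count: int = 1
            first_seen: float = field(default_factory=time.time)

        # Configurable dispatch rules: (predicate, action). Purely illustrative.
        RULES = [
            (lambda e: e.severity == "critical", "sms"),
            (lambda e: e.severity in ("warning", "critical"), "mail"),
        ]

        class EventManager:
            def __init__(self, repeat_window=300):
                self.repeat_window = repeat_window   # seconds to squash duplicates
                self.active = {}                     # (host, message) -> Event

            def receive(self, host, severity, message):
                """Called by whatever listens for syslog lines or decoded traps."""
                key = (host, message)
                ev = self.active.get(key)
                if ev and time.time() - ev.first_seen < self.repeat_window:
                    ev.count += 1                    # squash the repeat, just count it
                    return
                ev = Event(host, severity, message)
                self.active[key] = ev
                self.dispatch(ev)

            def dispatch(self, ev):
                for matches, action in RULES:
                    if matches(ev):
                        # Stand-in for a real mail or SMS gateway.
                        print(f"[{action}] {ev.host}: {ev.severity}: {ev.message}")

        mgr = EventManager()
        mgr.receive("core-sw1", "critical", "link down on Gi0/1")
        mgr.receive("core-sw1", "critical", "link down on Gi0/1")   # squashed as a repeat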

  • Re:Zenoss (Score:2, Insightful)

    by hotfireball ( 948064 ) on Wednesday July 08, 2009 @09:44PM (#28631183)

    Zenoss also shipped terribly broken RPM packaging. I don't know how it is now, but exactly one year ago it was that bad. For example, it could simply wipe out files in /etc where your setup already lived, without any warning or notice. So it is always better to set it up from source.

    I've also looked at Zabbix and got the impression that it is sort of like a bicycle with square wheels. The same goes for GroundWork (a repackaged Nagios) and Nagios itself. The only thing I find really worth paying attention to is OpenNMS. It has its own gotchas too, but it's better than the others.

  • Re:OpenNMS (Score:2, Insightful)

    by mysidia ( 191772 ) on Thursday July 09, 2009 @12:48AM (#28632437)

    MRTG is kind of limited. I would suggest using Cacti instead; it provides useful features like graph zooming and fine-grained RRD configuration, such as the number of DSes and the steps and rows kept in an RRD.

    I would try installing the OpenNMS software yourself, and not rely on that clearly broken demo. The breakage you are seeing in the demo is not at all representative of what the software is normally like.

    OpenNMS is not bug-free, but none of the viable competitors are anywhere close to bug-free either. When it is set up properly - JVM/memory settings tuned, the PostgreSQL database configured per PostgreSQL best practices, and physical resources suitable to the load (e.g. about 4 GB of RAM, a 2x2 GHz CPU, and 80 GB of disk space to monitor 3,000 interfaces, polling each service every 30 seconds) - OpenNMS runs like a champ.

    It's pretty darn stable. And vastly outperforms other Open Source solutions.

    I've hardly ever seen exception conditions like that one, and when I have, it was generally just the web UI having problems.
