
What Would You Want In a Large-Scale Monitoring System?

Krneki writes "I've been developing monitoring solutions for the last five years. I have used Cacti, Nagios, WhatsUP, PRTG, OpManager, MOM, Perl-script solutions, ... Today I have changed employer and I have been asked to develop a new monitoring solution from scratch (5,000 devices). My objective is to deliver a solution that will cover network devices, servers and applications. The final product must be very easy to understand, as it will also be used by the help desk to diagnose problems during the night. I need a powerful tool that will cover all I need and yet deliver a nice 2D map of the company IT infrastructure. I like Cacti, but usually I use it only for performance monitoring, since polling can't be set to 5 or 10 sec interval for huge networks. I'm thinking about Nagios (but the 2D map is hard to understand), or maybe OpManager. What monitoring solution do you use and why?"

  • Publish them in DNS, and have the NSA monitor them for me!

  • Hyperic HQ (Score:2, Informative)

    by Anonymous Coward

    Hyperic HQ may be worth checking out.

  • Don't assume that you can successfully diagnose the problem based on your understanding of the indicators. You don't know my institutional context. Instead, give me a decision support system that I can use by adding rules that key off the monitored indicators and inject some of our own expertise into the diagnostic process.
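The kind of rule layer asked for above could look roughly like the following. This is a minimal sketch, not taken from any product in this thread: the `diagnose` helper, the indicator names, and the thresholds are all hypothetical. Each rule pairs a condition on the monitored indicators with a piece of site-specific advice.

```python
def diagnose(indicators, rules):
    """Return the advice of every site-specific rule whose condition matches."""
    return [advice for condition, advice in rules if condition(indicators)]

# Hypothetical rules that encode local knowledge; names and thresholds are made up.
rules = [
    (lambda m: m["cpu_pct"] > 90 and m["backup_running"],
     "High CPU during the backup window: expected, no action."),
    (lambda m: m["cpu_pct"] > 90 and not m["backup_running"],
     "High CPU outside the backup window: page the on-call admin."),
    (lambda m: m["disk_free_gb"] < 5,
     "Disk nearly full: rotate logs."),
]

indicators = {"cpu_pct": 95, "backup_running": True, "disk_free_gb": 3}
for advice in diagnose(indicators, rules):
    print(advice)
```

The point of the design is that the monitoring system only supplies the indicators; the interpretation stays in rules that the local admins own and can extend.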

  • OpenNMS (Score:5, Informative)

    by Anonymous Coward on Wednesday July 08, 2009 @05:34PM (#28628669)

    That's all you should need. For 5000 devices I don't know that any of the options you listed would be appropriate.

    OpenNMS is much more than monitoring, but I think that you'll appreciate the other features as well.


    • Mod parent up, OpenNMS rules.

    • Re:OpenNMS (Score:5, Interesting)

      by mu51c10rd ( 187182 ) on Wednesday July 08, 2009 @05:49PM (#28628863)

      I use OpenNMS as well. I actually migrated off of Nagios to OpenNMS. Tried out Zenoss and Cacti as well. While any of these are better than OpenView IMHO, I liked OpenNMS's full suite of functionality without having to pay for the 'commercial' version.

      • Re:OpenNMS (Score:5, Insightful)

        by Cato ( 8296 ) on Wednesday July 08, 2009 @05:57PM (#28628991)

        I've only tried OpenNMS. It looks very powerful, but wasn't at all hard to get installed and configured on Ubuntu - it figures out the type of node it has discovered and shows useful data through SNMP, and can also do uptime monitoring, and is generally very scalable and configurable if needed.

    • by abigor ( 540274 )

      I'll second this.

    • Re: (Score:3, Informative)

      OpenNMS offers excellent training and support. My company flew Tarus down to Australia for a week where he implemented OpenNMS across our four sites in the first day, then spent four days with myself and a colleague training us on its features. The price, including business class flight from America (it's a lonnnng flight and we wanted him semi-conscious on arrival) was absolutely trivial when compared against just the licensing costs of a comparable proprietary product such as HP OpenView. Highly recomme
  • by drsmithy ( 35869 ) <drsmithy@gmai[ ]om ['l.c' in gap]> on Wednesday July 08, 2009 @05:35PM (#28628691)
    What limitations exist in current solutions that justify developing a new one from scratch?
    • by Meshach ( 578918 ) on Wednesday July 08, 2009 @05:45PM (#28628817)

      What limitations exist in current solutions that justify developing a new one from scratch?

      Exactly! Too often people just jump in and redo everything without actually investigating what needs to be fixed. The George Santayana quote, "Those who cannot learn from history are doomed to repeat it," seems very apropos here.

    • by glassware ( 195317 ) on Wednesday July 08, 2009 @05:46PM (#28628823) Homepage Journal

      He said he was asked to "develop a new solution" - which most likely means he gets to pick and choose what to implement, whether parts of it are custom developed or off the shelf. I would imagine a good solution would be a core product plus custom built extensions for the features he needs that the product doesn't implement itself.

      • Re: (Score:3, Informative)

        by Krneki ( 1192201 )
        Exactly, I need a good core product that I'll evolve over time.
        • by drsmithy ( 35869 )

          Exactly, I need a good core product that I'll evolve over time.

          Why do you want to "evolve" it (by which I'm assuming you mean modify in depth rather than "configure") at all? What's missing that you need?

          System monitoring isn't exactly a fresh and new field. There are numerous well-established and quite comprehensive products already out there.

          • by Krneki ( 1192201 )
            Mostly it's SNMP and WMI hacking.

            Not all vendors respect open standards, so you have to guess how to get the info from a UPS, printer, ...
            • by abigor ( 540274 ) on Wednesday July 08, 2009 @06:26PM (#28629275)

              The big questions are:

              Will your solution need to support snmp v3?

              Do the devices you talk to have published oids?

              Do you need source code to extend it?

              If yes to these, OpenNms is a great bet.

              • by Krneki ( 1192201 )
                Will your solution need to support snmp v3?
                No, monitoring is inside the intranet, so there is no need for SNMP v3; besides, the devices are configured for read-only access from 1 IP only.

                Do the devices you talk to have published oids?
                Some do, some don't.

                I'll check OpenNms
      • by drsmithy ( 35869 )

        He said he was asked to "develop a new solution" [...]

        From TFSummary:

        Today I have changed employer and I have been asked to develop a new monitoring solution from scratch [...]

        Where I come from, "from scratch" doesn't mean "configure existing solutions to my needs".

        • Re: (Score:3, Funny)

          by ArsonSmith ( 13997 )

          Yea, from scratch, first I'd develop the tools needed to mine the raw materials of silicon, iron, and other needed elements. Then I'd refine them and produce the needed components for memory and processors and storage, as well as develop the new networking, power, form factor etc... Then start working on the boot code and a core kernel, hmm should it be micro/macro or hybrid...? Then I'd start working on interface tools or user space or something along those lines. Once I got this part done I'd start

        • Re: (Score:3, Insightful)

          by timeOday ( 582209 )
          If only you read more than the first sentence of TFSummary: "I like Cacti, but usually I use it only for performance monitoring, since polling can't be set to 5 or 10 sec interval for huge networks. I'm thinking about Nagios (but the 2D map is hard to understand), or maybe OpManager."

          Obviously by "from scratch" he means his company has nothing in place he has to build on; he is free to build a system on whatever tools he likes.

    • by dave562 ( 969951 )
      The article mentions that he is starting a job at a new employer. The systems that he listed are systems that he has experience with. It seems to me that he's open to the possibility that, despite having had experience with numerous systems, there might be a better way to do things than he has done them in the past.
    • He's just asking what to use

  • by jwilki1 ( 463599 ) on Wednesday July 08, 2009 @05:37PM (#28628701)

    I am going through this right now and am using, or have used, all of the above-mentioned solutions. We are leaning towards System Center Operations Manager. If you had told me 6 months ago that it would be the way to go, I would have said over my dead body, but it has come a very long way in terms of usability and ease of setup.

    • by jmulvey ( 233344 )

      I foresee myself in your shoes in the next few months. Application-level awareness of our key Microsoft applications (Exchange, MOSS, AD, etc.) is very high on our need list, and so SCOM is a natural best-of-breed pick. However, I *really* want a single integrated solution that also covers our unix/linux systems. Is unix/linux monitoring part of your requirements? If so, could you briefly describe the capabilities (and requirements on the monitored systems) that SCOM currently has in this regard?

      • Re: (Score:2, Informative)

        by Anonymous Coward

        SCOM R2 integrates native Unix and Linux agents. Supported systems are:

        HP-UX 11i v2 and v3 (PA-RISC and IA64)
        Sun Solaris 8 and 9 (SPARC) and Solaris 10 (SPARC and x86)
        Red Hat Enterprise Linux 4 (x86/x64) and 5 (x86/x64) Server
        Novell SUSE Linux Enterprise Server 9 (x86) and 10 SP1 (x86/x64)
        IBM AIX v5.3 and v6.1

        For application awareness, you can check the Bridgeways management packs.

      • by blincoln ( 592401 ) on Wednesday July 08, 2009 @07:40PM (#28629995) Homepage Journal

        As an aside, SCOM is a good product, but be sure you have (and are willing to invest) the time to configure it to match your environment. Just because it's also made by MS and has management packs for all of their products doesn't mean you can just flip the on switch and have everything monitored. You will almost certainly be flooded with useless alerts, and not alerted for things that you do care about.

    • by Krneki ( 1192201 ) on Wednesday July 08, 2009 @08:26PM (#28630519)
      I saw SCOM 1 year ago, and the hardware requirement for just the client was higher than that of the whole Cacti server.

      Unless they start to optimize the mess I'm not sure I want to use it.
  • Zabbix (Score:5, Informative)

    by ender- ( 42944 ) on Wednesday July 08, 2009 @05:37PM (#28628715) Homepage Journal
    You can also look into Zabbix. It's open source, and has Enterprise support available. I haven't used it yet, but as soon as I have a spare moment to breathe I intend to test it out for use in my environment.
    • Re:Zabbix (Score:5, Informative)

      by TooMuchToDo ( 882796 ) on Wednesday July 08, 2009 @05:44PM (#28628793)
      We use Zabbix in a production environment with 2500+ servers and tens of thousands of monitored items. The database will get big (currently at 150GB) but everything works like a champ, monitored at 1min intervals.
      • You manage 2500 servers but a 150GB database is "big"? *confused*
      • Which revision?

        I tried it for a couple of months, and rather liked it, but it'd simply stop monitoring stuff, triggers wouldn't fire reliably, etc.

        • 1.6

          Haven't had any monitoring problems, we interface to a master Remedy system for stuff due to ITIL, and we have triggers that run scripts that try to fix mundane problems (which usually works).

    • by bigtrike ( 904535 ) on Wednesday July 08, 2009 @08:31PM (#28630563)

      Zabbix allows you to build some fairly powerful rulesets and chains of overrides using its web gui. It's not perfect, but it keeps improving and the attitude of the developers is friendly unlike some of the other projects.

  • GKrellM (Score:5, Funny)

    by Areyoukiddingme ( 1289470 ) on Wednesday July 08, 2009 @05:38PM (#28628729)

    You can pry my GKrellM from my cold, dead hands!

    Yeah, for 5000 devices, the displays start to take up quite a bit of screen space, but that's what video walls are for!


    • Interesting! Do the mods not know what GKrellM is?
      • To be fair, one of the two mods knew it was +11 Funny... er I mean +1 Funny...

        One of today's articles has a thread deploring people who read the article, the summary, the title, or the posts before posting. So in my defense, I only read 4 words of the title and posted, so I saw Would Want Monitoring System. Naturally I thought of GKrellM.

        Now if I had read 4 different words, I'd have thought it was spam and deleted it. "What You Want Large"

    • by gmuslera ( 3436 )
      There are several desktop applets that show what happens in more "serious" (or at least massive) monitoring solutions. Nagstamon shows Nagios alarms (and lets you ssh/vnc or even see Nagios reports on problematic hosts right there), and ZApplet shows Zenoss alarms/warnings too.
      • So what you're saying is, GKrellM needs plugins for Zenoss and Nagios and whatever else?

        Damnit, now my +1 Funny is starting to sound practical. Quick, somebody add another +1 Funny! This has to be stopped!

  • by Anonymous Coward on Wednesday July 08, 2009 @05:40PM (#28628753)

    MRTG does it right... most of the others do it wrong.
    When rolling up a day's worth of data (averaging), you lose the peak information on most monitoring systems.
    So your 380Mbps peak that you had an hour ago is fine on today's graph.
    But tomorrow, when you look at "yesterday's" graph... the peak is down to 100Mbps,
    and next week, when you look at "last week's" graph... there's a little 50Mbps peak.

    Damnit... I want to keep information on my peaks for capacity planning!
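The effect described above is easy to reproduce. The following is a minimal sketch with made-up traffic numbers; the `rollup` helper is hypothetical and is not how MRTG or any other tool is implemented. It consolidates the same raw series once with an average and once with a maximum, and only the maximum keeps the burst visible.

```python
from statistics import mean

def rollup(samples, bucket_size, consolidate):
    """Collapse each run of bucket_size samples into one value."""
    return [
        consolidate(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]

# Hypothetical raw data: twelve 5-minute samples in Mbps, with one short burst.
raw_mbps = [40, 42, 38, 41, 380, 45, 39, 40, 43, 41, 40, 42]

hourly_avg = rollup(raw_mbps, 12, mean)  # the burst is smeared across the hour
hourly_max = rollup(raw_mbps, 12, max)   # the burst survives consolidation

print("average:", hourly_avg)
print("maximum:", hourly_max)
```

For capacity planning, this suggests storing a maximum series alongside the average one. RRDtool-based tools can do so by defining a MAX archive in addition to the AVERAGE archive.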

  • Zenoss (Score:5, Informative)

    by KerberosKing ( 801657 ) on Wednesday July 08, 2009 @05:40PM (#28628757)

    I was really impressed by Zenoss, which has all the slick features that cost the earth from vendors like HP for OpenView. You get automatic discovery, CMDB inventory, availability monitoring, alerting, and performance graphs all in a web portal.

    You get open source, commercial support, and a good community of users and plug-in developers. The best of both worlds IMHO.

    • Re: (Score:2, Interesting)

      A little tough to set up new SNMP devices, I thought, but overall a great product. Even the free version gets you quite far.

    • Re:Zenoss (Score:5, Informative)

      by rawler ( 1005089 ) <ulrik.mikaelsson@ g m a> on Wednesday July 08, 2009 @06:26PM (#28629283)

      ZenOSS may be great, but a word of warning. We've had 3 failed attempts at implementing it in our shop. What we tried to achieve was mainly host and service-monitoring, with some slight network-monitoring on the side. Nothing fancy, just some 20 hosts, maybe 30 network-devices, and a variety of services.

      One of the major parts we've found missing in most open-source solutions was proper event management (receiving syslog and SNMP traps, and applying some intelligence to them regarding flow control, dispatching, archival and so on). ZenOSS is, on paper and throughout the initial evaluation, one of the best open-source tools for this.

      However, during our three attempts to get it up and running, we always encountered some major obstacle (usually after a while of operation), forcing us to start all over from scratch. The problems were always in the same category: strange and unexplainable errors, often hard to reproduce, which in general resulted in a very flaky experience. Some of the problems were service checks showing both false positives and false negatives. In the last case, ZenOSS refused to import new SNMP MIBs, complaining about an IP address that could not be found anywhere in the config. Grepping ultimately found the IP only somewhere in the opaque Zope database, where it evidently could not easily be removed, nor could we even determine exactly what the IP address was for. (It was something auto-discovered in a remote network segment out of our control, but advertised through the routers.)

      So, while ZenOSS can do all kinds of things, and does a LOT of things really well, it's extremely complex and not on a solid foundation in all parts (for example, all network objects live in a non-accessible Zope database that the devs themselves recommend not touching, since doing so may upset things further). If you plan on implementing ZenOSS, I would not go without the support, which I assume is great, since there seem to be quite a few dark pits to fall into on your own.

      I don't know why we had so many obstacles and strange problems when others seem to have a smooth ride. Maybe one explanation is what was the final nail in the coffin for ZenOSS in our deployment. When I started asking around about these problems (and ZenOSS has a really helpful community, no problems there), I realised that many users claimed to have run into problems similar to ours, but their solution was to just keep daily backups and revert to a backup when they hit these problems. For us, the monitoring data is the basis for a lot of third-party agreements, and losing even days' worth of monitoring and logging is completely unacceptable for that reason. We do back up everything, but only for rare disasters, and we must be able to rely on the monitoring system giving us a clear view through those disasters.

    • by jon3k ( 691256 )
      Zenoss's commercial support prices are hilarious, I mean, literally, hilarious. The CHEAPEST support (silver) is $100 per managed host (including virtualized hosts); the most expensive (platinum) is $180 per node. So your 5,000 hosts would be $500,000-$900,000 per year in support.

      Yes. Seriously.

      The other problem I have with Zenoss is that the reporting is basically nonexistent. It may sound like I'm being hyper-critical, but it's only because I've looked at Zenoss and I so wanted it to be the NMS for me (
      • From the same page you linked:
        >** Volume pricing: Volume discounts, site licenses, and corporate-wide licensing are available for environments larger than 1,000 managed resources. Contact Zenoss Sales to learn more.

        So I call BS on the $500,000-$900,000/year number you gave.

      • You missed the volume and site licensing options available for networks of 1000 nodes+

        Their pricing seems in line with their competition. A quick search finds the following pricing for HP offerings:

        Network Node Manager, $6,000; OpenView Operation, $17,995; OpenView Internet Services, starts at $12,449 for 5 targets. Additional targets: 5, $2,038; 25, $10,207; 250, $65,160.

        I am not familiar with how their product line works, and I'm sure they also have volume licensing agreements for large customers, bu

        • by jon3k ( 691256 )
          Are we comparing software prices to annual support costs? Those are ANNUAL costs for Zenoss. Even a 50% discount on 5,000 nodes would be a quarter of a million dollars for their LOWEST level of support.

          Next, Zenoss's offering is not comparable to OpenView, sorry. HP OpenView is a massively more mature and robust product, and is really in a totally different league than Zenoss.
        • Re:Zenoss (Score:5, Interesting)

          by Ranger Rick ( 197 ) on Wednesday July 08, 2009 @08:46PM (#28630713) Homepage

          And this is why we (OpenNMS) don't play the per-node game. It's not any harder to run OpenNMS when managing 1000 nodes than when managing 100; you only need to scale hardware appropriately. Per-node pricing is an artificial limitation.

          We also don't play the "you get a special price behind closed doors" game. Our support prices are public, fair, and the same for everyone -- and that's only if you need commercial support -- our prices are $0 if you don't need or want support.

          If you do the math, it's $0 for the software, plus $14,995/year for support for any number of nodes, and the software is 100% open-source and fully capable of replacing or exceeding OpenView. ;)

  • []
    Not sure how far it scales but I have played with it on some small installations, very easy to manage.

    I have used Cacti but never felt it was mature or robust enough for very large environments.

    We are now deploying SCOM (System Center Operations Manager) for our enterprise. However, I would be afraid to manage it on my own, as it is a large system unto itself, yet very powerful.

  • by AFresh1 ( 1585149 ) on Wednesday July 08, 2009 @05:46PM (#28628827) Homepage

    I use Nagios and some custom rolled scripts myself.

    For some other options, Nagios has now been forked, so if that is "close" to what you want, you may want to contribute to Icinga.

    Reconnoiter also looked pretty kewl. They haven't released anything yet, but it looks like they are planning it to be very scalable.

  • That is what I would want! ^^

  • I've really been impressed with OpsView. Can't say how well it scales on huge networks (but there are options for having multiple servers). It's based on Nagios, but it's a lot less of a pain to configure and has a pretty good web interface. The only thing I don't really like is its graphing functionality. I use Cacti for monitoring bandwidth/server load/etc. But for availability checking OpsView does a fantastic job. I'm using it to monitor maybe twenty devices, including Linux and Windows servers, and
  • Nagios Might Work (Score:3, Interesting)

    by hax4bux ( 209237 ) on Wednesday July 08, 2009 @05:50PM (#28628881)

    I spent last year converting a shop from OpenView to Nagios. They were in the same neighborhood as you (~5000 devices).

    If you do not like the Nagios UI, you could create something else. The native Nagios UI is CGI based and implemented in C. The documentation is good and the sources are well commented.

    The hardest decision about Nagios is how to implement the monitoring. I went w/SNMP (polling, not traps) for the most part. Sorting out all the Nagios plugins is something of a chore and many of them seem incomplete and abandoned.

    MRTG also integrates w/Nagios, which can be useful.

    Good luck.

  • ZenOSS all the way (Score:5, Interesting)

    by Midnight Warrior ( 32619 ) on Wednesday July 08, 2009 @05:52PM (#28628913) Homepage
    We use ZenOSS exclusively at work and have enjoyed every minute of it. Pros include:
    • 2D map with status of all nodes or submaps, organized by network
    • Application monitoring, with more advanced maps available for purchase (Oracle, JBoss, Cisco) for those things you already paid a lot of money for
    • Performance monitoring via SNMP or other data sources using RRDtool internally which includes graphs linked to each other during zoom in/out or panning
    • Nagios plugins already do some of the heavy lifting
    • Built-in support for watching Windows servers (any metric accessible via WMI)
    • Access control using at least LDAP and Active Directory
    • Secondary data collectors for those networks which are too big for just one central source
    • Highly customizable through Python
    • It has so, so much more than pathetic commercial solutions like OpenView

    Cons include:
    • You have to keep your eye on the back end database
    • It still takes a long, long time to tune it to remove noise events
    • If you don't know Python, it can be tough in a few places
    • Proper support is not cheap
  • I want lots of buttons and dials! And flashing lights!

  • The mistake (Score:4, Insightful)

    by vlm ( 69642 ) on Wednesday July 08, 2009 @05:55PM (#28628949)

    The mistake is trying to monitor thousands of devices on a 2-D map. It'll look pretty to the suits, but be useless for the users. Nothing but endless slow clicky clicky clicky.

    Give them a text screen of what's currently down ... that'll work.

    • by Krneki ( 1192201 )
      No, a text screen doesn't give the help desk an idea of which zones are affected.

      I want the help desk to know what the problem is before our clients call us.
  • by Cato ( 8296 ) on Wednesday July 08, 2009 @05:59PM (#28629017)

    Here's a similar thread from a while back that covers most of the options: []

  • Roll your own... (Score:2, Interesting)

    by hofmny ( 1517499 )
    I have looked at Cacti, Nagios, and a few others, but I think rolling your own is easy enough and gives you the best flexibility. You could also use Nagios, or others, for example, and simply pull the results into your own system.

    I built and managed a software system for my previous employer called the SMS (Server Management System). It basically tracked 50 of our web servers, database servers, and Endeca (full text search) farms at data centers spread around the country. It was pretty simple.

    The syst
  • As one of the core devs and a large user, I can tell you it scales well, is easy to develop for, and has a lot prefabricated. The system does everything you're asking for. Let me know if you need help or paid support.

  • I'm a software developer and, sadly, my knowledge of hardware systems isn't always what it should be. When I write an application to run on a server and it starts to get slow, I want to know where the bottleneck is. Is my application CPU-bound? I/O-bound? Memory-bound? Do I need more memory? Faster storage? More cores or faster processor speed? Is it the network that's causing the problem? I can usually figure this out using various linux command-line programs like netstat and top and all that, but

  • What I'd like to see is good monitoring support for JMX-capable Java services.

    It'd be nice to set up an alarm based on time spent in garbage collector in a JVM running our application, for example.

    • Re: (Score:2, Informative)

      by glsiii ( 247900 )

      You certainly want to check ZenOSS out then. Our instance monitors JVMs across our deployment for everything from heap size to open file descriptors to uptime for the jvm specifically-- all of which can be alerted on if desired.

  • I think Nagios should provide a good start. They've recently added a lot of scalability features. Though it has a steep learning curve and all of its configuration is done in text, I've always found it worth the time and effort. I currently use it to monitor services on a couple hundred machines.

    Munin is a bit simpler, but I like the graphs it provides which occasionally are more useful than the data Nagios provides. In some cases, Nagios might tell me that a server went down, but I'd look at Munin and

  • It's *not* open-source, but it IS inexpensive. When I worked at a NOC, we used it to monitor hundreds of routers, switches, mainframes, Tandem systems, UNIX boxes, etc. It takes SNMP traps and displays them graphically on a 2D map, and the 2D map is very nicely implemented. You can have your top level view made up only of groups of devices, so if a group goes red you double-click that group to view its members and see which device actually has the error. IIRC, you can nest groups, so it ends up being a

  • I Hate War Rooms (Score:5, Interesting)

    by afabbro ( 33948 ) on Wednesday July 08, 2009 @06:33PM (#28629355) Homepage

    I really don't like the "War Room" video wall concept. I suspect such walls are made to look cool rather than to monitor.

    What you want in large-scale monitoring is:

    • The ability to map complex relationships. I don't want 50 alerts that I can't reach host X, host Y, etc. I want one alert that I can't reach router A. Even better, I want to map things so that I can say "end user application XYZ is not accessible in Kansas due to X being down".
    • I want my monitoring solution to understand HA and service degradation. I want programmable rules about what happens when X is down or Y is down.
    • I want many options for escalation. If X doesn't acknowledge, try Y after 15 mins, etc.
    • I don't ever, ever want a pager to explode or be flooded. A problem should be noticed once and tracked. There should be no pager blizzards.
    • Of course, I don't want this thing relying on my mail system for paging because, of course, my mail system could go down. An ability to dial out if the mail system is down would be nice.
    • I want agents, hooks, interfaces, third-party add-ons, and every possible way of tying something into the monitoring system. I don't want dumb limitations like "you can only get an exit code from the OS and it acts on that" or something. For big monitoring, it's almost mandatory that some kind of API for agents is exposed.
    • I want "I'm working on it, stop paging" blackouts. I want to be reminded to lift them.
    • I want it to tie into my change-management system. If I open a ticket and say that server X is down for 2 hours on this date, I don't want to have to remember to black it out.
    • I want reports. I don't care about silly little charts and graphs, but a history of everything that has ever gone wrong with device Y would be nice.
    • I want more info on my page-receiving device than just "HOST X IS DOWN". I want context so I can decide if I have to drop everything immediately.

    Etcetera. These are some of the things that make sane large monitoring systems. I don't think any open source product has all of them, alas.
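The first wish on the list (one alert for router A instead of 50 alerts for the hosts behind it) can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of Nagios, OpenNMS, or any other tool: the topology, the node names, and the `root_causes` helper are hypothetical, and each node is assumed to have at most one parent.

```python
def root_causes(parents, unreachable):
    """Return the unreachable nodes whose parent is still reachable.

    parents maps each node to the node it is reached through (None for
    top-level nodes). Only the returned nodes should generate an alert;
    everything behind them is a consequence, not a cause.
    """
    return sorted(n for n in unreachable if parents.get(n) not in unreachable)

# Hypothetical topology: two routers with hosts behind them.
parents = {
    "router-a": None,
    "router-b": None,
    "host-x": "router-a",
    "host-y": "router-a",
    "host-z": "router-b",
}

unreachable = {"router-a", "host-x", "host-y"}
print(root_causes(parents, unreachable))  # only router-a needs an alert
```

A real system would also need the escalation, blackout, and change-management hooks from the list above; this only shows the dependency pruning.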

    • Hell yeah. Build it, I'll come.
    • Re: (Score:3, Informative)

      by rossz ( 67331 )

      The ability to map complex relationships. I don't want 50 alerts that I can't reach host X, host Y, etc. I want one alert that I can't reach router A. Even better, I want to map things so that I can say "end user application XYZ is not accessible in Kansas due to X being down".

      When you have your parent/child relationships and your dependencies set up properly, Nagios does this very well. A properly configured Nagios system will alert you only for that switch that died, not for the 200 services behind tha

    • Re: (Score:3, Informative)

      "What you want in large-scale monitoring is:"

      Let's see:

      "The ability to map complex relationships. I don't want 50 alerts that I can't reach host X, host Y, etc. I want one alert that I can't reach router A."

      Nagios does this. I know, I configured mine that way.

      "Even better, I want to map things so that I can say "end user application XYZ is not accessible in Kansas due to X being down"."

      Nagios can do that, though I never deployed it that way.

      "I want my monitoring solution to understand HA and service degredati

  • by duffbeer703 ( 177751 ) on Wednesday July 08, 2009 @06:37PM (#28629399)

    Focus on usability and rapid deployment rather than wide-ranging featuresets that sit on the shelf for a decade. Nearly all products in this space really, really suck.

  • Big fan of InterMapper. It can use Nagios plugins as well.
    It's fairly cheap. We monitor about 1250 devices with it at the moment, and polling can be set all the way down to 5 seconds.
    Server and Client are both in Java... so more or less it runs on any platform.

    They give out 30 day demo keys.

  • Cacti w/plugins (Score:3, Interesting)

    by Linegod ( 9952 ) on Wednesday July 08, 2009 @06:47PM (#28629499) Homepage Journal

    I use Cacti, with the THold and weathermap plugins.

    But then I'm biased.

  • Then I advise you to have all the connection checks done by a machine dedicated to this task.
    Run the statistical checks on each of the 5000 machines themselves, 24 hours a day, pushing/pulling data at intervals to that dedicated machine.
    I advise 2 or more measuring machines, each on a separate network, each doing the same task and synchronizing their data for redundancy.

  • You can always try INM.

    It's quite feature rich and it's worth a look.

  • OpenNMS (Score:3, Insightful)

    by ckaminski ( 82854 ) <(moc.redochtrad) (ta) (mapson-todhsals)> on Wednesday July 08, 2009 @06:59PM (#28629629) Homepage
    Was a step above Nagios in terms of reliability (I didn't have to restart the server four times a day just to keep it running), and did much better at autodiscovery.

    The fact that it is also NRPE compatible was a plus - I could use all the Nagios plugins and check scripts I'd written.

    I was also planning on using it to launch a more aggressive webmin-style management solution - since OpenNMS built this great database of data about my devices and hosts, I could use it to do actual management - change data/settings.

    Cons: It's a Java/Tomcat tool, as much as that is really a con. It's not like you need to run Jboss or Websphere to use it (though I suppose you could).
  • I just did a quick survey and evaluation of the open source monitoring market for my company, and found shortcomings/frustrations in a few aspects where none of the evaluated systems seems to get it 100% right.

    Transparent Planned Design
    Many solutions out there seem to have been developed in what can only be described as an "organic" process, i.e. a few scripts were used at the start, were hooked up with some other scripts, were slammed into a web interface, got some more features, then something central was ripped out and replaced to allow yet more features, and so on and so forth. (Read: Nagios.) While this is of course often the best way to get something working for a particular need, and on a tight budget, it makes adoption really hard unless you happen to have exactly the same need.

    Event management
    Does anyone know a solution that can both receive from syslog and decode traps with a given MIB, and then do some simple logic, like squashing repeats, displaying on a web page with archival options, and dispatching to mail/SMS based on configurable rules? Except for ZenOSS (and ZenOSS has other problems), I haven't found a single sensible system that does this.
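
    To make the "simple logic" concrete, here is a small Python sketch of squashing repeats and dispatching by rule. It leaves out the syslog and trap receivers entirely and assumes events have already been decoded into a source, a severity and a message. Event, Rule and EventManager are hypothetical names, and the destinations are just labels such as "mail" or "sms".

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Event:
    source: str     # e.g. a hostname
    severity: str   # e.g. "info", "warning", "critical"
    message: str


@dataclass
class Rule:
    matches: Callable[[Event], bool]
    destination: str  # hypothetical channel label, e.g. "mail" or "sms"


@dataclass
class EventManager:
    rules: List[Rule]
    window: float = 300.0  # seconds during which identical events are squashed
    last_sent: Dict[Tuple[str, str, str], float] = field(default_factory=dict)
    squashed: Dict[Tuple[str, str, str], int] = field(default_factory=dict)

    def handle(self, event: Event, now: Optional[float] = None) -> List[str]:
        """Return the destinations to notify; empty if the event is a repeat."""
        now = time.time() if now is None else now
        key = (event.source, event.severity, event.message)
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window:
            # Identical event inside the window: count it, send nothing.
            self.squashed[key] = self.squashed.get(key, 0) + 1
            return []
        # First occurrence, or the window has expired: notify again.
        self.last_sent[key] = now
        return [rule.destination for rule in self.rules if rule.matches(event)]
```

    The window is measured from the last notification that was actually sent, so a persistent problem is re-announced once per window instead of being silenced forever.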

    Modularity/Seamless Integration
    Since many of the monitoring systems out there don't seem to have a clear design, it's often very hard to add missing features. E.g. project X is missing an event manager, or the built-in one is not satisfactory? No probs, I'll just, ehh, where does this wire come from? Is this really a socket? Did anyone really connect that? Black-box solutions are OK as long as they serve all my needs and have clear interfaces to combine with other solutions that serve related needs, but sadly no solution we evaluated does everything we need it to, and we end up struggling with manual routines to compensate.

    There are a few really neat systems that do almost everything one can ask for (short of flying cars). Unfortunately, the ones we've tried have always turned out to be very complex, and also do a lot of things we didn't want. Since they're often not very modular, it's hard to get them to stop doing the things we don't want, or to change the things we need implemented slightly differently. Also, the huge codebase that comes along with trying to scratch everyone's itch seems to get its share of bugs, and troubleshooting in large, more or less opaque systems is not a fun task.

    The Perfect Monitoring System
    After evaluating all options we could find, we've come to the conclusion that none of the systems we've looked at or tested really fits our needs (Although ZenOSS came close, we encountered just too many bugs and oddities to keep investing time in it). Furthermore, we could not find a combination of systems that integrates well, and together fits our needs, which I personally see as a bigger problem.

    What I would really want to see in the world of Open Source Monitoring is an ecosystem of monitoring apps with an overarching design/architecture. Design a framework where the different entities and steps in the monitoring are clearly defined and interfaced with each other, but which still allows for differing implementations and integration with unforeseen needs. For example, at our shop, we continuously analyze roughly 700mbit of streaming video for availability and quality. No one designing a monitoring system could probably foresee this as an appliance, but in The Perfect Monitoring System, it should be clear to the average-skilled hacker how to integrate it.
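
    Roughly the kind of thing I mean, as a Python sketch: two small interfaces (something that measures, something that consumes measurements) and a pipeline that only knows about the interfaces. Every name here is invented for illustration, and the video probe is a stand-in callable, not a real analyzer.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Tuple


class Check(ABC):
    """A clearly defined step: produce named measurements for one target."""

    name = "check"

    @abstractmethod
    def run(self, target: str) -> Dict[str, float]:
        ...


class Sink(ABC):
    """Anything that consumes measurements: storage, graphing, alerting."""

    @abstractmethod
    def accept(self, check_name: str, target: str, values: Dict[str, float]) -> None:
        ...


class Pipeline:
    """Glue that depends only on the two interfaces above."""

    def __init__(self, checks: List[Check], sinks: List[Sink]) -> None:
        self.checks = checks
        self.sinks = sinks

    def poll(self, target: str) -> None:
        for check in self.checks:
            values = check.run(target)
            for sink in self.sinks:
                sink.accept(check.name, target, values)


class VideoStreamCheck(Check):
    """An 'unforeseen' check; probe is a hypothetical callable returning Mbit/s."""

    name = "video_stream"

    def __init__(self, probe: Callable[[str], float]) -> None:
        self.probe = probe

    def run(self, target: str) -> Dict[str, float]:
        bitrate = self.probe(target)
        return {"bitrate_mbit": bitrate, "available": 1.0 if bitrate > 0 else 0.0}


class ListSink(Sink):
    """Trivial sink that keeps everything in memory."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, str, Dict[str, float]]] = []

    def accept(self, check_name: str, target: str, values: Dict[str, float]) -> None:
        self.rows.append((check_name, target, values))
```

    Adding the video check touches nothing in Pipeline or in the existing sinks, which is the property I'm after.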

  • As you've discovered, the free systems will fall over and die once they're past a certain size. I've worked with Tivoli customers that have tens of thousands of servers, and for all of its problems, Tivoli is scalable.

    The way they do it is, obviously, divide and conquer. There are specific ways that they do it. I'm mixing up the architecture and terminology on purpose, because the Tivoli terminology will confuse you.

    * there are agents on every box that do the monitoring
    * they report to a region
    * those regi

    • Re: (Score:3, Interesting)

      by Krneki ( 1192201 )
      I don't believe in agents. I refuse to install anything not needed on the server. SNMP should be enough for all the information; unfortunately, this is not the case. So I use WMI and NetBIOS querying.
  • How about... (Score:5, Informative)

    by yacoob ( 69558 ) on Wednesday July 08, 2009 @07:20PM (#28629827) Homepage

    Needed features in random order:
    * Scalability - few k machines is minimum. This probably means smart, decentralized collection and aggregation of data.
    * Flexible whitebox monitoring - for given class of devices, I should be able to configure how to fetch this device's data (http, snmp, ssh+command, rpc, you-name-it) and how to interpret it ("read the status page there, get this and that value").
    * Flexible blackbox monitoring - for given class of devices, I should be able to configure a set of actions that should be performed on it (fetch a page, ssh into, ping) and how results of that action should be interpreted (ok/nok, time to complete, etc.).
    * Easy way to tag (source/machine/network segment) and aggregate (max/min/mean/stddev/%ile/sum) the monitoring data.
    * Some language to easily calculate derivative values from the data above.
    * Interface for defining graphs, using collected data.
    * ...and a system for annotating the above. Raw data is neat, annotated data is even better.
    * Alerting subsystem, which should allow for defining different destinations, together with escalation rules. And custom alerts - using the .
    * (nice to have) HTTP server with a simple HTML templating, to allow for easy creation of arbitrary dashboards.
    * (if you have the above) predefined templates for most of common things. Both detailed ("everything about device X") and general ("if the background of the page is green, you're fine! If it's not, here you'll find a concise list of what's broken").
    * hooks/libraries to use collected data "outside" of the system

    I realize that's a lot, but boy, such a system would be very useful and flexible.
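
    As an illustration of the tag-and-aggregate item in the list above, here is a small Python sketch. It assumes each sample is a (tags, value) pair where tags is a plain dictionary. The function names and the choice of the nearest-rank method for the percentile are mine.

```python
import math
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[Dict[str, str], float]  # (tags, measured value)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is expected to be in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def aggregate_by_tag(samples: Iterable[Sample], tag: str) -> Dict[str, Dict[str, float]]:
    """Group samples by the value of one tag and summarise each group."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for tags, value in samples:
        if tag in tags:  # samples without the tag are ignored
            groups[tags[tag]].append(value)
    return {
        key: {
            "min": min(vals),
            "max": max(vals),
            "mean": statistics.mean(vals),
            "stddev": statistics.pstdev(vals),  # population standard deviation
            "p95": percentile(vals, 95),
            "sum": sum(vals),
        }
        for key, vals in groups.items()
    }
```

    Calling aggregate_by_tag(samples, "segment") gives one summary per network segment; the same samples can be regrouped by "machine" or "source" without collecting anything again.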

  • Monitoring on scale (Score:3, Interesting)

    by C_Kode ( 102755 ) on Wednesday July 08, 2009 @07:23PM (#28629849) Journal

    If you're monitoring something of that scale, you should probably look into a professional solution.

    I use Zenoss (open source) and like it quite a bit. It takes time to customize for your setup, but unless you have a bland network, that is almost always the case. I will say this: it's much easier to set up than Nagios was a couple of years ago when I was using Nagios, though I've heard there has been some improvement.

  • by dlapine ( 131282 )

    If you want a monitor that can display useful information about thousands of nodes on a single display, try clumon. We use it for our 1000+ node clusters. The software was developed in-house but is available under the University of Illinois/NCSA Open Source License Copyright (noticeware). If you're just going to use this in-house, the license shouldn't be an issue.

    You can see a sample clumon display of a working cluster at NCSA Linux Cluster Monitor. The clumon page for that cluster shows you each the job s

  • I know I'm late to the party, but I haven't seen anyone bring this one up yet: Real-world alarm/notification capability (pager, buzzer, a machine that goes bing, something like that)

    My reasoning: I run a small IT business with various support contracts. I, and probably quite a few others, can't afford to pay someone to sit at a monitor and watch a screen (or a bunch of screens) whilst tied to a desk.

    Most of the monitoring solutions (Nagios, others) are capable of off-site notification, but it's the "last

    • Real-world alarm/notification capability (pager, buzzer, a machine that goes bing, something like that)

      ...sorry, I mentioned "pager" as a real world alarm, then panned the idea--I meant "pager" as in "Mr. Jones, please check the server logs..."

  • We are using [] for our monitoring. It is working pretty well, and we can use existing Nagios scripts with it.

  • by scarolan ( 644274 ) on Thursday July 09, 2009 @10:03AM (#28635893) Homepage

    We used OpManager in production for over a year. It has terrible Linux support. None of their built-in plugins worked properly for monitoring even basic parameters like disk space, free memory, CPU usage, etc. When we pointed this out to their support people, they said we should build our own plugins with SNMP OIDs. Not for the amount of money we paid for that steaming POS. We finally kicked OpManager to the curb about a month ago, and now have our entire environment, Windows and Linux servers alike, monitored with Nagios. Nagios scales well; we are currently watching several hundred hosts and about 3500 services.

    OpenNMS is also a good tool; its ability to map servers back to switch ports is extremely handy.
