
What Would You Want In a Large-Scale Monitoring System?

Krneki writes "I've been developing monitoring solutions for the last five years. I have used Cacti, Nagios, WhatsUP, PRTG, OpManager, MOM, Perl-script solutions, ... I have just changed employers and have been asked to develop a new monitoring solution from scratch (5,000 devices). My objective is to deliver a solution that covers network devices, servers, and applications. The final product must be very easy to understand, as it will also be used by the help desk to diagnose problems during the night. I need a powerful tool that covers all of this and still delivers a nice 2D map of the company's IT infrastructure. I like Cacti, but I usually use it only for performance monitoring, since polling can't be set to a 5- or 10-second interval on huge networks. I'm thinking about Nagios (but the 2D map is hard to understand), or maybe OpManager. What monitoring solution do you use, and why?"
  • Hyperic HQ (Score:2, Informative)

    by Anonymous Coward on Wednesday July 08, 2009 @05:31PM (#28628639)

    Hyperic HQ may be worth checking out.

  • OpenNMS (Score:5, Informative)

    by Anonymous Coward on Wednesday July 08, 2009 @05:34PM (#28628669)

    That's all you should need. For 5000 devices I don't know that any of the options you listed would be appropriate.

    OpenNMS is much more than monitoring, but I think that you'll appreciate the other features as well.

    http://www.opennms.org/

    Enjoy.

  • Zabbix (Score:5, Informative)

    by ender- ( 42944 ) on Wednesday July 08, 2009 @05:37PM (#28628715) Homepage Journal
    You can also look into Zabbix [zabbix.com]. It's open source, and has enterprise support available. I haven't used it yet, but as soon as I have a spare moment to breathe I intend to test it out for use in my environment.
  • Zenoss (Score:5, Informative)

    by KerberosKing ( 801657 ) on Wednesday July 08, 2009 @05:40PM (#28628757)

    I was really impressed by Zenoss [zenoss.com], which has all the slick features that cost the earth when bought from vendors like HP (OpenView). You get automatic discovery, CMDB inventory, availability monitoring, alerting, and performance graphs, all in a web portal.

    You get open source, commercial support, and a good community of users and plug-in developers. The best of both worlds IMHO.

  • Re:Zabbix (Score:5, Informative)

    by TooMuchToDo ( 882796 ) on Wednesday July 08, 2009 @05:44PM (#28628793)
    We use Zabbix in a production environment with 2500+ servers and tens of thousands of monitored items. The database will get big (currently at 150GB) but everything works like a champ, monitored at 1min intervals.
  • by glassware ( 195317 ) on Wednesday July 08, 2009 @05:46PM (#28628823) Homepage Journal

    He said he was asked to "develop a new solution" - which most likely means he gets to pick and choose what to implement, whether parts of it are custom developed or off the shelf. I would imagine a good solution would be a core product plus custom built extensions for the features he needs that the product doesn't implement itself.

  • by AFresh1 ( 1585149 ) <andrew+slashdot@ ... .com minus punct> on Wednesday July 08, 2009 @05:46PM (#28628827) Homepage

    I use Nagios and some custom rolled scripts myself.

    For some other options, Nagios has now been forked, so if that is "close" to what you want, you may want to contribute to Icinga [icinga.org].

    Reconnoiter [omniti.com] also looked pretty kewl; they haven't released anything yet, but it looks like they are planning for it to be very scalable.

  • by Krneki ( 1192201 ) on Wednesday July 08, 2009 @05:48PM (#28628853)
    Exactly, I need a good core product that I'll evolve over time.
  • by Cato ( 8296 ) on Wednesday July 08, 2009 @05:59PM (#28629017)

    Here's a similar thread from a while back that covers most of the options: http://linux.slashdot.org/article.pl?sid=07/03/05/1812247 [slashdot.org]

  • by Anonymous Coward on Wednesday July 08, 2009 @06:06PM (#28629091)

    SCOM R2 integrates a native Unix and Linux agent [microsoft.com]; supported systems are:

    HP-UX 11i v2 and v3 (PA-RISC and IA64)
    Sun Solaris 8 and 9 (SPARC) and Solaris 10 (SPARC and x86)
    Red Hat Enterprise Linux 4 (x86/x64) and 5 (x86/x64) Server
    Novell SUSE Linux Enterprise Server 9 (x86) and 10 SP1 (x86/x64)
    IBM AIX v5.3 and v6.1

    For application awareness, you can check bridgeways management packs [bridgeways.ca].

  • by Krneki ( 1192201 ) on Wednesday July 08, 2009 @06:23PM (#28629245)
    I use Cacti for this and it's fantastic.
  • Re:Zenoss (Score:5, Informative)

    by rawler ( 1005089 ) <ulrik.mikaelsson ... m ['gma' in gap]> on Wednesday July 08, 2009 @06:26PM (#28629283)

    ZenOSS may be great, but a word of warning: we've had three failed attempts at implementing it in our shop. What we tried to achieve was mainly host and service monitoring, with some light network monitoring on the side. Nothing fancy, just some 20 hosts, maybe 30 network devices, and a variety of services.

    One of the major parts we've found missing in most open-source solutions is proper event management (receiving syslog + SNMP traps and applying some intelligence to them regarding flow control, dispatching, archival and so on). On paper, and throughout the initial evaluation, ZenOSS is one of the best open source tools for this.

    However, during our three attempts to get it up and running, we always encountered some major obstacle (usually after a while of operation), forcing us to start all over from scratch. The problems were always in the same category: strange and unexplainable errors, often hard to reproduce, and in general a very flaky experience. Some of the problems were service checks showing both false positives and false negatives, and in the last case ZenOSS refused to import new SNMP MIBs, complaining about some IP address that could not be found anywhere in the config; grepping ultimately showed the IP to be present only somewhere in the opaque Zope database, where evidently it could not easily be removed, nor could we even find out exactly what the IP address was for. (It turned out to be something auto-discovered in a remote network segment out of our control, but advertised throughout the routers.)

    So, while ZenOSS can do all kinds of things, and does a LOT of things really well, it's extremely complex and not in all parts on a solid foundation (for example, all network objects live in a non-accessible Zope database that the devs themselves recommend not touching, since it may upset things further). If you plan on implementing ZenOSS, I would not go without the support, which I assume is great, since there seem to be quite a few dark pits to fall into on your own.

    I don't know why we hit so many obstacles and strange problems when others seem to have a smooth ride. Maybe one explanation is what became the final nail in the coffin for ZenOSS in our deployment. When I started asking around about these problems (and ZenOSS has a really helpful community, no problems there), I realised that many users claimed to have run into problems similar to ours, but their solution was to just keep daily backups and revert to a backup when they hit these problems. For us, the monitoring data is the basis for a lot of third-party agreements, and losing even a day's worth of monitoring and logging is completely unacceptable for that reason. We do back up everything, but only for rare disasters, and we must be able to rely on the monitoring system giving us a clear view through those disasters.

  • Re:JMX Support (Score:2, Informative)

    by glsiii ( 247900 ) on Wednesday July 08, 2009 @07:11PM (#28629737)

    You certainly want to check ZenOSS out then. Our instance monitors JVMs across our deployment for everything from heap size to open file descriptors to per-JVM uptime, all of which can be alerted on if desired.

  • Re:Zabbix (Score:5, Informative)

    by BlueBlade ( 123303 ) <mafortier&gmail,com> on Wednesday July 08, 2009 @07:18PM (#28629793)

    We're using Zabbix at work and I'm doing daily backups of the database with a simple mysqldump command. Since the tables are InnoDB and not MyISAM, you can use the --single-transaction switch. That way, it takes a virtual snapshot of the DB at the start of the backup process and the writes can still keep going (they still happen, but they aren't committed until the transaction finishes). Granted, our DB isn't that big (only 10GB), but it's been working fine and restore tests also seem to work fine.

    Here's the daily cron:

    mysqldump -u blah -pblah --single-transaction --opt --skip-lock-tables zabbix | gzip > /backup/zabbix_db.sql.gz

  • How about... (Score:5, Informative)

    by yacoob ( 69558 ) on Wednesday July 08, 2009 @07:20PM (#28629827) Homepage

    Needed features in random order:
    * Scalability - a few thousand machines is the minimum. This probably means smart, decentralized collection and aggregation of data.
    * Flexible whitebox monitoring - for a given class of devices, I should be able to configure how to fetch the device's data (http, snmp, ssh+command, rpc, you-name-it) and how to interpret it ("read the status page there, get this and that value").
    * Flexible blackbox monitoring - for a given class of devices, I should be able to configure a set of actions to perform on it (fetch a page, ssh into it, ping) and how the results of those actions should be interpreted (ok/nok, time to complete, etc.); see the sketch after this list.
    * An easy way to tag (source/machine/network segment) and aggregate (max/min/mean/stddev/%ile/sum) the monitoring data.
    * Some language to easily calculate derived values from the data above.
    * An interface for defining graphs using the collected data.
    * ...and a system for annotating the above. Raw data is neat, annotated data is even better.
    * An alerting subsystem, which should allow for defining different destinations, together with escalation rules. And custom alerts - using the ...
    * (nice to have) An HTTP server with simple HTML templating, to allow for easy creation of arbitrary dashboards.
    * (if you have the above) Predefined templates for the most common things. Both detailed ("everything about device X") and general ("if the background of the page is green, you're fine! If it's not, here's a concise list of what's broken").
    * Hooks/libraries to use the collected data outside of the system.

    I realize that's a lot, but boy, such a system would be very useful and flexible.
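    As a concrete illustration of the whitebox/blackbox plug-in idea in the list above, here is a minimal Python sketch. It is not taken from any of the tools discussed here; the class and function names (TcpConnectCheck, aggregate, the "segment:"/"service:" tag scheme) are made up for illustration.

        import socket, statistics, time

        class TcpConnectCheck:
            """Blackbox check: can we open a TCP connection, and how long does it take?"""
            def __init__(self, host, port, tags):
                self.host, self.port, self.tags = host, port, tags

            def run(self):
                start = time.time()
                try:
                    with socket.create_connection((self.host, self.port), timeout=5):
                        ok = True
                except OSError:
                    ok = False
                return {"ok": ok, "latency": time.time() - start, "tags": self.tags}

        def aggregate(results, tag):
            """Aggregate latencies of all successful checks carrying a given tag."""
            values = [r["latency"] for r in results if r["ok"] and tag in r["tags"]]
            return {"n": len(values),
                    "mean": statistics.mean(values) if values else None,
                    "max": max(values, default=None)}

        checks = [TcpConnectCheck("www1.example.com", 80, {"segment:dmz", "service:web"}),
                  TcpConnectCheck("www2.example.com", 80, {"segment:dmz", "service:web"})]
        results = [check.run() for check in checks]
        print(aggregate(results, "service:web"))

    A real system would load such check classes from per-device-class plug-ins and feed the tagged results into the aggregation, graphing and alerting layers described above.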

  • by blincoln ( 592401 ) on Wednesday July 08, 2009 @07:40PM (#28629995) Homepage Journal

    As an aside, SCOM is a good product, but be sure you have (and are willing to invest) the time to configure it to match your environment. Just because it's also made by MS and has management packs for all of their products doesn't mean you can just flip the on switch and have everything monitored. You will almost certainly be flooded with useless alerts, and not alerted for things that you do care about.

  • by Krneki ( 1192201 ) on Wednesday July 08, 2009 @08:26PM (#28630519)
    I saw SCOM a year ago; the hardware requirements for just the client were higher than for the whole Cacti server.

    Unless they start to optimize the mess, I'm not sure I want to use it.
  • Re:OpenNMS (Score:3, Informative)

    by Euan Buchanan ( 639454 ) on Wednesday July 08, 2009 @08:29PM (#28630545)
    OpenNMS offers excellent training and support. My company flew Tarus down to Australia for a week; he implemented OpenNMS across our four sites on the first day, then spent four days with me and a colleague training us on its features. The price, including a business class flight from America (it's a lonnnng flight and we wanted him semi-conscious on arrival), was absolutely trivial compared with just the licensing costs of a comparable proprietary product such as HP OpenView. Highly recommended.
  • Re:OpenNMS (Score:3, Informative)

    by Ranger Rick ( 197 ) * <slashdot@racc o o n f i n k .com> on Wednesday July 08, 2009 @08:48PM (#28630723) Homepage

    That would definitely be a bug. I'll look into it... :)

  • by Grincho ( 115321 ) on Wednesday July 08, 2009 @09:14PM (#28630951) Homepage

    To be fair, I wouldn't say the Zope database (ZODB) is not a "solid foundation". It's one of the best parts of the Zope stack and, in three years of dozens of clients using it in Zenoss, Plone, and other apps, I've never had it corrupt or lose any data. It's a proper DB (ACID, MVCC, and all that), and you can even lop transactions off the storage to go back in time. Don't expect it to be a relational DB with the ad hoc query tools typical thereof; it's an object DB, with the aim of persisting graphs of Python objects transparently.

    Now, if you aren't familiar with it, the ZODB can indeed seem opaque, but, just like any DB, there are tools to read and modify it. At the highest level, just stick "manage" after your Zenoss URL, e.g. http://example.com/zport/dmd/manage [example.com] . That'll get you into the web-based Zope Management Interface (colloquially, "the ZMI"), where you can poke around at any object that someone's bothered to write a UI for. Deeper than that, you can connect to ZEO (a server that brokers access to the ZODB over a socket) and mess with the object graph using normal Python. When you're done, "import transaction; transaction.commit()". (The Zenoss developers are probably trying to scare you away from such digging around in fear that you'll violate their objects' invariants and leave them a real mess to solve.)
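    For anyone who wants to try the ZEO route described above, here is a minimal sketch. It assumes a ZEO server listening on localhost:8100; the Zenoss object layout (zport/dmd, the Devices organizer) varies by version, so treat that traversal as illustrative rather than authoritative.

        from ZEO.ClientStorage import ClientStorage
        from ZODB import DB
        import transaction

        storage = ClientStorage(("localhost", 8100))   # assumed ZEO address/port
        db = DB(storage)
        conn = db.open()
        root = conn.root()

        app = root["Application"]       # the Zope application object at the ZODB root
        dmd = app.zport.dmd             # Zenoss's data root (layout is version-dependent)
        print(list(dmd.Devices.objectIds()))   # e.g. list the device organizers

        # ...modify objects with normal Python, then:
        transaction.commit()
        conn.close()
        db.close()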

    Now, I don't say that Zope isn't scary; it has over 10 years of scary stored up in it. But the ZODB is a cuddly, loving part.

    Cheers!

  • Re:I Hate War Rooms (Score:3, Informative)

    by rossz ( 67331 ) <ogre&geekbiker,net> on Wednesday July 08, 2009 @09:40PM (#28631145) Journal

    "The ability to map complex relationships. I don't want 50 alerts that I can't reach host X, host Y, etc. I want one alert that I can't reach router A. Even better, I want to map things so that I can say "end user application XYZ is not accessible in Kansas due to X being down"."

    When you have your parent/child relationships and your dependencies set up properly, Nagios does this very well. A properly configured Nagios system will alert you only for that switch that died, not for the 200 services behind that switch that you can't reach.

    "I want my monitoring solution to understand HA and service degradation. I want programmable rules about what happens when X is down or Y is down."

    There's a "cluster" plugin for Nagios available, but I consider it a hack for something that should be inherently supported.

    "I want many options for escalation. If X doesn't acknowledge, try Y after 15 mins, etc."

    Nagios could be improved here. I can set it up to fire off a script when a hard failure is detected and do something different, e.g. HUP apache, but there isn't a way to directly configure alternative test options.
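    For reference, the "fire off a script" part is Nagios's event handler hook. Here is a rough sketch of such a handler, assuming it is registered with an event_handler command that passes the usual $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ macros as arguments (the init script path is environment-specific):

        #!/usr/bin/env python
        # Sketch of a Nagios service event handler that reloads Apache on failure.
        import subprocess, sys

        state, state_type, attempt = sys.argv[1], sys.argv[2], sys.argv[3]

        if state == "CRITICAL" and state_type == "HARD":
            # Failure confirmed after max_check_attempts retries: try a reload.
            subprocess.call(["/etc/init.d/httpd", "reload"])
        elif state == "CRITICAL" and state_type == "SOFT" and attempt == "3":
            # Or act early on the third soft failure, as the Nagios docs suggest.
            subprocess.call(["/etc/init.d/httpd", "reload"])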

    "I don't ever, ever want a pager to explode or be flooded. A problem should be noticed once and tracked. There should be no pager blizzards."

    You can configure Nagios for how many times you want an alert to fire.

    "Of course, I don't want this thing relying on my mail system for paging because, of course, my mail system could go down. An ability to dial out if the mail system is down would be nice."

    Supported in Nagios.

    "I want agents, hooks, interfaces, third-party add-ons, and every possible way of tying something into the monitoring system. I don't want dumb limitations like "you can only get an exit code from the OS and it acts on that" or something. For big monitoring, it's almost mandatory that some kind of API for agents is exposed."

    That would require specialty software be running on the system being monitored. Not exactly feasible when dealing with every type of equipment imaginable.

    "I want "I'm working on it, stop paging" blackouts. I want to be reminded to lift them."

    Every monitoring system I've used supports this.

    "I want it to tie into my change-management system. If I open a ticket and say that server X is down for 2 hours on this date, I don't want to have to remember to black it out."

    That would require some kind of custom hook into your ticketing system. The monitoring system needs to have an open API for injecting commands so that anyone could write their own script. I know Nagios can do this. I don't know about others.
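    For the curious, the "open API for injecting commands" in Nagios is the external command file (a named pipe the daemon reads). Here is a sketch of a script a ticketing system could call to schedule downtime; the command-file path below is the usual source-install default and SCHEDULE_HOST_DOWNTIME is a standard external command, but check both against your installation:

        #!/usr/bin/env python
        # Inject a downtime window into Nagios via the external command file.
        import time

        CMD_FILE = "/usr/local/nagios/var/rw/nagios.cmd"   # adjust to your install

        def schedule_downtime(host, hours, author, comment):
            now = int(time.time())
            duration = hours * 3600
            # Format: SCHEDULE_HOST_DOWNTIME;host;start;end;fixed;trigger_id;duration;author;comment
            line = "[{0}] SCHEDULE_HOST_DOWNTIME;{1};{2};{3};1;0;{4};{5};{6}\n".format(
                now, host, now, now + duration, duration, author, comment)
            with open(CMD_FILE, "w") as cmd:   # the command file is a FIFO
                cmd.write(line)

        schedule_downtime("serverX", 2, "ticket-system", "planned maintenance")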

    "I want reports. I don't care about silly little charts and graphs, but a history of everything that has ever gone wrong with device Y would be nice."

    Nagios does reports, but I feel they have lots of room for improvement.

    "I want more info on my page-receiving device than just "HOST X IS DOWN". I want context so I can decide if I have to drop everything immediately."

    This would require the kind of information you suggested above regarding the API.

    I'm a big fan of Nagios, but realize it has room for improvement.

  • Re:I Hate War Rooms (Score:3, Informative)

    by turbidostato ( 878842 ) on Wednesday July 08, 2009 @09:43PM (#28631173)

    "What you want in large-scale monitoring is:"

    Let's see:

    "The ability to map complex relationships. I don't want 50 alerts that I can't reach host X, host Y, etc. I want one alert that I can't reach router A."

    Nagios does this. I know; I configured mine that way.

    "Even better, I want to map things so that I can say "end user application XYZ is not accessible in Kansas due to X being down"."

    Nagios can do that, though I have never deployed it that way.

    "I want my monitoring solution to understand HA and service degredation."

    Nagios does this. I know; I configured mine that way.

    "I want programmable rules about what happens when X is down or Y is down."

    I don't know exactly what you mean, but if you mean the ability to automatically try to recover from a failing state, I think Nagios can do that. Not that I would want to go that path: I'm quite averse to "self-healing" systems and prefer early, meaningful alerts and then putting brains on the problem.

    "I want many options for escalation. If X doesn't acknowledge, try Y after 15 mins, etc."

    Nagios does this. I know; I configured mine that way.

    "I don't ever, ever want a pager to explode or be flooded. A problem should be noticed once and tracked. There should be no pager blizzards."

    Nagios does this. I know; I configured mine that way.

    "Of course, I don't want this thing relying on my mail system for paging because, of course, my mail system could go down. An ability to dial out if the mail system is down would be nice."

    Nagios does this. I know; I configured mine that way.

    "I want agents, hooks, interfaces, third-party add-ons, and every possible way of tying something into the monitoring system. I don't want dumb limitations like "you can only get an exit code from the OS and it acts on that" or something. For big monitoring, it's almost mandatory that some kind of API for agents is exposed."

    I don't know exactly what you mean, but I know you can extract from Nagios the very same information it is managing, through e.g. a socket, or push it to a database for further inspection. Apart from this, it's open source, you know.

    "I want "I'm working on it, stop paging" blackouts. I want to be reminded to lift them."

    Nagios does this. I know; I configured mine that way (well, to auto-lift the "I'm working on it" after Nagios detects the service online again... and it automatically closes the previously opened ticket on our service desk too).

    "I want it to tie into my change-management system. If I open a ticket and say that server X is down for 2 hours on this date, I don't want to have to remember to black it out."

    Nagios does this. I know; I configured mine that way.

    "I want reports. I don't care about silly little charts and graphs, but a history of everything that has every gone wrong with device Y would be nice."

    Nagios does this. I know; I configured mine that way. It is the "little charts and graphs" where Nagios is actually quite lacking.

    "I want more info on my page-receiving device than just "HOST X IS DOWN". I want context so I can decide if I have to drop everything immediately."

    Nagios does this. I know; I configured mine that way.

    "These are some of the things that make sane large monitoring systems. I don't think any open source product has all of them, alas"

    You didn't research this all that much. Please note that I'm not affiliated with Nagios in any way; I'm just a satisfied user.

  • by Anonymous Coward on Thursday July 09, 2009 @06:19AM (#28634103)

    RTFM!

    http://oss.oetiker.ch/mrtg/doc/mrtg-reference.en.html

    WithPeak

  • by scarolan ( 644274 ) on Thursday July 09, 2009 @10:03AM (#28635893) Homepage

    We used OpManager in production for over a year. It has terrible Linux support. None of their built-in plugins worked properly for monitoring even basic parameters like disk space, free memory, CPU usage, etc. When we pointed this out to their support people, they said we should build our own plugins with SNMP OIDs. Um... no. Not for the amount of money we paid for that steaming POS. We finally kicked OpManager to the curb about a month ago, and now have our entire environment, Windows and Linux servers alike, monitored with Nagios. Nagios scales well; we are currently watching several hundred hosts and about 3500 services.

    OpenNMS is also a good tool; its ability to map servers back to switch ports is extremely handy.

  • by gudmo ( 920200 ) on Thursday July 09, 2009 @12:34PM (#28638167)
    The solution is really simple: if you can program at all, then Hobbit/Xymon with Devmon is your only choice.
    Create your own weathermap for the 2D view; you never need a full 2D map of 5,000 hosts... Less is more.

    1. Free
    2. Fully customizable
    3. Easy administration
    4. Offers clients for all the major OSes (and quite a few minor ones)
    5. Large support base (users with a high technical level)
    6. Nice author (replies to comments and considers all ideas)
    7. You can write a test for anything you can think of and easily add it into Hobbit
    8. Offers client/server monitoring, remote monitoring, script monitoring, and SNMP monitoring (Devmon) or scripts

    The possibilities with Hobbit are endless

    Personally I use Hobbit to monitor over 2,400 devices, including Cisco hardware, AIX, Windows servers, VMware clusters, Exchange, SharePoint, etc.
    I've never encountered a system I could not monitor with Hobbit (or with scripts that send their results into Hobbit).
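    To make the "scripts that send their results into Hobbit" part concrete, here is a rough sketch of a custom test. Hobbit/Xymon extension scripts are normally run by the client, which exposes the reporting binary, the display server and the hostname in environment variables ($BB/$BBDISP/$MACHINE on Hobbit, $XYMON/$XYMSRV on Xymon); the exact names depend on the version, and the queue directory below is just an example.

        #!/usr/bin/env python
        # Sketch of a custom Hobbit/Xymon test that reports a status column.
        import os, subprocess, time

        def report(testname, color, message):
            bb = os.environ.get("XYMON") or os.environ.get("BB")
            server = os.environ.get("XYMSRV") or os.environ.get("BBDISP")
            host = os.environ.get("MACHINE", "localhost")   # comma-form hostname
            status = "status {0}.{1} {2} {3} {4}".format(
                host, testname, color, time.ctime(), message)
            subprocess.call([bb, server, status])

        # Example: go red if a (hypothetical) queue directory grows too large.
        backlog = len(os.listdir("/var/spool/myqueue"))
        report("myqueue", "red" if backlog > 1000 else "green",
               "queue backlog: {0} files".format(backlog))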
  • Just my $0.02. (Score:2, Informative)

    by wr37chd00d ( 893862 ) on Thursday July 09, 2009 @06:37PM (#28643585)
    We are using a combination of Cacti and mon [kernel.org] to monitor about 200 devices, both network gear and PC servers. Cacti is used to graph performance data (bandwidth, CPU, memory, temperature) and maps for the visually inclined, while mon is used to do the actual service monitoring and alerting.

    I won't comment on Cacti, since it has been mentioned here already, though I will say that you CAN change the default behavior of "sample averaging" by increasing the size of the RRD database. There are discussions on the Cacti forum/wiki that cover this topic.

    Mon, on the other hand, I didn't see mentioned at all, so here's my blurb on that. The core of mon is a scheduler written in Perl, which handles running monitor tests (also Perl, or any script/program that can exit with a 1/0) and then alerting (also Perl, or other languages, and it can do more than just sending mail or paging) when necessary, based on the configuration for that service. Like most open source projects, it is extremely flexible if you make the initial time investment to set up your tests and dependencies correctly, but once this is done, the tests/alerts can be reused or further modified. There are quite a few monitor tests and alert scripts already included, along with some handy tools for interaction through a web browser (via moncgi), generating dependency trees, generating reports, and more. There's also a Perl module, Mon::Client, that provides an API for interacting with the mon scheduler. The downside, besides configuring it with a text file (m4 can be helpful here), is that there hasn't been any activity since 2007 (according to the CVS repo on SourceForge).
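    To give a feel for how simple a mon monitor can be, here is a rough Python sketch. As I recall mon's conventions, the hosts of the watched group are passed as arguments, the script exits 0 when everything is fine, and on failure it exits non-zero with the failed hosts summarized on the first line of output; double-check against the mon docs before relying on this.

        #!/usr/bin/env python
        # Sketch of a mon-style monitor: checks that TCP port 22 answers on each host.
        import socket, sys

        failed = []
        for host in sys.argv[1:]:
            try:
                with socket.create_connection((host, 22), timeout=5):
                    pass
            except OSError:
                failed.append(host)

        if failed:
            print(" ".join(failed))   # summary line that mon records and alerts on
            sys.exit(1)
        sys.exit(0)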

    Probably not the solution for an extremely large number of hosts (though resource-wise it could handle it), but maybe someone else might be able to benefit from it. If you need very specific tests (number of BGP routes, verifying NH on routes, customer redundancy) and smart alert logic, it's worth looking at.
  • by Anonymous Coward on Thursday July 09, 2009 @07:49PM (#28644407)

    I would suggest you give GroundWork a go. It's an amalgamation of the best open source monitoring tools mentioned previously in these comments, such as Nagios and Cacti, but they are fully integrated into one interface, which reduces complexity.

    GroundWork Open Source uniquely combines the most mature and successful open source projects available today into a single package. These amalgamated projects have been downloaded more than 4 million times over the last decade and have a strong codebase and a strong community behind them.

    Combining these projects into a single package that is commercially supported gives you a simplified deployment experience, a single console for managing and monitoring, and a comprehensive view of your IT operations efficiency.

    Other GroundWork Monitor advantages include open APIs and an open event manager, so the information collected by GroundWork Monitor can be bi-directionally shared with your other ITSM systems such as an asset tracker, a ticketing system or a CMDB.

    But the best part about GroundWork Monitor is its low cost for an enterprise IT management system, and its fast return on investment and value.

    GroundWork amalgamates and supports established, mature projects including:

            * Nagios® - for event handling and notification
            * SNMP and SNMPTT - for network management protocol
            * RRDtool - for underlying data collection and management
            * BIRT - for ad-hoc reporting
            * Ganglia - for grid and cluster monitoring
            * Cacti - for graphing and trending

    Each of these projects and more are included with each GroundWork Monitor edition tier. GroundWork Monitor has three editions: Enterprise, Professional and Community. All editions are based on the same codebase, so trading up is easy.

  • Try NETMON (Score:1, Informative)

    by Anonymous Coward on Thursday July 09, 2009 @08:43PM (#28644817)

    It's based on Debian; you can run scripts as an action; it can page and email, and it has a PostgreSQL backend (very easy to back up). It can do SNMP v1-3, syslog, traps, and port-monitoring graphs, and monitor CIFS/NFS volumes, Linux/Windows services, etc.

    http://www.netmon.ca
