Networking

Network Monitoring and Alerting?

SpamMonkey asks: "At work I am trying to implement a central monitoring and alerting service. We have in excess of 250 Windows servers, approx 15 AIX servers and another 30 Linux servers (mainly SLES/Suse). My investigation into systems that will allow us to monitor critical areas on each of these systems has so far led me to a clustered Linux server running Nagios with passive and active checks. What I'm curious about though is how Slashdot readers are carrying out their own jobs and how they can comfortably sit back, without having to repeatedly check that various systems are still operational and how to cut down their own response times when something goes wrong."
This discussion has been archived. No new comments can be posted.

  • Why do you need a cluster?
    • Re:Why? (Score:3, Interesting)

      by Asgard ( 60200 )
      If the monitoring server goes down then no pages will go out. In this case the poster may have meant 'redundant servers' instead of a true cluster. One can configure a Nagios environment such that two systems check on each other, and if the primary Nagios instance goes down the secondary Nagios instance will take over monitoring and alerting.
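The takeover idea above can be sketched in a few lines. This is not Nagios's own failover mechanism, just a minimal illustration of the logic: the standby stays quiet while it keeps hearing from the primary and starts alerting once the heartbeat goes stale. The class name `StandbyMonitor`, the 120-second timeout, and the injectable clock are all assumptions made for the example.

```python
import time


class StandbyMonitor:
    """Secondary instance that takes over when the primary's heartbeat goes stale."""

    def __init__(self, timeout_s=120.0, clock=time.monotonic):
        self.timeout_s = timeout_s      # hypothetical staleness threshold
        self.clock = clock              # injectable so the logic can be tested
        self.last_heartbeat = clock()
        self.active = False             # True while the standby is doing the alerting

    def record_heartbeat(self):
        """Call whenever the primary proves it is alive."""
        self.last_heartbeat = self.clock()
        self.active = False             # primary is back, so hand control back

    def poll(self):
        """Return True if the standby should be monitoring and alerting right now."""
        if self.clock() - self.last_heartbeat > self.timeout_s:
            self.active = True
        return self.active
```

In practice the heartbeat would come from whatever check the standby already runs against the primary, and a symmetric check in the other direction would tell you when the standby itself has died.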
  • Big Brother (Score:4, Informative)

    by one8zero ( 832111 ) on Monday February 28, 2005 @07:26PM (#11807931)
    Used at my Fortune 500 company to monitor thousands of Win/Linux/Solaris. http://www.bb4.org/
  • Simple (Score:2, Interesting)

    No sitting back.

    Security/monitoring is a process not a product. When you finish the checklist you start over.

    Also, I would recommend trying to cut back on the Windows servers somehow; maybe mention the right words when the licenses expire or it's that time of the year to upgrade to a "free" solution?
    • He's not talking about security monitoring. He's talking about: "This machine is about to run out of disk space, someone should really fix it before it actually does." "This machine isn't running Apache, and it should be." "This machine isn't responding to ping, and it should be." I run Nagios (the new name of NetSaint), and it's wonderful for this.

      He's talking about day to day alerts to notify if a machine's services aren't working when they should be. That'd be SNMP alerts, MRTG, something akin to Na

  • Nagios (Score:3, Insightful)

    by codejnki ( 16214 ) on Monday February 28, 2005 @07:32PM (#11807998) Homepage
    I implemented Nagios at my work and am very happy with it. We are mostly a Citrix shop, so rogue processes on Windows servers are the bane of things running smoothly. We monitor average CPU load on all our application servers.


    All in all I'm monitoring about 200 different processes across our network as well as running MRTG on the same box. Never once felt I needed to cluster.

    • Re:Nagios (Score:3, Insightful)

      by dubious9 ( 580994 ) *
      I like Nagios, but it can be difficult to set up. To get the same level of functionality as MS MOM, I had to write a number of plugins. Plugin writing is really easy if you know some Perl; it's basically just creating a CLI utility that outputs in a certain format that Nagios can understand.
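For anyone wondering what "a certain format" means: a Nagios plugin prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). Here is a minimal sketch of a disk-space check in that style. The poster used Perl; Python is used here only for illustration, and the thresholds and the function names `evaluate` and `main` are assumptions, not anything from a shipped plugin.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to (exit_code, status_line); thresholds are illustrative."""
    if percent_used >= crit:
        return CRITICAL, f"DISK CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warn:
        return WARNING, f"DISK WARNING - {percent_used:.1f}% used"
    return OK, f"DISK OK - {percent_used:.1f}% used"


def main(path="/"):
    """Check one filesystem, print the status line, and return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Could not even measure: report UNKNOWN rather than guessing.
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, line = evaluate(100.0 * usage.used / usage.total)
    print(line)
    return code
```

To run it as an actual plugin you would finish the script with `sys.exit(main())` so that Nagios sees the exit code.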

      Remote services can be checked with ease, but stuff that needs to query the local system, disk-space or load for example, needs a different setup. I ran it through SSH, but I'm not sure the kind of load that wo
    • Same here. I'm monitoring 595 services on 99 hosts right now on a single P4 2.8GHz desktop box, running MySQL and MRTG to monitor about 60 hosts, though it's starting to chug as I'm in the process of moving to Cacti and it's running at the same time.

      For that number of boxes, you shouldn't need any cluster, though a second redundant box would make sense in case the first dies.

      As for the Windows machines, check out NRPE-NT. It's a service that'll run on each Windows box and run any required plugins.
  • by antifoidulus ( 807088 ) on Monday February 28, 2005 @07:33PM (#11808001) Homepage Journal
    is still someone calling you up at 3 am screaming at you that the network is down and yelling at you to get your ass in there asap. Works every time!
  • Its Easy. (Score:2, Funny)

    by MistabewM ( 17044 )
    You blame the new hire. If there is more than one new hire you blame the one that spends his waking life playing everquest or .

    As for preventive measures, you put the new hire in charge of the task and sit back and relax.

    1st rule of management is misdirection.
  • Also curious (Score:5, Interesting)

    by RabidMonkey ( 30447 ) <canadaboy.gmail@com> on Monday February 28, 2005 @07:35PM (#11808018) Homepage
    We are in the middle of a large scale Linux project .. replacing 900 SCO Unix servers with Linux. We are wondering the same thing .. what monitoring tools can be put in to watch these servers?

    Our big hitch comes from the fact that we only have a satellite connection to each of the remote sites, so we can't do real time monitoring, so things like HP Openview NNM are out of the question - they use too much bandwidth.

    Our solution (And the reason I'm working late right now) is to build a custom suite of tools that does batch reporting every night by polling logs and custom programs, then sends it back in a handy xml file. We take that file, dump it into a large informix database, and then we can do whatever we want to create reports.

    It's a little more work than just installing a package, but we're getting EXACTLY what we want out of the product. It works with our very unique communications and configuration, and it's modular so I can add whatever monitoring/checks I want by writing a new ksh script. All the output is standardized and all the parsing is done at the office by a very clever xml parser one of the db guys wrote.
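A rough sketch of the batch idea described above: each site collects its check results, serializes them to one small XML file for the nightly transfer, and the office side parses that file back into records for the database. The poster's checks are ksh scripts and the real schema is not shown, so the element names (`report`, `check`), the attributes, and both function names here are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def build_report(site, results):
    """Serialize {check_name: (status, detail)} for one site into an XML string."""
    root = ET.Element("report", site=site)
    for name, (status, detail) in sorted(results.items()):
        check = ET.SubElement(root, "check", name=name, status=status)
        check.text = detail
    return ET.tostring(root, encoding="unicode")


def parse_report(xml_text):
    """Office side: turn the XML back into (site, {check_name: (status, detail)})."""
    root = ET.fromstring(xml_text)
    checks = {
        c.get("name"): (c.get("status"), c.text or "")
        for c in root.findall("check")
    }
    return root.get("site"), checks
```

Because every check emits the same shape, adding a new check at the remote end needs no change to the parser at the office, which is the modularity the post describes.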

    I think for what you're looking for, there are things like Big Brother, MRTG, HP OpenView, umm ... the IBM equivalent (a Tivoli product, I think?). No need to reinvent the wheel.

    But I'd love some feedback from people who are working in a bandwidth-sparse shop like me ... how do you balance your need to know everything about every box all the time (well, the business and management want to know at least) vs the tiny amount of data you can push without interfering with mission-critical applications running over that same link?

    • Re:Also curious (Score:1, Interesting)

      by MistabewM ( 17044 )
      I would consider piggybacked systems in your case: cluster-type servers where one server monitors the other. If one goes down, alert the ?local? tech staff or alert someone... if you want immediate reporting you can have it jump onto the GSM / TDMA network and have it start blasting SMS text messages to your wireless phone.

      If you are using a realtime link you can just have it alert through the NOC.
    • Re:Also curious (Score:2, Insightful)

      Sorry to be shilling for the company I work for, but we can do exactly what you need. Indicative Service Directory [indicative.com] can collect constant data, but then only upload to the central collection server once a day, or even less frequently if desired.
    • Intermapper Remote (Score:3, Informative)

      Our big hitch comes from the fact that we only have a satellite connection to each of the remote sites, so we can't do real time monitoring, so things like HP Openview NNM are out of the question - they use too much bandwidth.

      You might be interested in Intermapper [intermapper.com] for its Remote [dartware.com] component. You can run a monitoring system at each location yet administer them centrally. Your remote datastream will basically be the set of events that's interesting to the Human In Charge.

      It's commercial software, so you ha
    • Among the others mentioned here, also check out SiteScope. While commercial, it's pretty inexpensive compared to most commercial products, and it can be used for polling, running scheduled scripts on a multitude of OSes, or running locally on the monitored box if you don't want to poll. It also interfaces with just about any email/paging/alerting system you'd want with customizable alerts, and has dependency checking so that you can get one alert that a router is down rather than 50 alerts that 50 servers behind that
    • I'm in a NOC that is bandwidth sparse to some sites - a lot of satellite links (128-384k.)

      We use OpenView. We get a lot of false alarms for 'Node Down', when the alert should read 'bandwidth impaired'. The easily choked 128k link reaches capacity and dumps the low-priority traffic. Naturally, the ICMP stuff like ping goes first.

      The IBM equivalent to OpenView is Tivoli NetView.
  • by jgaynor ( 205453 ) <jon@@@gaynor...org> on Monday February 28, 2005 @07:41PM (#11808082) Homepage
    So you're concerned about letting things go on auto-pilot and missing an alarm . . .

    Why not slap a modem into the head nagios box and have it page you [nagios.org] when things fail. Don't worry about having to wear a beeper - you can page most cellphones via your carrier's SMS [nagios.org] gateway (still dial-capable).

    Too much hassle? How about AIM [nagios.org]? YIM [nagios.org]? Jabber [nagios.org]? Email [nagios.org]?

    If you're TRULY the teeth jittering, chain smoking NOC type, buy some x10 crap and build a physical network alarm interface [rutgers.edu] like I did :). Either way, nagios has ASSLOADS of event notification options [nagios.org].
  • Monitoring Tools (Score:4, Interesting)

    by Plake ( 568139 ) <rlclark@gmail.com> on Monday February 28, 2005 @07:45PM (#11808109) Homepage
    Personally I've used an array of the free monitoring tools and find most of them to be decent. For larger-scale monitoring you'd want something where the clients push data to the monitoring systems, so the monitoring systems do far less work.

    Here's a couple of the monitoring solutions:

    Opennms [opennms.org]

    Mon [kernel.org]

    Big Brother [bb4.org]

    For system information polling I'd go with:

    Cacti [cacti.net]: hands down, this is the best polling system out there, and it's simple to set up and run.

    • I saw OpenNMS at LinuxWorld Boston. It is looking very spiffy. Definitely a sophisticated OSS network monitoring tool.

      Just to clarify for others (I know by your sig you know this already), MON is not so much a network monitoring application as it is a framework with some production-ready examples. I was really happy with my last MON setup, but I did put serious time into getting it set up and writing some scripts of my own. FWIW, this was 3 years ago, so maybe it comes with even more good stuff out of th
    • Not so well known as other oss projects:
      http://www.zabbix.com/ [zabbix.com].
      It has its quirks, and it can be a little difficult to set up and get used to the first time, but it does its job well
    • I agree about Cacti [cacti.net]...it was stoopid simple to install and setup. The same can be said about Cricket [sourceforge.net].

      If you are looking at Big Brother, I would probably recommend you look at Hobbit Monitor [sourceforge.net] instead. Hobbit is Open Source and unconstrained as is BB. They are developing a client-side piece to replace BBNT, as well. Hobbit extends the good things we like about BB and adds some other things we would have really liked to have SEEN in BB.

      It seems to me, also, that BB hasn't enjoyed much development activity for
    • We like the umbrella approach where the package can monitor the technology as well as the business processes that depend on it. It adds great value and gets the whole company behind your efforts - HP can do it, but if you like the quick and easy yet still comprehensive route, look at http://www.viewitpro.com/ [viewitpro.com]
  • Depending on what you are doing, but for that size of deployment, give Indicative [indicative.com] a call/email.

    I do testing for them, and we really do have a very cool system going. We monitor the systems, the network, and the applications. We have pre-defined tests for almost anything you could think of from simple http tests to fancier cisco router tests, to perfmon integration. We monitor Weblogic and Websphere, and soon are going to be able to monitor other app servers as well.

    So ya, give us a call and inquire
  • I use IPSentry [ipsentry.com] to monitor our servers. It's cheap, easy to use, and can check just about anything you'd want to. You can even write your own monitoring and alerting plugins if that's your thing.
  • polling and whatnot (Score:2, Informative)

    by bendsley ( 217788 )
    At work, we use a program called JFFNMS (Just For Fun Network Monitoring System). It's come a long way from its inception, and has the ability to run on *nix and Windows.

    We tried Cacti and just didn't like it. I've looked at Nagios, but not in detail.

    Find JFFNMS at www.jffnms.org
  • by blunte ( 183182 ) on Monday February 28, 2005 @08:43PM (#11808569)
    This thing rocks.

    Profiler Rx [tek-tools.com]

    • Tabulate results from some system health monitoring applications, and send them to a web page.
    • Have the web page refresh itself every 10 seconds.
    • Also, send yourself email, when really bad things happen.
    • Highlight problem areas in a color such as red.
    • Leave this web page on throughout the day.
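The list above is easy to sketch: take whatever results your health checks produce, render them as an HTML table, have the page reload itself every 10 seconds with a meta refresh tag, and colour failing rows red. The row format `(host, check, ok, detail)` and the function name `render_status_page` are assumptions for the example; the email-on-really-bad-things step is left out.

```python
import html


def render_status_page(rows, refresh_s=10):
    """Render (host, check, ok, detail) rows as a self-refreshing HTML status table."""
    lines = [
        "<html><head>",
        # Ask the browser to reload the page every refresh_s seconds.
        f'<meta http-equiv="refresh" content="{refresh_s}">',
        "<title>System health</title></head><body><table>",
    ]
    for host, check, ok, detail in rows:
        # Highlight problem rows in red; healthy rows get no styling.
        style = "" if ok else ' style="background:red"'
        values = (host, check, "OK" if ok else "FAIL", detail)
        cells = "".join(f"<td>{html.escape(str(v))}</td>" for v in values)
        lines.append(f"<tr{style}>{cells}</tr>")
    lines.append("</table></body></html>")
    return "\n".join(lines)
```

Write the returned string to a file your web server already serves, regenerate it from cron or from the monitoring loop, and leave the page open on a spare monitor.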
  • Rule #1: Do not add to the problem.

    This is an entire category of Operations Management and can encompass everything. Don't take it lightly and don't be afraid to start small. The first thing you need to do is categorize what you want to monitor into individual sections and work on the easiest stuff first. By the time you work up to the tough stuff, you'll have an idea of what's available and what your capabilities are, and hopefully the easy stuff can be quickly rewritten/integrated into a better solution. Don't miss the critical stuff in a morass of junk alerts. Sample consideration (everything that moves in the data center):

    Hardware:
    ---------
    1) Server
    2) Storage
    3) Network
    4) Power
    5) Environment
    6) Security

    Software:
    ---------
    1) OS
    2) Applications
    3) Security

    Events:
    -------
    1) Failures
    2) Alerts
    3) Misfires
    4) Security

    Triggers:
    ---------
    1) Notification/False positives
    2) Action plans/Event handling
    3) Documentation, Documentation, Documentation
    4) Reporting aka analysis and cya
    5) Security

    Once you're done building it, start over. The last tier is the most visible e.g. delegating a raid rebuild page to the opcenter flunky without proper documentation is a Career Limiting Move (CLM), building the best monitoring system is a fucking waste unless you pay attention to it. The most apt cliches for monitor normalisation are all military: Warrooms, Bridges, Weapons Hot, Communications channels etc. View everything as SNAFU and work from there.

    Rule #1: Do not add to the problem.
  • Well, hope this helps, but there is the package i've written over the past few years (oh my, it's been about 9-10 now..ugh) that i call sysmon [sysmon.org]. It does network and host based (application) monitoring for most of the well-known services. It's designed to alert to e-mail, so you just need to set up your local e-mail -> qpage daemon on that host and connect modem or TAP/SMS device to your serial port. You could also notify via SNPP as part of qpage or similar process..

    It's set up to do network topology

  • I haven't heard anyone mention NetIQ's Appmanager.
    http://www.netiq.com/products/am/default.asp [netiq.com]
    I think Microsoft sold a watered down version of it as MOM which seems to never have taken off.

    Anyway, we looked at all the monitoring packages for keeping track of over a hundred Windows servers and it was the best for us due to flexibility, mid-cost, and capabilities to be centralized but monitor remote sites locally.

    You can even write your own jobs to do what you want with almost no limits.

    It has evolved an
  • Zabbix (Score:1, Interesting)

    by Anonymous Coward

    Check Zabbix [zabbix.com] if you're looking for a solution which is free, supports all platforms, and is easy to deploy. Look at the screenshots.

    My company has been using it for several months in a mixed Windows/Unix environment (~180 servers) with great success. We use nearly all the features Zabbix provides (notifications, graphs, network maps, SLA monitoring, cool screens). Very useful stuff indeed. We tried Nagios before, but found it complex and hard to maintain. Besides, the performance of Nagios was disappointing (not enough tu

    • Re:Zabbix (Score:3, Insightful)

      by egon ( 29680 ) *
      I've played with Nagios a fair bit. When I saw the Zabbix article in the recent Linux Magazine, I thought I'd check it out. The graphing looked much easier to do, and I liked the concept of "screens".

      Having used it for a couple weeks now, I would call it "software with potential". It's not quite there yet, and has the feeling of being immature software. This is not a dig - quite the contrary. I think that zabbix has a *lot* of potential, but I think it needs a little more time before it's ready.
  • try argus... (Score:3, Informative)

    by bani ( 467531 ) on Tuesday March 01, 2005 @08:26AM (#11811389)
    argus [tcp4me.com] is not bad, is pretty simple to set up, and scales reasonably well.

    it's also pretty flexible so you can plug it into just about any paging system and monitor just about any service you can imagine.

    it handles hierarchies so you don't get 300 pages at once, etc. and it has a simple, fast, clean web interface which isn't bloated with gigabytes of shiny widgets and is even perfectly usable via lynx.

    it has a few rough edges but the overall ease of use and simplicity make up for it.
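The hierarchy handling mentioned above (one page for the dead router instead of 300 for everything behind it) comes down to a parent lookup. This is a minimal sketch of the general technique, not how argus itself implements it; the `parent` mapping and the function name `alerts_to_send` are assumptions for the example.

```python
def alerts_to_send(down, parent):
    """Return the down nodes that have no down ancestor.

    down   -- set of node names currently failing
    parent -- dict mapping each node to the node it depends on (or None)
    """
    send = []
    for node in sorted(down):
        ancestor = parent.get(node)
        seen = set()            # guards against accidental cycles in the config
        suppressed = False
        while ancestor is not None and ancestor not in seen:
            if ancestor in down:
                # Something upstream is already down; its alert covers this node.
                suppressed = True
                break
            seen.add(ancestor)
            ancestor = parent.get(ancestor)
        if not suppressed:
            send.append(node)
    return send
```

With `parent = {"web1": "router1", "web2": "router1", "router1": None}`, a dead router plus two unreachable web servers produces a single alert for `router1`, while `web1` failing on its own is still reported.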
  • At my workplace, we're using HP-OVO. We have 300+ HP-UX, 200+ Windows and 50-odd Linux servers. It is highly customizable and allows you to set up what all to monitor, the time slots within which to ignore them, etc etc etc. You basically have to install SPIs (Smart Plug-Ins) for each of the parts you wanna monitor (OS, DB, hardware, custom applications...)

    And yes, it can also be configured to alert you via email/SMS.

    -- rxmx --

  • Look at Klir Technologies (www.klir.com). Klir has developed an enterprise-wide tool and dashboard technology that increases the visibility into the performance of all infrastructure (data, wireless, VPN, servers, applications, routers, etc) without the use of agents, hardware, or software. They also tie performance to cost so executives and directors can quantify ROI on existing infrastructure, and get a better understanding of how infrastructure is being used in a business context.
  • I think you've highlighted an important notion when you talk about creating a centralized alerting system that will alert you to problems but will also help cut down your response time when there is a problem.

    The key here is to move beyond just instrumentation, and add a layer of service mapping and correlation. The challenge is to map the data and events you will get from Nagios (or other monitoring tool/agent) to the services that the systems are provisioning and your service objectives (performance/ava

  • I so far haven't come across anything open source that follows the Netcool event processing model. For those of you who never had to use Netcool (it's big $$$, btw, but very popular among large telcos/ISPs), all it is is a relational-database-based event processor that heavily relies on stored procedures and triggers to perform "deduplication" and correlation and is highly customizable. I don't see anything that their (proprietary) database engine does that couldn't be done with PostgreSQL.
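For readers who have not seen it, "deduplication" here roughly means that a repeated event updates an existing row (bump a counter, refresh the last-seen time) instead of adding a new one. Below is a minimal in-memory sketch of that idea. It is not Netcool's schema or trigger logic; the key `(node, alert_key)`, the field names, and the class `EventStore` are assumptions. A database version would presumably do the same thing with an upsert or a trigger.

```python
from dataclasses import dataclass


@dataclass
class Event:
    node: str
    alert_key: str
    summary: str
    first_seen: float
    last_seen: float
    tally: int = 1          # how many times this event has been reported


class EventStore:
    """Keeps one row per (node, alert_key); repeats bump the tally instead of adding rows."""

    def __init__(self):
        self._events = {}

    def insert(self, node, alert_key, summary, timestamp):
        key = (node, alert_key)
        existing = self._events.get(key)
        if existing is None:
            self._events[key] = Event(node, alert_key, summary, timestamp, timestamp)
        else:
            # Duplicate: update in place rather than creating a new row.
            existing.tally += 1
            existing.last_seen = timestamp
            existing.summary = summary
        return self._events[key]

    def rows(self):
        return list(self._events.values())
```

An operator console built on this shows one line per problem with a repeat count, which is what keeps a flapping interface from filling the screen.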
  • I have not tried this one but, I saw it a few months ago and thought that it looked interesting. It isn't as powerful as OpenNMS or Nagios but it does have some advantages of its own.

    This one [linuxtech.cc] is Don O'Neill's own combination of several different small packages. Basically, it provides a very nice front end to fairly extensive MRTG monitoring. But, it does give you an idea of what you can do and it certainly looks customizable.
  • There's this product named ManageEngine OpManager which can do it for you. Try it out ... http://www.opmanager.com
  • Big brother
    http://www.bb4.org/

    and big sister

    http://bigsister.graeff.com/
    Big Sister does for you:

    * monitor networked systems
    * provide a simple view on the current network status
    * notify you when your systems are becoming critical
    * generate a history of status changes
    * log and display a variety of system performance data

    I worked for a company that's been using Big Brother for years. Great, but the config syntax sucks. Big Sister is easier, I believe, but I'm not sure.
