[Linux_hpc_swstack] nagging nagios feelings

Makia Minich makia at sun.com
Tue Mar 3 16:35:35 PST 2009


So I've been doing some configuration and testing with Nagios and have 
been having this nagging feeling that it is going to lead to some pretty 
major issues in the future.  Monitoring is a "need-to-have" in the HPC 
world, and the leaders in this pack so far are Nagios and Ganglia. 
While we've been including Ganglia in the stack, we've never really 
helped with its configuration, since other tasks took priority.  For 
the next release, it seemed like a good idea to pause and see whether 
Ganglia is the right choice, or whether Nagios can provide some more 
options.

My biggest concern, right now, is the scalability of Nagios.  It shows 
up first in how Nagios is configured.  To define a cluster, you must 
create a host entry for each host within the cluster; while this is 
easy enough and scriptable, it raises the question "are you thinking 
about 1000, or 10000, or even more nodes?"  Yes, this is only the 
configuration file, but the same concern carries over into the 
monitoring itself.  Nagios uses a polling method to check every service 
on every node.  In the case of 10K nodes, how long will it take for the 
same node to be checked twice, or three times?  How long will it take 
before we find out that it's down?
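
To put rough numbers on the polling question, here is a back-of-envelope 
sketch in Python.  It is an idealized model, not a description of how 
Nagios actually schedules checks: it assumes every check takes the same 
time and that a fixed number of checks run in parallel.  The figures used 
(5 services per node, 2 seconds per check, 50 parallel checks) are 
assumptions chosen for illustration, not measurements.

```python
import math


def sweep_seconds(nodes, services_per_node, seconds_per_check, parallel_checks):
    """Idealized time to poll every service on every node once.

    Assumes uniform check duration and a fixed degree of parallelism;
    real schedulers and networks will behave differently.
    """
    total_checks = nodes * services_per_node
    # Checks run in batches of `parallel_checks`; each batch costs one check duration.
    batches = math.ceil(total_checks / parallel_checks)
    return batches * seconds_per_check


# Illustrative parameters only (assumed, not measured).
for nodes in (1000, 10000, 100000):
    seconds = sweep_seconds(nodes, services_per_node=5,
                            seconds_per_check=2.0, parallel_checks=50)
    print(f"{nodes:>6} nodes: {seconds / 60:.1f} minutes per full sweep")
```

Under these assumptions the time between two checks of the same node 
grows linearly with the node count, so a failed node could go unnoticed 
for up to one full sweep.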

Perhaps I just don't understand the configuration options available to 
me (which is why I'm writing, hoping someone tells me I'm stupid). 
Perhaps there are other ways to approach this with Nagios (e.g., use 
scalable units that each only monitor a subset of nodes).  Any thoughts 
out there?
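
As one way to picture the "scalable units" idea, the sketch below splits 
a node list into fixed-size groups and renders a Nagios host entry for 
each node, so that each group could be handed to its own monitoring 
instance.  The node names (node0001, ...), the unit size of 1000, and the 
"generic-host" template name are assumptions for illustration; how the 
per-unit results would be aggregated is left open.

```python
def partition(hosts, unit_size):
    """Split a list of hosts into consecutive groups of at most unit_size."""
    return [hosts[i:i + unit_size] for i in range(0, len(hosts), unit_size)]


def render_host(name, address, template="generic-host"):
    """Render one Nagios host object definition as text.

    The template name is an assumption; use whatever host template
    the local configuration defines.
    """
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


# Hypothetical cluster of 10000 nodes named node0001 .. node10000.
hosts = [f"node{i:04d}" for i in range(1, 10001)]
units = partition(hosts, unit_size=1000)

# One configuration text per unit; the hostname doubles as the address here.
configs = ["".join(render_host(h, h) for h in unit) for unit in units]

print(len(units), "units of", len(units[0]), "nodes each")
print(configs[0].split("}\n")[0] + "}")
```

Each unit then polls only its own subset, which bounds the sweep time 
per unit regardless of the total cluster size.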

(This has been cross-posted on http://blogs.sun.com/giraffe)
-- 
"A simile is not a lie, unless it is a bad simile."
- Christopher John Francis Boone

