[Lustre-discuss] Lustre SNMP module
Kilian CAVALOTTI
kilian at stanford.edu
Mon Mar 10 16:58:04 PDT 2008
Hi Brian,
On Monday 10 March 2008 03:04:33 pm Brian J. Murrell wrote:
> I can't disagree with that, especially as Lustre installations get
> bigger and bigger. Apart from writing custom monitoring tools,
> there aren't a lot of "pre-emptive" monitoring options available.
> There are a few tools out there like collectl (never seen it, just
> heard about it)
collectl is very nice, but like dstat and similar tools, it has to run
on each and every host. It can export its results over a socket though,
so it could serve as the basis for centralized monitoring of a Lustre
installation. And it provides detailed statistics too:
# collectl -sL -O R
waiting for 1 second sample...
# LUSTRE CLIENT DETAIL: READAHEAD
#Filsys Reads ReadKB Writes WriteKB Pend Hits Misses NotCon MisWin LckFal Discrd ZFile ZerWin RA2Eof HitMax
home 100 192 0 0 0 0 100 0 0 0 0 0 100 0 0
scratch 100 192 0 0 0 0 100 0 0 0 0 0 100 0 0
home 102 6294 23 233 0 0 87 0 0 0 0 0 87 0 0
scratch 102 6294 23 233 0 0 87 0 0 0 0 0 87 0 0
home 95 158 22 222 0 0 81 0 0 0 0 0 81 0 0
scratch 95 158 22 222 0 0 81 0 0 0 0 0 81 0 0
# collectl -sL -O M
waiting for 1 second sample...
# LUSTRE CLIENT DETAIL: METADATA
#Filsys Reads ReadKB Writes WriteKB Open Close GAttr SAttr Seek Fsync DrtHit DrtMis
home 0 0 0 0 0 0 0 0 0 0 0 0
scratch 0 0 0 0 0 0 2 0 0 0 0 0
home 0 0 0 0 0 0 0 0 0 0 0 0
scratch 0 0 0 0 0 0 0 0 0 0 0 0
home 0 0 0 0 0 0 0 0 0 0 0 0
scratch 0 0 0 0 0 0 1 0 0 0 0 0
# collectl -sL -O B
waiting for 1 second sample...
# LUSTRE FILESYSTEM SINGLE OST STATISTICS
#Ost Rds RdK 1K 2K 4K 8K 16K 32K 64K 128K 256K Wrts WrtK 1K 2K 4K 8K 16K 32K 64K 128K 256K
home-OST0007 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
scratch-OST0007 0 0 9 0 0 0 0 0 0 0 0 12 3075 9 0 0 0 0 0 0 0 3
home-OST0007 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
scratch-OST0007 0 0 1 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 0 0 0 0
home-OST0007 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
scratch-OST0007 0 0 1 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 0 0 0 0
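For reference, collectl gathers all of the above from /proc/fs/lustre on
each node, so you can also eyeball the raw counters yourself. On a
1.6-era client the files backing those three views should be roughly the
following (paths from memory, so double-check against your local
/proc/fs/lustre layout):

# 1.6-era client paths, from memory -- adjust to your installation
cat /proc/fs/lustre/llite/*/read_ahead_stats   # readahead view (-O R)
cat /proc/fs/lustre/llite/*/stats              # metadata view (-O M)
cat /proc/fs/lustre/osc/*/rpc_stats            # per-OST RPC size histograms (-O B)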
> and LLNL have one on sourceforge,
Last time I checked, it only supported Lustre 1.4, but it's been a while,
so I'm probably a bit behind.
> but I can certainly
> see the attraction of being able to monitor Lustre on your servers
> with the same tools you are using to monitor the health of the
> servers themselves.
Yes, that'd be a strong selling point.
> This could wind up becoming a lustre-devel@ discussion, but for now, it
> would be interesting to extend the interface(s) we use to
> introduce /proc (and what will soon be its replacement/augmentation)
> stats files so that they are automagically provided via SNMP.
That sounds like the way to proceed, indeed.
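For what it's worth, the current lustre-snmp module already plugs into
net-snmp as a shared-object plugin, so a subset of those /proc values is
reachable today. From memory (1.6-era paths, and oss01 is a placeholder
hostname, so treat this as a sketch rather than gospel), enabling it
looks roughly like this on a server node:

# /etc/snmp/snmpd.conf -- plugin path is the 1.6 default, adjust as needed
dlmod lustresnmp /usr/lib/lustre/snmp/lustresnmp.so

# restart snmpd, then from the monitoring host walk the private
# enterprises subtree; the Lustre objects should appear under the
# Lustre/CFS enterprise branch
snmpwalk -v 2c -c public oss01 enterprises

Extending that module (or whatever replaces /proc) to pick up every
stats file automatically is exactly what would make it generally useful.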
> You know, given the discussion in this thread:
> http://lists.lustre.org/pipermail/lustre-devel/2008-January/001475.html
> now would be a good time for the community (that perhaps might
> want to contribute) desiring SNMP access to get their foot in the
> door. Ideally, you get SNMP into the generic interface and then SNMP
> access to all current and future variables comes more or less free.
Oh, thanks for pointing this out. It looks like major underlying changes
are coming. I think I'll subscribe to the lustre-devel mailing list to
try to follow them.
> That all said, there are some /proc files which provide a copious
> amount of information, like brw_stats for instance. I don't know how
> well that sort of thing maps to SNMP, but having an SNMP manager
> watching something as useful as brw_stats for trends over time could
> be quite interesting.
Add some RRD graphs to keep track of historical variations, and you've
got the all-in-one Lustre monitoring tool we sysadmins have all been
waiting for. ;)
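Just to sketch what I mean (the OID below is a placeholder, not the real
Lustre one, and the DS layout is only an example): on the server side,
brw_stats lives under /proc/fs/lustre/obdfilter/*/brw_stats on the OSSes,
and once one of its buckets (or any other per-OST counter) is exported
via SNMP, feeding it into an RRD for trending is trivial:

# create an RRD: 60 s step, one day of 1-minute and one year of 1-hour averages
rrdtool create ost0007-writes.rrd --step 60 \
    DS:write_bytes:COUNTER:120:0:U \
    RRA:AVERAGE:0.5:1:1440 \
    RRA:AVERAGE:0.5:60:8760

# poll once a minute from cron; .1.3.6.1.4.1.X.Y.Z is a placeholder OID
VALUE=$(snmpget -v 2c -c public -Ovq oss01 .1.3.6.1.4.1.X.Y.Z)
rrdtool update ost0007-writes.rrd N:$VALUE

# graph the last 24 hours
rrdtool graph ost0007-writes.png --start -1d \
    DEF:w=ost0007-writes.rrd:write_bytes:AVERAGE \
    LINE2:w#0000ff:"OST0007 write bytes/s"

Wrap that in a bit of cron and a web page and you're most of the way there.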
Cheers,
--
Kilian