[Lustre-discuss] Lustre SNMP module
Kilian CAVALOTTI
kilian at stanford.edu
Mon Mar 10 16:58:04 PDT 2008
Hi Brian,
On Monday 10 March 2008 03:04:33 pm Brian J. Murrell wrote:
> I can't disagree with that, especially as Lustre installations get
> bigger and bigger. Apart from writing custom monitoring tools,
> there aren't a lot of "pre-emptive" monitoring options available.
> There are a few tools out there like collectl (never seen it, just
> heard about it)
collectl is very nice, but like dstat and similar tools, it has to run
on each and every host. It can export its results over a socket though,
so it could serve as the basis for centralized monitoring of a Lustre
installation. And it provides detailed statistics too:
# collectl -sL -O R
waiting for 1 second sample...
# LUSTRE CLIENT DETAIL: READAHEAD
#Filsys Reads ReadKB Writes WriteKB Pend Hits Misses NotCon MisWin LckFal Discrd ZFile ZerWin RA2Eof HitMax
home 100 192 0 0 0 0 100 0 0 0 0 0 100 0 0
scratch 100 192 0 0 0 0 100 0 0 0 0 0 100 0 0
home 102 6294 23 233 0 0 87 0 0 0 0 0 87 0 0
scratch 102 6294 23 233 0 0 87 0 0 0 0 0 87 0 0
home 95 158 22 222 0 0 81 0 0 0 0 0 81 0 0
scratch 95 158 22 222 0 0 81 0 0 0 0 0 81 0 0
# collectl -sL -O M
waiting for 1 second sample...
# LUSTRE CLIENT DETAIL: METADATA
#Filsys Reads ReadKB Writes WriteKB Open Close GAttr SAttr Seek Fsync DrtHit DrtMis
home 0 0 0 0 0 0 0 0 0 0 0 0
scratch 0 0 0 0 0 0 2 0 0 0 0 0
home 0 0 0 0 0 0 0 0 0 0 0 0
scratch 0 0 0 0 0 0 0 0 0 0 0 0
home 0 0 0 0 0 0 0 0 0 0 0 0
scratch 0 0 0 0 0 0 1 0 0 0 0 0
# collectl -sL -O B
waiting for 1 second sample...
# LUSTRE FILESYSTEM SINGLE OST STATISTICS
#Ost Rds RdK 1K 2K 4K 8K 16K 32K 64K 128K 256K Wrts WrtK 1K 2K 4K 8K 16K 32K 64K 128K 256K
home-OST0007 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
scratch-OST0007 0 0 9 0 0 0 0 0 0 0 0 12 3075 9 0 0 0 0 0 0 0 3
home-OST0007 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
scratch-OST0007 0 0 1 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 0 0 0 0
home-OST0007 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
scratch-OST0007 0 0 1 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 0 0 0 0
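For reference, collectl gathers all of the above from /proc/fs/lustre on
each node, so you can also eyeball the raw counters yourself. On a
1.6-era client the files backing those three views should be roughly the
following (paths from memory, so double-check against your local
/proc/fs/lustre layout):

# 1.6-era client paths, from memory -- adjust to your installation
cat /proc/fs/lustre/llite/*/read_ahead_stats   # readahead view (-O R)
cat /proc/fs/lustre/llite/*/stats              # metadata view (-O M)
cat /proc/fs/lustre/osc/*/rpc_stats            # per-OST RPC size histograms (-O B)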
> and LLNL have one on sourceforge,
Last time I checked, it only supported Lustre 1.4, but it's been a while,
so I'm probably a bit behind.
> but I can certainly
> see the attraction of being able to monitor Lustre on your servers
> with the same tools you are using to monitor the health of the
> servers themselves.
Yes, that'd be a strong selling point.
> This could wind up becoming a lustre-devel@ discussion, but for now, it
> would be interesting to extend the interface(s) we use to
> introduce /proc (and what will soon be its replacement/augmentation)
> stats files so that they are automagically provided via SNMP.
That sounds like the way to proceed, indeed.
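For what it's worth, the current lustre-snmp module already plugs into
net-snmp as a shared-object plugin, so a subset of those /proc values is
reachable today. From memory (1.6-era paths, and oss01 is a placeholder
hostname, so treat this as a sketch rather than gospel), enabling it
looks roughly like this on a server node:

# /etc/snmp/snmpd.conf -- plugin path is the 1.6 default, adjust as needed
dlmod lustresnmp /usr/lib/lustre/snmp/lustresnmp.so

# restart snmpd, then from the monitoring host walk the private
# enterprises subtree; the Lustre objects should appear under the
# Lustre/CFS enterprise branch
snmpwalk -v 2c -c public oss01 enterprises

Extending that module (or whatever replaces /proc) to pick up every
stats file automatically is exactly what would make it generally useful.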
> You know, given the discussion in this thread:
> http://lists.lustre.org/pipermail/lustre-devel/2008-January/001475.html
> now would be a good time for the community (that perhaps might
> want to contribute) desiring SNMP access to get their foot in the
> door. Ideally, you get SNMP into the generic interface and then SNMP
> access to all current and future variables comes more or less free.
Oh, thanks for pointing this out. It looks like major underlying changes
are coming. I think I'll subscribe to the lustre-devel mailing list to
try to follow them.
> That all said, there are some /proc files which provide a copious
> amount of information, like brw_stats for instance. I don't know how
> well that sort of thing maps to SNMP, but having an SNMP manager
> watching something as useful as brw_stats for trends over time could
> be quite interesting.
Add some RRD graphs to keep track of historical variations, and you've
got the all-in-one Lustre monitoring tool we sysadmins have all been
waiting for. ;)
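Just to sketch what I mean (the OID below is a placeholder, not the real
Lustre one, and the DS layout is only an example): on the server side,
brw_stats lives under /proc/fs/lustre/obdfilter/*/brw_stats on the OSSes,
and once one of its buckets (or any other per-OST counter) is exported
via SNMP, feeding it into an RRD for trending is trivial:

# create an RRD: 60 s step, one day of 1-minute and one year of 1-hour averages
rrdtool create ost0007-writes.rrd --step 60 \
    DS:write_bytes:COUNTER:120:0:U \
    RRA:AVERAGE:0.5:1:1440 \
    RRA:AVERAGE:0.5:60:8760

# poll once a minute from cron; .1.3.6.1.4.1.X.Y.Z is a placeholder OID
VALUE=$(snmpget -v 2c -c public -Ovq oss01 .1.3.6.1.4.1.X.Y.Z)
rrdtool update ost0007-writes.rrd N:$VALUE

# graph the last 24 hours
rrdtool graph ost0007-writes.png --start -1d \
    DEF:w=ost0007-writes.rrd:write_bytes:AVERAGE \
    LINE2:w#0000ff:"OST0007 write bytes/s"

Wrap that in a bit of cron and a web page and you're most of the way there.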
Cheers,
--
Kilian