[Lustre-discuss] Lustre SNMP module
Andreas Dilger
adilger at sun.com
Fri Mar 21 18:24:19 PDT 2008
On Mar 21, 2008 16:28 -0400, Mark Seger wrote:
> > One question - is this "over readahead" still a problem? I know there
> > was a bug like this (anything over 8kB was considered to be sequential
> > and invoked readahead because it generated 3 consecutive pages of IO),
> > but I thought it had been fixed some time ago. There is a sanity.sh
> > test_101 that exercises random reads and checks that there are no
> > discarded pages.
>
> While I don't have any specific readahead needs, I believe there is
> still something not right. I also think the operations manual is
> misleading: it says readahead is triggered after the second
> sequential read, which one could interpret to mean that your second
> read invokes readahead, but it's really not until your third read.
Well, the _sequential_ part of that statement is important. The first
read is just a read. When the second read is done you can determine if
it is sequential or not, and likewise the third read will be the second
_sequential_ read. I suppose we could clarify that a bit in any case.
> Furthermore, 'read' sounds like a read call when in fact it
> really means - as you stated above - the 3rd page, not the 3rd call.
Note that my mention above was in the past tense. The readahead code
now makes decisions based on the sys_read sizes and not the individual
pages.
> And finally when you say this has been fixed, what exactly does that mean?
> does readahead work differently now?
The readahead detection code was fixed in 1.4.7 or so to make the
decisions based on sequential sys_read() requests, and does not decide
based on individual sequential pages being read. This means that the
readahead is done with a multiple of the sys_read() size, and isn't
confused by sequential pages within a single read.
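To make the counting above concrete ("the third read is the second _sequential_ read"), here is a rough sketch in Python. This is not the actual Lustre code - the class, the readahead multiplier, and the return convention are all illustrative - it just shows the detection logic: track where a sequential read would land next, count consecutive sequential sys_read() calls, and start readahead sized as a multiple of the read size once the second sequential read is seen.

```python
# Illustrative sketch of sys_read()-based sequential detection, NOT the
# real Lustre implementation. The first read establishes position, the
# second read can be classified as sequential, and the third read (the
# second *sequential* read) triggers readahead.

READAHEAD_MULTIPLE = 2  # hypothetical multiplier of the read size

class ReadaheadDetector:
    def __init__(self):
        self.next_offset = None    # where a sequential read would start
        self.sequential_reads = 0  # consecutive sequential reads seen

    def on_read(self, offset, size):
        """Return bytes of readahead to issue for this sys_read()."""
        if self.next_offset is not None and offset == self.next_offset:
            self.sequential_reads += 1
        else:
            self.sequential_reads = 0  # seek detected: reset the streak
        self.next_offset = offset + size
        # Readahead starts on the second *sequential* read (third read),
        # sized as a multiple of the sys_read() size, not per-page.
        if self.sequential_reads >= 2:
            return size * READAHEAD_MULTIPLE
        return 0

d = ReadaheadDetector()
print(d.on_read(0, 8192))      # first read: 0
print(d.on_read(8192, 8192))   # first sequential read: 0
print(d.on_read(16384, 8192))  # second sequential read: 16384
```

Because the decision is made on whole sys_read() requests, the three consecutive pages inside a single 8K+ read no longer count as a sequential pattern.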
In the very latest code (upcoming 1.6.5 only I think) there is also
strided readahead so that if a client is reading, say, 5x1MB every 100MB
(common HPC load) then the readahead will detect this and start readahead
of 5MB every 100MB instead of continuing linear readahead for 40MB,
detecting a seek, and then resetting the readahead.
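The stride detection can be sketched the same way (again illustrative, not the shipped code): record the start offset of each sequential run, and once the gaps between runs agree, prefetch the next run at the last start plus the stride instead of reading ahead linearly into the gap.

```python
# Illustrative sketch of strided-readahead detection for the
# 5x1MB-every-100MB HPC pattern described above. Names are made up.

def detect_stride(run_starts):
    """Given start offsets of recent sequential runs, return the
    constant stride between them, or None if no stride is detected."""
    if len(run_starts) < 3:
        return None
    gaps = [b - a for a, b in zip(run_starts, run_starts[1:])]
    # A stride is "detected" when all recent gaps agree.
    if all(g == gaps[0] for g in gaps):
        return gaps[0]
    return None

MB = 1 << 20
# Runs starting every 100MB, as in the example above.
starts = [0, 100 * MB, 200 * MB]
stride = detect_stride(starts)
if stride is not None:
    # Prefetch the next 5MB run here rather than reading 40MB linearly,
    # hitting a seek, and resetting the readahead window.
    next_prefetch = starts[-1] + stride
print(stride // MB)  # → 100
```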
> Anyhow, getting back to some of my experiments - these are on 1.6.4.3.
> First of all, I discovered the perl script that was doing the random
> reads was using the perl 'read' function rather than 'sysread', so
> there's some extra stuff happening behind the scenes that I'm not
> really sure about. However, it's causing a lot of readahead (or at
> least excess network traffic) and that puzzles me. Here's an example of
> doing 8K reads using perl's read function:
>
> [root at cag-dl145-172 disktests]# collectl -snl -OR -oT
> #          <----------Network----------><-------------Lustre Client-------------->
> #Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
> 16:33:41     141    148     26     138     69    276      0       0     0     61
> 16:33:42     296    307     52     261     70    280      0       0     2     64
> 16:33:43     311    323     54     275     78    312      0       0     0     64
> 16:33:44     310    321     54     276     73    292      0       0     0     63
> 16:33:45     306    316     53     266     63    252      0       0     0     61
> 16:33:46     301    311     53     267     76    304      0       0     0     68
>
> and you can clearly see the traffic on the network matches what lustre
> is delivering to the client. I also saw in the rpc stats that all the
> requests were for single pages when they should have been for 2.
Yes, pretty clearly this is a problem, and will go back to confusing
the readahead, but at this stage there isn't much the readahead can
do about it. Then again, it is uncommon for a program to go from
purely random reads to straight linear reads. We might hint to the
readahead that a random reader needs to do 4 or 5 sequential reads
before readahead is re-enabled, instead of just 2.
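That hinting idea could look something like this (purely a sketch; the thresholds and names are illustrative, not an existing tunable): once a seek is observed, raise the number of consecutive sequential reads required before readahead fires again.

```python
# Sketch of the hint suggested above: a reader that has been seeking
# randomly must prove a longer sequential streak (5 reads here) before
# readahead is re-enabled, instead of the usual 2. Illustrative only.

class HintedDetector:
    def __init__(self, base_threshold=2, random_threshold=5):
        self.base = base_threshold      # normal sequential threshold
        self.random = random_threshold  # threshold after random reads
        self.threshold = base_threshold
        self.next_offset = None
        self.seq = 0

    def on_read(self, offset, size):
        """Return True if readahead should be issued for this read."""
        if self.next_offset is not None and offset != self.next_offset:
            # Seek: this reader looks random, so demand a longer streak.
            self.seq = 0
            self.threshold = self.random
        elif self.next_offset is not None:
            self.seq += 1
        self.next_offset = offset + size
        if self.seq >= self.threshold:
            self.threshold = self.base  # proven sequential again
            return True
        return False

d = HintedDetector()
# Purely sequential: readahead on the 3rd read, as usual.
print([d.on_read(i * 4096, 4096) for i in range(3)])
```

A random reader that occasionally lands on an adjacent offset would then need several lucky offsets in a row before triggering any prefetch.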
> Anyhow when I changed my 'read' to 'sysread' things seem to get better
> so perhaps readahead indeed works differently now? If so does that mean
> the current definition is wrong? If so, what should it be? In any
> event, playing around a little I kind of stumbled on this one. I ran my
> perl script to do a single sysread, sleep a second and then do another.
> While I couldn't see it doing any unexpected network traffic for 12K
> requests, look what happens for 50K ones:
>
> #          <----------Network----------><-------------Lustre Client-------------->
> #Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
> 16:41:32      55     41      2      31      1     50      0       0    12      1
> 16:41:33      56     46      4      38      1     50      0       0    12      1
> 16:41:34      55     41      2      31      1     50      0       0    12      1
> 16:41:35      55     40      2      31      1     50      0       0    12      1
> 16:41:36    1122    766     30     408      1     50      0       0    12      1
> 16:41:37      55     41      2      31      1     50      0       0    12      1
> 16:41:38      55     40      2      31      1     50      0       0     0      1
> 16:41:39    1130    774     30     412      0      0      0       0    12      0
>
> If not readahead, lustre is certainly doing something funky over the
> wire...
How big is the file being read here? There is a new feature in the
readahead code that if the file size is < 2MB it will fetch the whole
file instead of just the small read, because the overhead of doing
3 read RPCs in order to detect sequential readahead is high compared
to the overhead of doing a larger read the first time. I don't know
if that is the case or not.
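The trade-off behind that feature is simple enough to sketch (names and the exact policy are illustrative, not the real code path): below the cutoff, one larger RPC for the whole file is cheaper than the three read RPCs it would otherwise take just to discover a sequential pattern.

```python
# Sketch of the small-file heuristic described above: files under the
# 2MB cutoff are fetched whole on the first read. Function name and
# large-file behavior are illustrative, not the actual implementation.

SMALL_FILE_CUTOFF = 2 << 20  # 2MB, per the description above

def readahead_bytes(file_size, read_offset, read_size):
    """How many bytes to actually request for a first read."""
    if file_size < SMALL_FILE_CUTOFF:
        # One RPC for the whole file beats three RPCs spent
        # detecting sequentiality on a file this small.
        return file_size
    return read_size  # large file: just the requested read, for now

print(readahead_bytes(50 * 1024, 0, 50 * 1024))  # small file: all of it
print(readahead_bytes(1 << 30, 0, 4096))         # large file: 4096
```

If the test file is under 2MB, that would explain a single 50K sysread pulling far more than 50K over the wire.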
> And finally, if I remove the sleep and just do a bunch of 50K
> reads here's what I see:
>
> #          <----------Network----------><-------------Lustre Client-------------->
> #Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
> 16:45:35    2952   2061     98    1121     49   2450      0       0   564     47
> 16:45:36    4744   3296    149    1745     40   2000      0       0   468     39
> 16:45:37    5158   3562    153    1884     46   2300      0       0   541     43
> 16:45:38    5816   4027    177    2129     47   2350      0       0   552     46
> 16:45:39    3601   2520    120    1356     52   2600      0       0   610     50
> 16:45:40    4897   3405    155    1808     51   2550      0       0
>
> On average it looks like 2-3 times more data is being sent over the
> network than the client is delivering. Any thoughts on what's going
> on in these cases?
One possibility (depending on file size and random number generator) is
that you are occasionally getting sequential random numbers and this is
triggering readahead. This would be easy to detect inside your test
program.
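The check would be a few lines: scan the offsets the test program generated and count how often two consecutive random reads happened to be back-to-back, since each such pair can look sequential to the client. A sketch (the helper name and parameters are made up for illustration):

```python
# Count adjacent pairs of offsets in a random-read trace that happen
# to form a sequential read (offset == previous offset + read size),
# since each such pair can trigger readahead. Illustrative helper.

import random

READ_SIZE = 50 * 1024  # 50K reads, as in the experiment above

def accidental_sequential(offsets, read_size=READ_SIZE):
    """Count adjacent offset pairs that look like a sequential read."""
    return sum(1 for a, b in zip(offsets, offsets[1:])
               if b == a + read_size)

# Example: block-aligned random reads over a small set of blocks make
# accidental adjacency reasonably likely.
random.seed(0)
offsets = [random.randrange(20) * READ_SIZE for _ in range(1000)]
print(accidental_sequential(offsets))
```

A nonzero count on the actual trace would confirm this explanation; with many blocks the count should drop toward zero.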
> In any event feel free to download collectl and check
> things out for yourself. I'll notify this list when that happens.
Yes, I've been meaning to take a look for a while now. It looks like
a very powerful, useful, and also usable tool.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.