[Lustre-discuss] Lustre SNMP module

Mark Seger Mark.Seger at hp.com
Fri Mar 21 13:28:31 PDT 2008


> This is a very interesting example, and I wish we had known about
> collectl a year ago before we invested time in writing data gathering
> scripts which aren't as useful as what you have here.
>   
I had mentioned collectl/lustre in a couple of places before but I guess 
I wasn't loud enough.  8-)
The important thing is I got your attention.
> One question - is this "over readahead" still a problem?  I know there
> was a bug like this (anything over 8kB was considered to be sequential
> and invoked readahead because it generated 3 consecutive pages of IO),
> but I thought it had been fixed some time ago.  There is a sanity.sh
> test_101 that exercises random reads and checks that there are no
> discarded pages.
>   
Actually, as we speak I'm getting ready to release a new version of 
collectl (stay tuned).
While I don't have any specific readahead needs, I believe there is 
still something not right.  I also think the operations manual is 
misleading: it says readahead is triggered after the second sequential 
read, and one could interpret that to mean your second read invokes 
readahead, but it's really not until your third.  Furthermore, 'read' 
sounds like a read call when in fact it really means - as you stated 
above - the 3rd page, not the 3rd call.  And finally, when you say this 
has been fixed, what exactly does that mean?  Does readahead work 
differently now?
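
To make the page arithmetic concrete, here's a toy sketch (my own, not 
Lustre code) of the "consecutive pages" rule as I understand it from 
your description, assuming 4K pages:

```python
PAGE = 4096  # assumed client page size

def pages_touched(offset, length):
    """Number of pages a read of `length` bytes at `offset` covers."""
    first = offset // PAGE
    last = (offset + length - 1) // PAGE
    return last - first + 1

# An aligned 8K read covers exactly 2 pages, but a 9K read (or an
# unaligned 8K one) covers 3 -- which, per the old heuristic described
# above, was enough to look "sequential" and trigger readahead.
print(pages_touched(0, 8192))    # 2
print(pages_touched(0, 9216))    # 3
print(pages_touched(100, 8192))  # 3
```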

Anyhow, getting back to some of my experiments - these are all on 
1.6.4.3.  First of all, I discovered my perl script doing the random 
reads was using perl's 'read' function rather than 'sysread', so 
there's something extra happening behind the scenes that I'm not really 
sure about.  Whatever it is, it's causing a lot of readahead (or at 
least excess network traffic), and that puzzles me.  Here's an example 
of doing 8K reads using perl's 'read' function:

[root at cag-dl145-172 disktests]# collectl -snl -OR -oT
#         <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite   Hits Misses
16:33:41     141    148     26     138     69    276      0       0      0     61
16:33:42     296    307     52     261     70    280      0       0      2     64
16:33:43     311    323     54     275     78    312      0       0      0     64
16:33:44     310    321     54     276     73    292      0       0      0     63
16:33:45     306    316     53     266     63    252      0       0      0     61
16:33:46     301    311     53     267     76    304      0       0      0     68

You can clearly see that the traffic on the network matches what lustre 
is delivering to the client.  I also saw in the rpc stats that all the 
requests were for single pages when they should have been for 2.  But 
now look what happens when I go to 9K reads:

#         <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite   Hits Misses
16:34:42   13017   8887    349    4597     39    156      0       0      0     48
16:34:43   15310  10443    418    5544     65    260      0       0      0     69
16:34:44   18801  12839    501    6601     58    232      0       0      0     62
16:34:45   19436  13263    522    6926     24     96      0       0      0     32

This is clearly generating a lot more network traffic than the client's 
data rate.  Perhaps someone more familiar with the subtleties of perl's 
'read' function will know why.
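
For what it's worth, perl's 'read' goes through the PerlIO/stdio 
buffering layer while 'sysread' issues the request directly, so a small 
buffered read can pull a much larger chunk behind the scenes.  Here's a 
Python analogue (not the actual perl internals - the wrapper class and 
buffer size are just illustrative) that shows the effect:

```python
import io
import os
import tempfile

class CountingRaw(io.RawIOBase):
    """Raw file wrapper that counts bytes actually fetched from the
    underlying file -- a stand-in for what goes over the wire."""
    def __init__(self, fd):
        self.fd = fd
        self.fetched = 0
    def readable(self):
        return True
    def seekable(self):
        return True
    def seek(self, pos, whence=os.SEEK_SET):
        return os.lseek(self.fd, pos, whence)
    def readinto(self, b):
        data = os.read(self.fd, len(b))
        b[:len(data)] = data
        self.fetched += len(data)
        return len(data)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (1 << 20))   # 1 MB test file
    path = f.name

# Buffered path (analogous to perl's read): ask for 8K, but the
# buffering layer fills its whole buffer from the file.
fd = os.open(path, os.O_RDONLY)
buffered = io.BufferedReader(CountingRaw(fd), buffer_size=128 * 1024)
data = buffered.read(8192)
fetched_buffered = buffered.raw.fetched   # far more than the 8K asked for

# Unbuffered path (analogous to sysread): exactly what was asked for.
fd2 = os.open(path, os.O_RDONLY)
fetched_raw = len(os.pread(fd2, 8192, 0))  # 8192

os.close(fd2)
buffered.close()
os.unlink(path)
```

The same thing at the filesystem level would make an 8K "request" show 
up as much more data in flight, which is consistent with what the first 
two tables show.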

Anyhow, when I changed my 'read' to 'sysread' things seemed to get 
better, so perhaps readahead indeed works differently now?  If so, does 
that mean the current definition is wrong, and what should it be?  In 
any event, playing around a little I kind of stumbled on this one: I 
ran my perl script to do a single sysread, sleep a second, and then do 
another.  While I couldn't see any unexpected network traffic for 12K 
requests, look what happens for 50K ones:

#         <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite   Hits Misses
16:41:32      55     41      2      31      1     50      0       0     12      1
16:41:33      56     46      4      38      1     50      0       0     12      1
16:41:34      55     41      2      31      1     50      0       0     12      1
16:41:35      55     40      2      31      1     50      0       0     12      1
16:41:36    1122    766     30     408      1     50      0       0     12      1
16:41:37      55     41      2      31      1     50      0       0     12      1
16:41:38      55     40      2      31      1     50      0       0      0      1
16:41:39    1130    774     30     412      0      0      0       0     12      0

If not readahead, lustre is certainly doing something funky over the 
wire.  And finally, if I remove the sleep and just do a bunch of 
back-to-back 50K reads, here's what I see:

#         <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite   Hits Misses
16:45:35    2952   2061     98    1121     49   2450      0       0    564     47
16:45:36    4744   3296    149    1745     40   2000      0       0    468     39
16:45:37    5158   3562    153    1884     46   2300      0       0    541     43
16:45:38    5816   4027    177    2129     47   2350      0       0    552     46
16:45:39    3601   2520    120    1356     52   2600      0       0    610     50
16:45:40    4897   3405    155    1808     51   2550      0       0    564     47
16:45:41    5862   4061    178    2134     49   2450      0       0    588     49
16:45:42    4799   3336    151    1763     52   2600      0       0    588     49
16:45:43    5864   4067    179    2139     52   2600      0       0    573     48
16:45:44    4836   3362    153    1799     38   1900      0       0    444     37
16:45:45    4199   2913    130    1550     55   2750      0       0    587     47
16:45:46    6938   4789    204    2498     53   2650      0       0    600     50
16:45:47    4854   3373    153    1789     46   2300      0       0    494     38

On average it looks like 2-3 times more data is being sent over the 
network than the client is delivering.  Any thoughts on what's going on 
in these cases?  In any event, feel free to download collectl and check 
things out for yourself - I'll notify this list when the new version is 
released.
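
Just to put a number on that "2-3 times": summing the netKBi and KBRead 
columns from the table above gives almost exactly 2x:

```python
# netKBi and lustre KBRead columns, copied from the 50K-read run above
net_kb  = [2952, 4744, 5158, 5816, 3601, 4897, 5862, 4799, 5864,
           4836, 4199, 6938, 4854]
read_kb = [2450, 2000, 2300, 2350, 2600, 2550, 2450, 2600, 2600,
           1900, 2750, 2650, 2300]

ratio = sum(net_kb) / sum(read_kb)
print(f"network in / client delivered = {ratio:.2f}x")  # prints 2.05x
```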

sorry for the long reply...

-mark

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>



