[Lustre-discuss] Lustre SNMP module
Mark Seger
Mark.Seger at hp.com
Fri Mar 21 13:28:31 PDT 2008
> This is a very interesting example, and I wish we had known about
> collectl a year ago before we invested time in writing data gathering
> scripts which aren't as useful as what you have here.
>
I had mentioned collectl/lustre in a couple of places before, but I guess
I wasn't loud enough. 8-)
The important thing is I got your attention.
> One question - is this "over readahead" still a problem? I know there
> was a bug like this (anything over 8kB was considered to be sequential
> and invoked readahead because it generated 3 consecutive pages of IO),
> but I thought it had been fixed some time ago. There is a sanity.sh
> test_101 that exercises random reads and checks that there are no
> discarded pages.
>
Actually, as we speak I'm getting ready to release a new version of
collectl (stay tuned).
While I don't have any specific readahead needs, I believe there is
still something not right. I also think the operations manual is
misleading: it says readahead is triggered after the second sequential
read, and one could interpret that to mean your second read invokes
readahead, but it's really not until your third. Furthermore, 'read'
sounds like a read call when in fact it really means - as you stated
above - the 3rd page, not the 3rd call. And finally, when you say this
has been fixed, what exactly does that mean? Does readahead work
differently now?
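To make the pages-vs-calls distinction concrete, here's a little sketch (my own illustration, not Lustre code) of how a single read's byte range maps onto 4KB pages. It shows why the old heuristic keyed on page count rather than call count: an unaligned 8K read already touches 3 consecutive pages, and a 9K read always does.

```python
# Illustration only: how one read() call maps onto 4 KB page-cache pages.
# The old heuristic treated "3 consecutive pages" as sequential, so the
# trigger depends on pages touched, not on the number of read() calls.
PAGE = 4096

def pages_touched(offset, size):
    """Return the page indices a read of `size` bytes at `offset` covers."""
    first = offset // PAGE
    last = (offset + size - 1) // PAGE
    return list(range(first, last + 1))

print(pages_touched(0, 8192))    # aligned 8K read touches 2 pages
print(pages_touched(100, 8192))  # unaligned 8K read touches 3 pages
print(pages_touched(0, 9216))    # a 9K read always touches >= 3 pages
```

So a "random" workload issuing 9K reads would trip a 3-consecutive-page test on every single call, which is consistent with the 9K numbers below.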
Anyhow, getting back to some of my experiments - these are on 1.6.4.3.
First of all, I discovered that my perl script doing the random reads
was using the perl 'read' function rather than 'sysread', so there's
some extra buffering happening behind the scenes that I'm not really
sure about. Whatever it is, it's causing a lot of readahead (or at
least excess network traffic) and that puzzles me. Here's an example of
doing 8K reads using perl's read function:
[root at cag-dl145-172 disktests]# collectl -snl -OR -oT
# <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out Reads KBRead Writes KBWrite Hits Misses
16:33:41     141    148     26     138    69    276      0       0    0     61
16:33:42     296    307     52     261    70    280      0       0    2     64
16:33:43     311    323     54     275    78    312      0       0    0     64
16:33:44     310    321     54     276    73    292      0       0    0     63
16:33:45     306    316     53     266    63    252      0       0    0     61
16:33:46     301    311     53     267    76    304      0       0    0     68
And you can clearly see that the traffic on the network matches what
Lustre is delivering to the client. I also saw in the rpc stats that
all the requests were for single pages when they should have been for
2. But now look what happens when I go to 9K:
# <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out Reads KBRead Writes KBWrite Hits Misses
16:34:42   13017   8887    349    4597    39    156      0       0    0     48
16:34:43   15310  10443    418    5544    65    260      0       0    0     69
16:34:44   18801  12839    501    6601    58    232      0       0    0     62
16:34:45   19436  13263    522    6926    24     96      0       0    0     32
This is clearly generating a lot of network traffic compared to the
client's data rate. Perhaps someone who is more familiar with the
subtleties of the perl 'read' function will know what's going on.
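For anyone who wants to see the buffered-vs-unbuffered difference without wading through my perl, here's a hedged analogue in Python (the perl internals may well differ; this only illustrates the general effect). A buffered read goes through the runtime's I/O layer, which is free to fetch more from the kernel than the 8K you asked for, so "random" reads can look bigger and more sequential at the filesystem layer; an unbuffered read issues the syscall for exactly the requested count, like sysread:

```python
# Hedged analogue of perl read vs. sysread, using Python's buffered and
# unbuffered file APIs. Both return the same bytes; the difference is
# what the I/O layer may fetch from the kernel underneath.
import os
import tempfile

data = os.urandom(1 << 20)  # 1 MB scratch file
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    path = f.name

# Buffered, like perl's read: the library layer may read ahead past the
# 8 KB requested, which is one plausible source of the extra traffic.
with open(path, "rb") as f:
    f.seek(123456)
    buffered = f.read(8192)

# Unbuffered, like perl's sysread: the syscall asks for exactly 8 KB.
fd = os.open(path, os.O_RDONLY)
os.lseek(fd, 123456, os.SEEK_SET)
unbuffered = os.read(fd, 8192)
os.close(fd)
os.unlink(path)

assert buffered == unbuffered == data[123456:123456 + 8192]
```

Either way the application sees identical data, which is exactly why this kind of thing only shows up when you watch the wire.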
Anyhow, when I changed my 'read' to 'sysread' things seemed to get
better, so perhaps readahead indeed works differently now? If so, does
that mean the current definition in the manual is wrong, and what
should it be? In any event, playing around a little I kind of stumbled
on this one: I ran my perl script to do a single sysread, sleep a
second, and then do another. While I couldn't see any unexpected
network traffic for 12K requests, look what happens for 50K ones:
# <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out Reads KBRead Writes KBWrite Hits Misses
16:41:32      55     41      2      31     1     50      0       0   12      1
16:41:33      56     46      4      38     1     50      0       0   12      1
16:41:34      55     41      2      31     1     50      0       0   12      1
16:41:35      55     40      2      31     1     50      0       0   12      1
16:41:36    1122    766     30     408     1     50      0       0   12      1
16:41:37      55     41      2      31     1     50      0       0   12      1
16:41:38      55     40      2      31     1     50      0       0    0      1
16:41:39    1130    774     30     412     0      0      0       0   12      0
If not readahead, Lustre is certainly doing something funky over the
wire... And finally, if I remove the sleep and just do a bunch of
back-to-back 50K reads, here's what I see:
# <----------Network----------><-------------Lustre Client-------------->
#Time     netKBi pkt-in netKBo pkt-out Reads KBRead Writes KBWrite Hits Misses
16:45:35    2952   2061     98    1121    49   2450      0       0  564     47
16:45:36    4744   3296    149    1745    40   2000      0       0  468     39
16:45:37    5158   3562    153    1884    46   2300      0       0  541     43
16:45:38    5816   4027    177    2129    47   2350      0       0  552     46
16:45:39    3601   2520    120    1356    52   2600      0       0  610     50
16:45:40    4897   3405    155    1808    51   2550      0       0  564     47
16:45:41    5862   4061    178    2134    49   2450      0       0  588     49
16:45:42    4799   3336    151    1763    52   2600      0       0  588     49
16:45:43    5864   4067    179    2139    52   2600      0       0  573     48
16:45:44    4836   3362    153    1799    38   1900      0       0  444     37
16:45:45    4199   2913    130    1550    55   2750      0       0  587     47
16:45:46    6938   4789    204    2498    53   2650      0       0  600     50
16:45:47    4854   3373    153    1789    46   2300      0       0  494     38
On average it looks like 2-3 times more data is being sent over the
network than the client is delivering. Any thoughts on what's going on
in these cases? In any event, feel free to download collectl and check
things out for yourself; I'll notify this list when the new version is
released.
sorry for the long reply...
-mark
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>