[Lustre-discuss] Lustre SNMP module
Andreas Dilger
adilger at sun.com
Fri Mar 21 18:24:19 PDT 2008
On Mar 21, 2008 16:28 -0400, Mark Seger wrote:
> > One question - is this "over readahead" still a problem? I know there
> > was a bug like this (anything over 8kB was considered to be sequential
> > and invoked readahead because it generated 3 consecutive pages of IO),
> > but I thought it had been fixed some time ago. There is a sanity.sh
> > test_101 that exercises random reads and checks that there are no
> > discarded pages.
>
> While I don't have any specific readahead needs, I believe there is
> still something not right. I also think the operations manual is
> misleading: it says readahead is triggered after the second
> sequential read, which one could interpret to mean that your second
> read invokes readahead, but it's really not until your third read.
Well, the _sequential_ part of that statement is important. The first
read is just a read. When the second read is done you can determine if
it is sequential or not, and likewise the third read will be the second
_sequential_ read. I suppose we could clarify that a bit in any case.
> Furthermore, 'read' sounds like a read call when in fact it
> really means - as you stated above - the 3rd page, not the 3rd call.
Note that my mention above was in the past tense. The readahead code
now makes decisions based on the sys_read sizes and not the individual
pages.
> And finally when you say this has been fixed, what exactly does that mean?
> does readahead work differently now?
The readahead detection code was fixed in 1.4.7 or so to make the
decisions based on sequential sys_read() requests, and does not decide
based on individual sequential pages being read. This means that the
readahead is done with a multiple of the sys_read() size, and isn't
confused by sequential pages within a single read.
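To make the counting above concrete ("the third read is the second _sequential_ read"), here is a rough sketch in Python. This is not the actual Lustre code - the class, the readahead multiplier, and the return convention are all illustrative - it just shows the detection logic: track where a sequential read would land next, count consecutive sequential sys_read() calls, and start readahead sized as a multiple of the read size once the second sequential read is seen.

```python
# Illustrative sketch of sys_read()-based sequential detection, NOT the
# real Lustre implementation. The first read establishes position, the
# second read can be classified as sequential, and the third read (the
# second *sequential* read) triggers readahead.

READAHEAD_MULTIPLE = 2  # hypothetical multiplier of the read size

class ReadaheadDetector:
    def __init__(self):
        self.next_offset = None    # where a sequential read would start
        self.sequential_reads = 0  # consecutive sequential reads seen

    def on_read(self, offset, size):
        """Return bytes of readahead to issue for this sys_read()."""
        if self.next_offset is not None and offset == self.next_offset:
            self.sequential_reads += 1
        else:
            self.sequential_reads = 0  # seek detected: reset the streak
        self.next_offset = offset + size
        # Readahead starts on the second *sequential* read (third read),
        # sized as a multiple of the sys_read() size, not per-page.
        if self.sequential_reads >= 2:
            return size * READAHEAD_MULTIPLE
        return 0

d = ReadaheadDetector()
print(d.on_read(0, 8192))      # first read: 0
print(d.on_read(8192, 8192))   # first sequential read: 0
print(d.on_read(16384, 8192))  # second sequential read: 16384
```

Because the decision is made on whole sys_read() requests, the three consecutive pages inside a single 8K+ read no longer count as a sequential pattern.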
In the very latest code (upcoming 1.6.5 only I think) there is also
strided readahead so that if a client is reading, say, 5x1MB every 100MB
(common HPC load) then the readahead will detect this and start readahead
of 5MB every 100MB instead of continuing linear readahead for 40MB,
detecting a seek, and then resetting the readahead.
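The stride detection can be sketched the same way (again illustrative, not the shipped code): record the start offset of each sequential run, and once the gaps between runs agree, prefetch the next run at the last start plus the stride instead of reading ahead linearly into the gap.

```python
# Illustrative sketch of strided-readahead detection for the
# 5x1MB-every-100MB HPC pattern described above. Names are made up.

def detect_stride(run_starts):
    """Given start offsets of recent sequential runs, return the
    constant stride between them, or None if no stride is detected."""
    if len(run_starts) < 3:
        return None
    gaps = [b - a for a, b in zip(run_starts, run_starts[1:])]
    # A stride is "detected" when all recent gaps agree.
    if all(g == gaps[0] for g in gaps):
        return gaps[0]
    return None

MB = 1 << 20
# Runs starting every 100MB, as in the example above.
starts = [0, 100 * MB, 200 * MB]
stride = detect_stride(starts)
if stride is not None:
    # Prefetch the next 5MB run here rather than reading 40MB linearly,
    # hitting a seek, and resetting the readahead window.
    next_prefetch = starts[-1] + stride
print(stride // MB)  # → 100
```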
> Anyhow, getting back to some of my experiments - these are on 1.6.4.3.
> First of all, I discovered the perl script that was doing the random
> reads was using the perl 'read' function rather than 'sysread', so
> there's some extra stuff happening behind the scenes that I'm not
> really sure about. However, it's causing a lot of readahead (or at
> least excess network traffic) and that puzzles me. Here's an example of
> doing 8K reads using perl's read function:
>
> [root at cag-dl145-172 disktests]# collectl -snl -OR -oT
> #          <----------Network----------><-------------Lustre Client-------------->
> #Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
> 16:33:41     141    148     26     138     69    276      0       0     0     61
> 16:33:42     296    307     52     261     70    280      0       0     2     64
> 16:33:43     311    323     54     275     78    312      0       0     0     64
> 16:33:44     310    321     54     276     73    292      0       0     0     63
> 16:33:45     306    316     53     266     63    252      0       0     0     61
> 16:33:46     301    311     53     267     76    304      0       0     0     68
>
> and you can clearly see the traffic on the network matches what lustre
> is delivering to the client. I also saw in the rpc stats that all the
> requests were for single pages when they should have been for 2.
Yes, pretty clearly this is a problem, and will go back to confusing
the readahead, but at this stage there isn't much the readahead can
do about it. Then again, it is uncommon for a program to go from
purely random reads to straight linear reads. We might hint to the
readahead that a random reader needs to do 4 or 5 sequential reads
before readahead is re-enabled, instead of just 2.
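That hinting idea could look something like this (purely a sketch; the thresholds and names are illustrative, not an existing tunable): once a seek is observed, raise the number of consecutive sequential reads required before readahead fires again.

```python
# Sketch of the hint suggested above: a reader that has been seeking
# randomly must prove a longer sequential streak (5 reads here) before
# readahead is re-enabled, instead of the usual 2. Illustrative only.

class HintedDetector:
    def __init__(self, base_threshold=2, random_threshold=5):
        self.base = base_threshold      # normal sequential threshold
        self.random = random_threshold  # threshold after random reads
        self.threshold = base_threshold
        self.next_offset = None
        self.seq = 0

    def on_read(self, offset, size):
        """Return True if readahead should be issued for this read."""
        if self.next_offset is not None and offset != self.next_offset:
            # Seek: this reader looks random, so demand a longer streak.
            self.seq = 0
            self.threshold = self.random
        elif self.next_offset is not None:
            self.seq += 1
        self.next_offset = offset + size
        if self.seq >= self.threshold:
            self.threshold = self.base  # proven sequential again
            return True
        return False

d = HintedDetector()
# Purely sequential: readahead on the 3rd read, as usual.
print([d.on_read(i * 4096, 4096) for i in range(3)])
```

A random reader that occasionally lands on an adjacent offset would then need several lucky offsets in a row before triggering any prefetch.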
> Anyhow when I changed my 'read' to 'sysread' things seem to get better
> so perhaps readahead indeed works differently now? If so does that mean
> the current definition is wrong? If so, what should it be? In any
> event, playing around a little I kind of stumbled on this one. I ran my
> perl script to do a single sysread, sleep a second and then do another.
> While I couldn't see it doing any unexpected network traffic for 12K
> requests, look what happens for 50K ones:
>
> #          <----------Network----------><-------------Lustre Client-------------->
> #Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
> 16:41:32      55     41      2      31      1     50      0       0    12      1
> 16:41:33      56     46      4      38      1     50      0       0    12      1
> 16:41:34      55     41      2      31      1     50      0       0    12      1
> 16:41:35      55     40      2      31      1     50      0       0    12      1
> 16:41:36    1122    766     30     408      1     50      0       0    12      1
> 16:41:37      55     41      2      31      1     50      0       0    12      1
> 16:41:38      55     40      2      31      1     50      0       0     0      1
> 16:41:39    1130    774     30     412      0      0      0       0    12      0
>
> If not readahead, lustre is certainly doing something funky over the
> wire...
How big is the file being read here? There is a new feature in the
readahead code that if the file size is < 2MB it will fetch the whole
file instead of just the small read, because the overhead of doing
3 read RPCs in order to detect sequential readahead is high compared
to the overhead of doing a larger read the first time. I don't know
if that is the case or not.
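The trade-off behind that feature is simple enough to sketch (names and the exact policy are illustrative, not the real code path): below the cutoff, one larger RPC for the whole file is cheaper than the three read RPCs it would otherwise take just to discover a sequential pattern.

```python
# Sketch of the small-file heuristic described above: files under the
# 2MB cutoff are fetched whole on the first read. Function name and
# large-file behavior are illustrative, not the actual implementation.

SMALL_FILE_CUTOFF = 2 << 20  # 2MB, per the description above

def readahead_bytes(file_size, read_offset, read_size):
    """How many bytes to actually request for a first read."""
    if file_size < SMALL_FILE_CUTOFF:
        # One RPC for the whole file beats three RPCs spent
        # detecting sequentiality on a file this small.
        return file_size
    return read_size  # large file: just the requested read, for now

print(readahead_bytes(50 * 1024, 0, 50 * 1024))  # small file: all of it
print(readahead_bytes(1 << 30, 0, 4096))         # large file: 4096
```

If the test file is under 2MB, that would explain a single 50K sysread pulling far more than 50K over the wire.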
> And finally, if I remove the sleep and just do a bunch of 50K
> reads here's what I see:
>
> #          <----------Network----------><-------------Lustre Client-------------->
> #Time     netKBi pkt-in netKBo pkt-out  Reads KBRead Writes KBWrite  Hits Misses
> 16:45:35    2952   2061     98    1121     49   2450      0       0   564     47
> 16:45:36    4744   3296    149    1745     40   2000      0       0   468     39
> 16:45:37    5158   3562    153    1884     46   2300      0       0   541     43
> 16:45:38    5816   4027    177    2129     47   2350      0       0   552     46
> 16:45:39    3601   2520    120    1356     52   2600      0       0   610     50
> 16:45:40    4897   3405    155    1808     51   2550      0       0
>
> On average it looks like 2-3 times more data is being sent over the
> network than the client is delivering. Any thoughts on what's going
> on in these cases?
One possibility (depending on file size and random number generator) is
that you are occasionally getting sequential random numbers and this is
triggering readahead. This would be easy to detect inside your test
program.
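The check would be a few lines: scan the offsets the test program generated and count how often two consecutive random reads happened to be back-to-back, since each such pair can look sequential to the client. A sketch (the helper name and parameters are made up for illustration):

```python
# Count adjacent pairs of offsets in a random-read trace that happen
# to form a sequential read (offset == previous offset + read size),
# since each such pair can trigger readahead. Illustrative helper.

import random

READ_SIZE = 50 * 1024  # 50K reads, as in the experiment above

def accidental_sequential(offsets, read_size=READ_SIZE):
    """Count adjacent offset pairs that look like a sequential read."""
    return sum(1 for a, b in zip(offsets, offsets[1:])
               if b == a + read_size)

# Example: block-aligned random reads over a small set of blocks make
# accidental adjacency reasonably likely.
random.seed(0)
offsets = [random.randrange(20) * READ_SIZE for _ in range(1000)]
print(accidental_sequential(offsets))
```

A nonzero count on the actual trace would confirm this explanation; with many blocks the count should drop toward zero.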
> In any event feel free to download collectl and check
> things out for yourself. I'll notify this list when that happens.
Yes, I've been meaning to take a look for a while now. It looks like
a very powerful, useful, and also usable tool.
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.