<div>Hi Andreas -</div>

<div> </div>

<div>Thanks for the advice.</div>

<div>I will gather additional CPU stats and see what shows up.</div>

<div> </div>

<div>However, CPU does not seem to be a factor</div>

<div>in the slower than expected large file buffered I/O reads.</div>

<div> </div>

<div>My machines have dual quad-core 2.66 GHz processors,</div>

<div>and gross CPU usage hovers around 50%</div>

<div>when I'm running 16 "dd" read jobs.</div>

<div> </div>

<div>But a suspected client-side caching problem</div>

<div>crops up</div>

<div>when I just run a simple "dd" read job twice.</div>

<div> </div>

<div>The first time I run the single "dd" read job </div>

<div>I get an expected throughput of 60-megabytes-per-second or so.</div>

<div>However, </div>

<div>the second time I run the job,</div>

<div>I get a throughput of about 2-gigabytes-per-second,</div>

<div>which is </div>

<div>twice the top speed of my 10Gb NIC,</div>

<div>and only possible, I think,</div>

<div>if the entire file was cached on the client </div>

<div>when the first "dd" job was run.</div>
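<div> </div>

<div>As a rough sanity check on these figures, here is a small Python sketch. It assumes decimal units and ignores protocol overhead; the variable names are only illustrative, and the 1000 megabytes-per-second figure is the direct-I/O aggregate reported earlier in this thread.</div>

<pre>
```python
# Assumption: decimal units (1 Gb = 1000 Mb, 8 bits per byte), no protocol overhead.
NIC_GBITS_PER_S = 10

# Theoretical line rate of the NIC in megabytes per second.
nic_mb_per_s = NIC_GBITS_PER_S * 1000 / 8

# Figures quoted in this thread (approximate).
direct_io_mb_per_s = 1000    # best aggregate seen with direct I/O
second_run_mb_per_s = 2000   # second buffered "dd" run of the same file

# A ratio above 1.0 means the data cannot all have crossed the network.
ratio_vs_line_rate = second_run_mb_per_s / nic_mb_per_s
ratio_vs_direct_io = second_run_mb_per_s / direct_io_mb_per_s

print(f"NIC line rate: {nic_mb_per_s:.0f} MB/s")
print(f"second run vs line rate: {ratio_vs_line_rate:.1f}x")
print(f"second run vs direct I/O: {ratio_vs_direct_io:.1f}x")
```
</pre>

<div>Either way the second run is faster than the wire allows, so it must have been served from client memory.</div>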

<div> </div>

<div>So, if I run 16 "dd" jobs, </div>

<div>each trying to cache entire large files on the client,</div>

<div>that could explain the unexpectedly slow aggregate throughput.</div>

<div> </div>

<div>I would have thought that setting a low value for "max_cached_mb" </div>

<div>would have solved this problem,</div>

<div>but it made no difference.<br> </div>

<div>A further indication that client-side caching is at the root of the slowdown</div>

<div>is that </div>

<div>when I run my single "dd" job twice,</div>

<div>but drop the client-side cache after the first run</div>

<div>(via "/proc/sys/vm/drop_caches"),</div>

<div>I get an expected 60-megabytes-per-second or so throughput for both runs.<br> </div>
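<div>The same two-pass test can be sketched in Python instead of "dd". This is only an illustration: the function names are made up here, writing "1" (page cache only) to drop_caches is an assumption about which value is sufficient, and dropping caches needs root.</div>

<pre>
```python
import os
import time


def drop_page_cache(value="1"):
    """Ask the Linux kernel to drop clean cached pages (needs root).

    Assumption: "1" (page cache only) is enough for this test.
    """
    os.sync()  # flush dirty pages first so they become droppable
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write(value + "\n")


def read_throughput(path, block_size=1024 * 1024):
    """Read the whole file sequentially; return (bytes_read, MB_per_second)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    # Guard against a zero interval on very small files.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total, total / elapsed / 1e6
```
</pre>

<div>Calling read_throughput() twice on the same large file, once with drop_page_cache() in between and once without, should show whether the fast second pass depends on the client cache.</div>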

<div>Until I learn how to overcome this slowdown problem,</div>

<div>I'll see if I can obtain my required </div>

<div>concurrent read speed across multiple large files</div>

<div>by carefully striping the files over a few boxes.</div>

<div> </div>

<div>Again,</div>

<div>thanks for your help,</div>

<div>and</div>

<div>I'll appreciate any other suggestions you might have,</div>

<div>or</div>

<div>any ideas for other diagnostics we might run.</div>

<div> </div>

<div>Rick</div>

<div> </div>

<div><span class="gmail_quote">On 8/4/09, <b class="gmail_sendername">Andreas Dilger</b> &lt;<a href="mailto:adilger@sun.com">adilger@sun.com</a>&gt; wrote:</span>

<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">On Aug 04, 2009  10:30 -0400, Rick Rothstein wrote:<br>> I'm new to Lustre (v1.8.0.1), and I've verified that<br>

> I can get about 1000-megabytes-per-second aggregate throughput<br>> for large file sequential reads using direct-I/O.<br>> (only limited by the speed of my 10gb NIC with TCP offload engine).<br>><br>> the above direct-I/O "dd" tests achieve about a 1000-megabyte-per-second<br>

> aggregate throughput, but when I try the same tests with normal buffered<br>> I/O, (by just running "dd" without "iflag=direct"), the runs<br>> only get about a 550-megabyte-per-second aggregate throughput.<br>

><br>> I suspect that this slowdown may have something to do with<br>> client-side-caching, but normal buffered reads have not speeded up,<br>> even after I've tried such adjustments as:<br><br>Note that there is a significant CPU overhead on the client when using<br>

buffered IO, simply due to CPU usage from copying the data between<br>userspace and the kernel.  Having multiple cores on the client (one<br>per dd process) allows distributing this copy overhead between cores.<br><br>You could also run "oprofile" to see if there is anything else of<br>

interest that is consuming a lot of CPU.<br><br>Cheers, Andreas<br>--<br>Andreas Dilger<br>Sr. Staff Engineer, Lustre Group<br>Sun Microsystems of Canada, Inc.<br><br></blockquote></div><br>