<div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="im">> I was told by a colleague that there were currently too many internal locks in the clients to sustain a big throughput. Lustre is designed for global throughput on many clients, but not on individual clients.<br>

</div></blockquote><div><br></div><div>The LNet SMP scaling fixes/enhancements should help but I don't believe they are coming until 2.1.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="im">

> I can observe this on my site, where I have enough storage and servers to reach 21GB/s globally, but am unable to get more than 300MB/s on a single client even though the DDR IB network would sustain +800MB/s ...<br>

</div></blockquote><div><br></div><div>You probably need to disable checksums, and a DDR IB link should be able to sustain 1.5 GB/s. I've seen close to that rate with LNet self tests, but I don't usually see it in normal operation with the file system added on top.</div>

<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div class="im">

<br>

</div>There must be something wrong with your configuration or the code has some bug, because we have had single clients doing 2GB/s in the past.  What version of Lustre did you test on?<br></blockquote><div><br></div><div>

I've never seen as high as 2 GB/s from a single client, but I've only been focused on single-threaded IO.  For that I've seen between 1.3 and 1.4 GB/s peak.  </div><div><br></div><div>I spent a little time trying to figure out what the bottleneck was with SystemTap, but I only looked at the read case.  It looked like the per-page locking penalty can be high.  Monitoring each ll_readpage call, I was seeing a median of 2.4 us for the read scenario, while the mode was only 0.5 us.  IIRC it was the llap locking that accounted for most of the ll_readpage time.  I didn't look at the penalty for rebalancing the cache between the various CPUs.  </div>

<div><br></div><div>Using those numbers:</div><div><span class="Apple-style-span" style="font-family: arial, sans-serif; font-size: 13px; border-collapse: collapse; ">>>> ((1/.000002406) * 4096)/2**20<br>1623.5453034081463</span></div>

<div><span class="Apple-style-span" style="font-family: arial, sans-serif; font-size: 13px; border-collapse: collapse; "><br></span></div><div><span class="Apple-style-span" style="font-family: arial, sans-serif; font-size: 13px; border-collapse: collapse; ">That gives me a best-case scenario of ~1.6 GB/s.  I thought about working on the read case, but realized the effort probably wasn't worth putting into 1.8 and I would have to wait until 2.0 to test more.  Unfortunately I haven't had the time yet to look at 2.0+.</span></div>
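<div><br></div><div>For anyone who wants to redo that arithmetic with their own timings, here is a rough sketch of the same calculation. The function name is made up for illustration, and it assumes 4096-byte pages handled strictly one at a time by a single thread, with the per-page ll_readpage time as the only cost. The result is in MiB/s (divided by 2**20), so treat it as a ceiling, not a prediction.</div>

```python
PAGE_SIZE = 4096  # bytes per page, same value as in the calculation above


def per_page_ceiling_mib_s(seconds_per_page, page_size=PAGE_SIZE):
    """Best-case single-threaded throughput in MiB/s.

    Hypothetical helper: assumes every page costs `seconds_per_page`
    and that pages are processed one after another.
    """
    pages_per_second = 1.0 / seconds_per_page
    return pages_per_second * page_size / 2**20


# Median ll_readpage time quoted above (2.406 us per page).
median_ceiling = per_page_ceiling_mib_s(2.406e-6)

# Mode of the same measurement (0.5 us per page).
mode_ceiling = per_page_ceiling_mib_s(0.5e-6)

print("ceiling at median per-page time: %.1f MiB/s" % median_ceiling)
print("ceiling at mode per-page time:   %.1f MiB/s" % mode_ceiling)
```

<div>With the median time this reproduces the ~1.6 GB/s figure, while the 0.5 us mode would allow roughly 7.8 GB/s by the same reasoning, so the slower calls are what pull the ceiling down.</div>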

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<br>

Is this a single-threaded write?  With single-threaded IO the bottleneck often happens in the kernel copy_{to,from}_user() that is copying data to/from userspace in order to do data caching in the client.  Having multiple threads doing the IO allows multiple cores to do the data copying.<br>

</blockquote><div><br></div><div>Even with the copy_{to,from}_user() overhead, you should be able to get more than 5 GB/s.  I've seen about 5.5 GB/s reading cached data on a client with lots of memory.</div><div> </div><div>

Jeremy</div></div>