[Lustre-discuss] Multiple IB ports

Jeremy Filizetti jeremy.filizetti at gmail.com
Mon Mar 21 09:23:53 PDT 2011


>
> I was told by a colleague that there are currently too many internal
> locks in the clients to sustain high throughput. Lustre is designed for
> global throughput across many clients, but not on individual clients.
>

The LNet SMP scaling fixes and enhancements should help, but I don't believe
they are coming until 2.1.

> I can observe this on my site, where I have enough storage and servers to
> reach 21 GB/s globally, but am unable to get more than 300 MB/s on a single
> client even though the DDR IB network would sustain 800+ MB/s ...
>

You probably need to disable checksums (the osc.*.checksums tunable), and a
DDR IB link should be able to sustain about 1.5 GB/s. I've seen close to
those rates with LNet self-tests, though I don't usually see them in normal
operation with the file system added on top.
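
A quick sanity check on that figure (back-of-the-envelope, not a
measurement): 4X DDR IB signals at 20 Gbit/s, and 8b/10b encoding leaves
80% of that for data, so the wire tops out around 2 GB/s before any
protocol overhead:

>>> # 4 lanes x 5 Gbit/s signalling; 8b/10b leaves 80%; divide by 8 for bytes
>>> 4 * 5.0 * 0.8 / 8
2.0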


> There must be something wrong with your configuration or the code has some
> bug, because we have had single clients doing 2GB/s in the past.  What
> version of Lustre did you test on?
>

I've never seen rates as high as 2 GB/s from a single client, but I've only
been focused on single-threaded IO.  For that I've seen peaks between 1.3 and
1.4 GB/s.

I spent a little time a while back trying to figure out what was behind that
with SystemTap, but I only looked at the read case.  It looked like the
per-page locking penalty can be high.  Monitoring each ll_readpage call, I
was seeing a median of 2.4 us for the read scenario while the mode was only
0.5 us.  IIRC it was the llap locking that accounted for most of the
ll_readpage time.  I didn't look at the penalty for rebalancing the cache
between the various CPUs.

Using those numbers:
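>>> # pages/s at the 2.4 us median, x 4 KiB per page, in MB/s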
>>> ((1/.000002406) * 4096)/2**20
1623.5453034081463

That gives me a best-case scenario of ~1.6 GB/s.  I thought about working on
the read case, but realized the effort probably wasn't worth putting into 1.8
and I would have to wait until 2.0 to test further.  Unfortunately I haven't
had the time since to look at 2.0+.
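
For comparison with the ~1.6 GB/s above, the same arithmetic against the
0.5 us mode shows how much headroom the locking penalty is eating (a rough
ceiling that ignores everything else in the path):

>>> # best-case ceiling if every page took the 0.5 us mode instead
>>> ((1/.0000005) * 4096)/2**20
7812.5

So if every page hit the fast path, the ceiling would be closer to 7.6 GB/s.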

>
> Is this a single-threaded write?  With single-threaded IO the bottleneck
> often happens in the kernel copy_{to,from}_user() that is copying data
> to/from userspace in order to do data caching in the client.  Having
> multiple threads doing the IO allows multiple cores to do the data copying.
>

Even with copy_{to,from}_user() in the path, a single client should be able
to provide at least 5 GB/s.  I've seen about 5.5 GB/s reading cached data on
a client with lots of memory.
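
If anyone wants to sanity-check that kind of cached-read number, a rough
sketch along these lines (the path below is just a placeholder) gives a
ballpark; run it twice so the second pass comes out of the client page
cache:

import time

CHUNK = 1 << 20  # read in 1 MiB chunks

def read_rate(path):
    """Return throughput in GB/s for one sequential pass over the file."""
    total = 0
    start = time.time()
    with open(path, 'rb') as f:
        while True:
            buf = f.read(CHUNK)
            if not buf:
                break
            total += len(buf)
    return total / float(time.time() - start) / 2**30

# the second run should show the cached rate
print(read_rate('/mnt/lustre/bigfile'))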

Jeremy