[Lustre-discuss] how to baseline the performance of a Lustre cluster?

Mon Jul 18 07:47:10 PDT 2011

Tim Carlson wrote:
> On Fri, 15 Jul 2011, Theodore Omtzigt wrote:
>
>   
>> To me it looks very disappointing as we can get 3GB/s from the RAID
>> controller aggregating a collection of raw SAS drives on the OSTs, and
>> we should be able to get a peak of ~5GB/s from QDR IB.
>>
>> First question: is this baseline reasonable?
>>     
>
> For starters, the theoretical peak of QDR IB is 4GB/s in terms of moving 
> real data. 40Gb/s is the signaling rate and you need to factor in the PCI 
> bus 8/10 encoding. So your 40Gb/s becomes 32Gb/s right off the bat. 

Yes, the (unidirectional) bandwidth of QDR 4x IB is 4GB/s, including
headers, due to the InfiniBand 8b/10b encoding.  This is the same (raw)
data rate as PCIe gen2 x8 (which also uses 8b/10b encoding, transmitting
10 bits for every 8-bit byte).
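
As a quick check of that arithmetic, here is a minimal shell sketch.  The
function name link_gbytes_per_sec is made up for illustration, and it
assumes awk is available.  It converts a signaling rate in Gb/s into
GB/s after line encoding:

```shell
#!/bin/sh
# Link bandwidth in GB/s once the line encoding is accounted for.
# $1 = signaling rate in Gb/s
# $2 = data bits per encoded symbol, $3 = total bits per encoded symbol
# (hypothetical helper, not part of any Lustre or IB tool)
link_gbytes_per_sec() {
    awk -v rate="$1" -v data="$2" -v total="$3" \
        'BEGIN { printf "%.2f\n", rate * data / total / 8 }'
}

# QDR 4x InfiniBand: 40 Gb/s signaling, 8b/10b encoding
link_gbytes_per_sec 40 8 10

# PCIe gen2 x8: 8 lanes at 5 GT/s each, also 8b/10b
link_gbytes_per_sec 40 8 10
```

Both calls print 4.00, i.e. 40Gb/s of signaling carries 32Gb/s, or
4GB/s, of encoded data including headers.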

Interestingly, the upcoming InfiniBand "FDR" moves to 64b/66b encoding,
which eliminates most of the link overhead.  [8b/10b encoding exists to
1) ensure an equal number of 1 and 0 bits, and 2) set a small upper
bound on the number of sequential 1 or 0 bits.  With 64b/66b there can
now be something like 65 bits in a row with the same value, which makes
it more susceptible to clock skew issues, although the claim is that in
practice the run length is much smaller, as a scrambler is used to
"randomize" the actual bits, and the sequences that correspond to 64 1's
or 64 0's will "never" be used.  So the "wrong" data pattern could cause
more problems.]

To clarify, this 4GB/s is reduced to around 3.2GB/s of data, primarily
due to the small packet size of PCIe (256 bytes), where the headers
consume quite a bit of the bandwidth; it is somewhat less when using
128-byte PCIe packets.
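
To see why the packet size matters, here is a rough shell sketch of the
per-packet overhead.  The 24 bytes of overhead per PCIe packet (framing,
sequence number, a 16-byte header and the link CRC) is an assumption,
not a number from this thread, and the function name is made up:

```shell
#!/bin/sh
# Payload bandwidth in GB/s when every packet carries fixed overhead.
# $1 = raw link rate in GB/s, $2 = payload bytes per packet,
# $3 = overhead bytes per packet (assumed value, see the text above)
pcie_payload_gbytes_per_sec() {
    awk -v raw="$1" -v payload="$2" -v overhead="$3" \
        'BEGIN { printf "%.2f\n", raw * payload / (payload + overhead) }'
}

pcie_payload_gbytes_per_sec 4 256 24   # prints 3.66
pcie_payload_gbytes_per_sec 4 128 24   # prints 3.37
```

This model only counts per-packet overhead and ignores all other
link-level traffic, so it comes out above the ~3.2GB/s figure.  It does
show the trend: halving the packet size costs noticeably more bandwidth.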

While MPI can achieve 3.2GB/s data rates, I have never seen o2ib lnet
get that high.  As I recall, something around 2.5GB/s is more typical.

> Now 
> try and move some data with something like mpi_send and you will see that 
> the real amount of data you can send is really more like 24Gb/s or 3GB/s.
>
> The test size for ost-survey is pretty small: 30MB. You can increase that 
> with the "-s" flag. Try at least 100MB.
>
> You should also turn off checksums to test raw performance. There is an 
> lctl conf_param to do this, but the quick and dirty route on the client is 
> the following bash:
>
> for OST in /proc/fs/lustre/osc/*/checksums
> do
>     echo 0 > "$OST"
> done
>
> For comparison's sake, on my latest QDR connected Lustre file system with 
> LSI 9285-8e controllers connected to JBODs of slow disks in 11 disk 
> RAID 6 stripes, I get around 500MB/s write and 350MB/s read using 
> ost-survey with 100MB data chunks.
>
> Your numbers seem reasonable.
>
>
> Tim
>   
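
The checksum loop quoted above can be wrapped so that the setting is easy
to restore after a benchmark run.  This is only a sketch: the function
name set_osc_checksums and its optional directory argument are made up
for illustration, and it assumes checksums were enabled (1) beforehand.

```shell
#!/bin/sh
# Write a value (0 = off, 1 = on) to every per-OSC checksums file.
# $1 = value to write
# $2 = osc directory; defaults to the path used in the quoted loop
# (hypothetical helper, not a Lustre-provided command)
set_osc_checksums() {
    value=$1
    osc_dir=${2:-/proc/fs/lustre/osc}
    for f in "$osc_dir"/*/checksums; do
        [ -e "$f" ] || continue   # glob matched nothing: no OSCs found here
        echo "$value" > "$f"
    done
}

# Typical use on a client, as root:
#   set_osc_checksums 0    # before the benchmark
#   ... run ost-survey ...
#   set_osc_checksums 1    # restore afterwards
```

Leaving checksums off after the test changes the behaviour of the client
in production, which is why the restore step is worth keeping.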

Theodore,

You have jumped straight to testing Lustre over the network, without
first providing performance numbers for the disks when locally attached.
(You also didn't test the network, but in the absence of bad links, GigE
and IB are less variable and well understood.)

As for the disk performance, were you able to measure 3GB/s from the
RAID controller, or what is that number based on?  What was the
performance of an individual LUN (or whatever backs your OST)?  Are all
the OSTs on a single server, and are you testing them one at a time?

You should be able to get 100+MB/s over GigE, although you may need 2
OSTs and larger IO sizes to do that.  Similarly, if you access multiple
OSTs simultaneously, you should see > 2GB/s over o2ib.  At least I am
assuming you are using o2ib and not just tcp over InfiniBand, which
would be slower.
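
For the GigE number, a rough upper bound can be worked out in shell.
This sketch assumes a standard 1500-byte MTU and TCP over IPv4 with no
header options (assumptions, not details given in this thread), and the
function name is made up:

```shell
#!/bin/sh
# Best-case TCP payload rate in MB/s on an Ethernet link.
# $1 = link speed in Mb/s, $2 = MTU in bytes
# (hypothetical helper for illustration only)
tcp_payload_mbytes_per_sec() {
    awk -v mbits="$1" -v mtu="$2" 'BEGIN {
        payload = mtu - 40   # minus IPv4 (20) and TCP (20) headers
        wire = mtu + 38      # plus preamble 8, Ethernet 14, FCS 4, gap 12
        printf "%.2f\n", mbits / 8 * payload / wire
    }'
}

tcp_payload_mbytes_per_sec 1000 1500   # prints 118.66
```

So 100+MB/s is within reach, but it is already most of what the wire
can carry.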

Kevin