[Lustre-discuss] how to baseline the performance of a Lustre cluster?

Theodore Omtzigt theo at stillwater-sc.com
Fri Jul 15 13:30:47 PDT 2011


I got a basic Lustre cluster up and running and did two experiments:

1- using GbE as the interconnect
2- using QDR IB as interconnect

Here are the simple performance results I collected using the pointers 
from the Lustre user guide:

[root@22-82 ~]# ost-survey /mnt/lustre01
/usr/bin/ost-survey: 06/29/11 OST speed survey on /mnt/lustre01 from 172.21.22.82@tcp
Number of Active OST devices : 8
Worst  Read OST indx: 1 speed: 57.704295
Best   Read OST indx: 3 speed: 61.655312
Read Average: 59.785920 +/- 1.245626 MB/s
Worst  Write OST indx: 3 speed: 28.564328
Best   Write OST indx: 5 speed: 70.976016
Write Average: 55.497721 +/- 13.457404 MB/s
Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
0     59.931       65.722        0.501      0.456
1     57.704       42.436        0.520      0.707
2     60.056       66.074        0.500      0.454
3     61.655       28.564        0.487      1.050
4     58.096       62.979        0.516      0.476
5     59.935       70.976        0.501      0.423
6     61.053       57.724        0.491      0.520
7     59.856       49.507        0.501      0.606
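As a sanity check on how ost-survey summarizes the table, the sketch below recomputes the GbE "Average +/-" lines from the per-OST columns. It is a minimal illustration that assumes the "+/-" figure is a population standard deviation (dividing by N). Recomputing from the rounded table values reproduces the reported numbers closely under that assumption. The helper name summarize is hypothetical.

```python
import statistics

# Per-OST rates (MB/s) copied from the GbE ost-survey table above.
gbe_read = [59.931, 57.704, 60.056, 61.655, 58.096, 59.935, 61.053, 59.856]
gbe_write = [65.722, 42.436, 66.074, 28.564, 62.979, 70.976, 57.724, 49.507]

def summarize(rates):
    """Return (mean, population std dev, worst OST index, best OST index)."""
    mean = statistics.fmean(rates)
    spread = statistics.pstdev(rates)  # population std dev: divides by N, not N-1
    worst = min(range(len(rates)), key=rates.__getitem__)
    best = max(range(len(rates)), key=rates.__getitem__)
    return mean, spread, worst, best

for name, rates in (("Read", gbe_read), ("Write", gbe_write)):
    mean, spread, worst, best = summarize(rates)
    print(f"{name}: {mean:.3f} +/- {spread:.3f} MB/s, "
          f"worst OST {worst}, best OST {best}")
```

The read spread is small (about 1.2 MB/s), while the write spread is over ten times larger, driven mostly by OST 3 at 28.6 MB/s.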

[root@22-82_ib ~]# ost-survey /mnt/lustre01
/usr/bin/ost-survey: 07/14/11 OST speed survey on /mnt/lustre01 from 10.1.3.82@o2ib
Number of Active OST devices : 8
Worst  Read OST indx: 0 speed: 180.625987
Best   Read OST indx: 6 speed: 214.961331
Read Average: 200.478485 +/- 11.408814 MB/s
Worst  Write OST indx: 0 speed: 291.709350
Best   Write OST indx: 6 speed: 496.616135
Write Average: 397.025375 +/- 59.815286 MB/s
Ost#  Read(MB/s)  Write(MB/s)  Read-time  Write-time
----------------------------------------------------
0     180.626       291.709        0.166      0.103
1     206.211       396.815        0.145      0.076
2     207.928       356.645        0.144      0.084
3     197.543       384.335        0.152      0.078
4     206.908       403.361        0.145      0.074
5     205.670       470.235        0.146      0.064
6     214.961       496.616        0.140      0.060
7     183.981       376.487        0.163      0.080
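One way to compare the two runs is per OST. The sketch below computes the IB/GbE speedup for each OST. It also multiplies each IB rate by its time to estimate how much data each test moved. The result is roughly 30 MB in every row. This is an inference from the numbers, not something ost-survey reports. If it is right, each per-OST test takes only 60 to 170 ms on IB, so fixed per-test costs could weigh heavily in these figures. The helper names implied_mb and speedups are hypothetical.

```python
# Per-OST rates (MB/s) and times (s) copied from the two ost-survey tables above.
gbe_read = [59.931, 57.704, 60.056, 61.655, 58.096, 59.935, 61.053, 59.856]
gbe_write = [65.722, 42.436, 66.074, 28.564, 62.979, 70.976, 57.724, 49.507]
ib_read = [180.626, 206.211, 207.928, 197.543, 206.908, 205.670, 214.961, 183.981]
ib_write = [291.709, 396.815, 356.645, 384.335, 403.361, 470.235, 496.616, 376.487]
ib_read_time = [0.166, 0.145, 0.144, 0.152, 0.145, 0.146, 0.140, 0.163]
ib_write_time = [0.103, 0.076, 0.084, 0.078, 0.074, 0.064, 0.060, 0.080]

def implied_mb(rates, times):
    """Rate x time per OST: roughly how much data each test appears to move."""
    return [r * t for r, t in zip(rates, times)]

def speedups(fast, slow):
    """Per-OST ratio of the faster run's rate to the slower run's rate."""
    return [f / s for f, s in zip(fast, slow)]

print("implied read MB :", [round(x, 1) for x in implied_mb(ib_read, ib_read_time)])
print("implied write MB:", [round(x, 1) for x in implied_mb(ib_write, ib_write_time)])
print("read speedup    :", [round(x, 2) for x in speedups(ib_read, gbe_read)])
print("write speedup   :", [round(x, 2) for x in speedups(ib_write, gbe_write)])
```

The read speedup stays between about 3.0x and 3.6x. The write speedup ranges from about 4.4x (OST 0) to about 13.5x (OST 3), largely because OST 3 was the slowest GbE writer.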

Are these results any good?

To me these numbers look very disappointing: we can get 3GB/s from the RAID
controller that aggregates a collection of raw SAS drives on the OSTs, and
we should be able to get a peak of ~5GB/s from QDR IB.
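To put the survey averages in proportion, the sketch below expresses each one as a percentage of a nominal ceiling. It makes several assumptions:

- The Ethernet link runs at 1 Gbit/s.
- The IB link is a standard 4x QDR link. That means four lanes at 10 Gbit/s signalling with 8b/10b encoding, or 32 Gbit/s (4 GB/s) of data per direction.
- MB means 10^6 bytes. ost-survey's own convention is not stated in its output.

The 3 GB/s RAID figure and the 5 GB/s IB expectation stated above are kept as given. The helper name pct is hypothetical.

```python
# Nominal ceilings in MB/s, assuming MB = 10**6 bytes.
GBE_MB_S = 1_000 / 8                    # 1 Gbit/s Ethernet -> 125 MB/s
QDR_4X_MB_S = 4 * 10_000 * 8 / 10 / 8   # 4 lanes x 10 Gbit/s, 8b/10b -> 4000 MB/s
RAID_MB_S = 3_000                       # RAID controller figure quoted in the post
EXPECTED_IB_MB_S = 5_000                # IB peak expected in the post

# Averages copied from the two ost-survey runs, paired with their link ceiling.
runs = {
    "GbE read": (59.785920, GBE_MB_S),
    "GbE write": (55.497721, GBE_MB_S),
    "IB read": (200.478485, QDR_4X_MB_S),
    "IB write": (397.025375, QDR_4X_MB_S),
}

def pct(measured, ceiling):
    """Measured rate as a percentage of a ceiling rate."""
    return 100.0 * measured / ceiling

for name, (avg, link) in runs.items():
    print(f"{name:9s}: {pct(avg, link):5.1f}% of link, "
          f"{pct(avg, RAID_MB_S):5.1f}% of RAID")

print(f"IB write vs expected IB peak: "
      f"{pct(runs['IB write'][0], EXPECTED_IB_MB_S):.1f}%")
```

Under these assumptions, the GbE averages sit a little under half of line rate. The IB averages reach only about 5% (read) and 10% (write) of the link's data rate.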

First question: is this baseline reasonable?
Second question: what tools can I use to better understand the Lustre FS
behavior and characterize the performance I am getting on the client side?

I checked the IB network and recorded no IB network errors during these
runs, so I am confident that the IB network was working properly.

Looking forward to better understanding Lustre performance.