[Lustre-discuss] Two questions about the tuning of Lustre file system.
Kevin Van Maren
kevin.van.maren at oracle.com
Fri May 20 08:23:37 PDT 2011
What exactly were you testing? I have no idea how to interpret your
numbers. A single client reading from a single file? One file per OST,
or file striped across all OSTs? Is the Lustre file system idle except
for your test?
In general, start with the pieces:
1) make sure the network is sane. Try measuring BW to/from each node
(client and server) to ensure all the cables are good. For your
configuration, you should be able to measure ~3.2GB/s (unidirectional)
using large MPI messages. While I prefer to use MPI, some people use
the lnet_selftest.
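If you go the lnet_selftest route, a session along the following lines exercises LNET bulk reads between one client and one server. This is only a sketch: the @o2ib NIDs are placeholders, so substitute the output of `lctl list_nids` on your own nodes.

```shell
# Hypothetical NIDs -- replace with `lctl list_nids` output from each node.
modprobe lnet_selftest
lst new_session read_bw
lst add_group clients 192.168.1.10@o2ib
lst add_group servers 192.168.1.20@o2ib
lst add_batch bulk_read
lst add_test --batch bulk_read --from clients --to servers brw read size=1M
lst run bulk_read
lst stat servers        # watch the bandwidth for ~30s, then Ctrl-C
lst end_session
```

A healthy QDR link should report close to the ~3.2GB/s figure above; a node that reports far less is worth checking for a bad cable or a PCIe link that trained at reduced width.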
2) make sure each OST is sane. For each OST, create a file that is only
striped on that OST. Make sure a client can read/write each of these
files as expected. Be sure you transfer much more data than the
client+server RAM sizes.
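The per-OST check can be scripted with lfs setstripe. A rough sketch, assuming the example mount point /mnt/lustre, 40 OSTs (5 OSS x 8), and 64GiB of data per file (adjust count so it exceeds combined client+server RAM):

```shell
# For each OST index, create a file striped only on that OST (-c 1 -i $i),
# then stream more data through it than fits in cache.
for i in $(seq 0 39); do
    f=/mnt/lustre/ost_test.$i
    lfs setstripe -c 1 -i $i $f
    dd if=/dev/zero of=$f bs=1M count=65536
    dd if=$f of=/dev/null bs=1M
    rm -f $f
done
```

An OST whose read rate is well below its siblings points at that OSS's disks or RAID layout rather than the network.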
Many issues are sorted out just getting both 1 & 2 in good shape.
Kevin
Tanin wrote:
> Dear all,
>
> I have two questions regarding the performance of the Lustre file system.
> Currently, we have 5 OSS nodes, and each OSS carries 8 OSTs. All the
> nodes (including the MDT/MGS node and client node) are connected to a
> Mellanox MTS 3600 InfiniBand switch using RDMA for data transfer. The
> bandwidth of the network is 40Gbps. The kernel version is 'Linux
> 2.6.18-164.11.1.el5_lustre.1.8.3 #1 SMP Fri Apr 9 18:00:39 MDT 2010
> x86_64 x86_64 x86_64 GNU/Linux'. OS is RHEL 5.5. Lustre version is
> 1.8.3. OFED Version is 1.5.2. IB HCA is Mellanox Technologies MT26428
> ConnectX VPI PCIe IB QDR.
>
> I ran a simple test on the client side to measure the peak read
> performance. Here is the data:
>
> # Interval    Data transferred    Bandwidth
> 2 sec 2.18 GBytes 8.71 Gbits/sec
> 2 sec 2.06 GBytes 8.24 Gbits/sec
> 2 sec 2.10 GBytes 8.40 Gbits/sec
> 2 sec 1.93 GBytes 7.73 Gbits/sec
> 2 sec 1.50 GBytes 6.02 Gbits/sec
> 2 sec 420.00 MBytes 1.64 Gbits/sec
> 2 sec 2.19 GBytes 8.75 Gbits/sec
> 2 sec 2.08 GBytes 8.32 Gbits/sec
> 2 sec 2.08 GBytes 8.32 Gbits/sec
> 2 sec 1.99 GBytes 7.97 Gbits/sec
> 2 sec 1.80 GBytes 7.19 Gbits/sec
> *2 sec 160.00 MBytes 640.00 Mbits/sec*
> 2 sec 2.15 GBytes 8.59 Gbits/sec
> 2 sec 2.13 GBytes 8.52 Gbits/sec
> 2 sec 2.15 GBytes 8.59 Gbits/sec
> 2 sec 2.09 GBytes 8.36 Gbits/sec
> 2 sec 2.09 GBytes 8.36 Gbits/sec
> 2 sec 2.07 GBytes 8.28 Gbits/sec
> 2 sec 2.15 GBytes 8.59 Gbits/sec
> 2 sec 2.11 GBytes 8.44 Gbits/sec
> 2 sec 2.05 GBytes 8.20 Gbits/sec
> *2 sec 0.00 Bytes 0.00 bits/sec*
> *2 sec 0.00 Bytes 0.00 bits/sec*
> 2 sec 1.95 GBytes 7.81 Gbits/sec
> 2 sec 2.14 GBytes 8.55 Gbits/sec
> 2 sec 1.99 GBytes 7.97 Gbits/sec
> 2 sec 2.00 GBytes 8.01 Gbits/sec
> 2 sec 370.00 MBytes 1.45 Gbits/sec
> 2 sec 1.96 GBytes 7.85 Gbits/sec
> 2 sec 2.03 GBytes 8.12 Gbits/sec
> 2 sec 1.89 GBytes 7.58 Gbits/sec
> 2 sec 1.94 GBytes 7.77 Gbits/sec
> 2 sec 640.00 MBytes 2.50 Gbits/sec
> 2 sec 1.47 GBytes 5.90 Gbits/sec
> 2 sec 1.94 GBytes 7.77 Gbits/sec
> 2 sec 1.90 GBytes 7.62 Gbits/sec
> 2 sec 1.94 GBytes 7.77 Gbits/sec
> 2 sec 1.18 GBytes 4.73 Gbits/sec
> 2 sec 940.00 MBytes 3.67 Gbits/sec
> 2 sec 1.97 GBytes 7.89 Gbits/sec
> 2 sec 1.93 GBytes 7.73 Gbits/sec
> 2 sec 1.87 GBytes 7.46 Gbits/sec
> 2 sec 1.77 GBytes 7.07 Gbits/sec
> 2 sec 320.00 MBytes 1.25 Gbits/sec
> 2 sec 1.97 GBytes 7.89 Gbits/sec
> 2 sec 2.00 GBytes 8.01 Gbits/sec
> 2 sec 1.89 GBytes 7.58 Gbits/sec
> 2 sec 1.93 GBytes 7.73 Gbits/sec
> 2 sec 350.00 MBytes 1.37 Gbits/sec
> 2 sec 1.77 GBytes 7.07 Gbits/sec
> 2 sec 1.92 GBytes 7.70 Gbits/sec
> 2 sec 2.05 GBytes 8.20 Gbits/sec
> 2 sec 2.01 GBytes 8.05 Gbits/sec
> 2 sec 710.00 MBytes 2.77 Gbits/sec
> 2 sec 1.59 GBytes 6.37 Gbits/sec
> 2 sec 2.00 GBytes 8.01 Gbits/sec
> 2 sec 710.00 MBytes 2.77 Gbits/sec
> 2 sec 1.59 GBytes 6.37 Gbits/sec
> 2 sec 2.00 GBytes 8.01 Gbits/sec
> 2 sec 1.88 GBytes 7.54 Gbits/sec
> 2 sec 1.62 GBytes 6.48 Gbits/sec
>
>
> As you can see, although the peak bandwidth can reach 8.71 Gbits/sec, the
> performance is quite unstable (sometimes the bandwidth just gets
> choked off). All the OSS nodes seem to stop serving data simultaneously.
> I tried grouping different OSTs and turning the checksum on/off, but
> this still happens. Does anybody have a hint about the cause?
>
> 2. As we know, when a Lustre client reads data, the data is moved
> from the OSS disk into OSS memory, and then sent to the Lustre client.
> Apart from O_DIRECT, is there any other configuration that optimizes
> the disk data access, such as using sendfile, splice, or fio, which
> could greatly expedite disk reads?
>
> fio: http://freshmeat.net/projects/fio/
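For reference on the O_DIRECT part of the second question, here is a minimal, hedged sketch of a cache-bypassing read on Linux. The path and block size are examples only; O_DIRECT requires the offset, size, and buffer to be block-aligned, and some filesystems reject the flag entirely, so the sketch falls back to a buffered open in that case.

```python
import os, mmap

path = "odirect_demo.bin"   # example path; on Lustre, point at a striped file
blk = 4096                  # O_DIRECT needs block-aligned offset, size, buffer

# Create a one-block test file.
with open(path, "wb") as f:
    f.write(b"x" * blk)

# O_DIRECT bypasses the client page cache; fall back to a buffered
# read on filesystems (e.g. tmpfs) that reject the flag.
try:
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
except OSError:
    fd = os.open(path, os.O_RDONLY)
try:
    buf = mmap.mmap(-1, blk)        # anonymous mmap => page-aligned buffer
    n = os.readv(fd, [buf])         # read into the aligned buffer
    data = bytes(buf[:n])
finally:
    os.close(fd)
os.remove(path)
print(len(data))
```

Note that O_DIRECT trades cache hits for predictability: it helps streaming workloads that would otherwise evict useful cached data, but it can hurt re-read-heavy workloads.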
>
> Any help will be greatly appreciated. Thanks!
>
>
>
> --
> Best regards,
>
> -----------------------------------------------------------------------------------------------
> Li, Tan
> PhD Candidate & Research Assistant,
> Electrical Engineering,
> Stony Brook University, NY
>
> Personal Web Site: https://sites.google.com/site/homepagelitan/Home
>
> Email: fanqielee at gmail.com <mailto:fanqielee at gmail.com>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>