[Lustre-discuss] 1.8.1.1 write slow performance :/

Brian J. Murrell <Brian.Murrell@Sun.COM>
Mon Nov 9 07:50:04 PST 2009


On Sun, 2009-11-08 at 21:52 +0100, Piotr Wadas wrote:
> I just did some speed tests between a client and the filesystem server,
> over a dedicated Gbit Ethernet connection. I compared uploading via the
> Lustre-mounted share with uploading to the same share mounted as a
> loopback Lustre client on the filesystem server and re-exported via NFS.

I'm not sure I understand the configuration of your "loopback" setup.
Could you provide more details about what exactly you mean?

> nfs client => nfs server => loopback lustre server => drbd resource
> "X".

I don't understand why you need to introduce NFS in all of this.

> aleft:~# mount -t lustre
> /dev/drbd0 on /mnt/mgs type lustre (rw,noauto)
> /dev/drbd1 on /mnt/mdt type lustre (rw,noauto,_netdev)
> /dev/drbd2 on /mnt/ost01 type lustre (rw,noauto,_netdev)
> master@tcp0:/lfs00 on /mnt/lfs00 type lustre (rw,noauto,_netdev)

So "master" is your lustre server with both mds and ost on it?  This
will be a sub-optimal configuration due to the seeking being done
between the MDT and OST.  If you really do have only one machine
available to you, then likely plain old NFS will perform better for you.

Lustre doesn't really begin to shine until you can throw more resources
at it.  In that situation, Lustre starts to outperform NFS by
efficiently utilizing the many machines you give it.
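
For example (only a sketch, and assuming you eventually have several
OSTs behind /mnt/lfs00), you could spread each file across all of them
with something like:

  # stripe new files in this directory over all available OSTs (-1 = all),
  # using a 1 MB stripe size
  lfs setstripe -c -1 -s 1M /mnt/lfs00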

> b02:~# mount -t lustre
> master@tcp0:/lfs00 on /mnt/lfs00 type lustre (rw,noauto,_netdev)
> b02:~# mount -t nfs |grep master
> master:/mnt/lfs00 on /mnt/nfs00 type nfs (rw,addr=192.168.0.100)

Ahhh.  Maybe your NFS scenario is clearer now: on the "master" server
you have mounted a Lustre client and exported that via NFS?

> time dd if=/dev/zero of=/mnt/lfs00/testfile-b02 bs=1024 count=102400
                                                  ^^^^^^^ ^^^^^^^^^^^^
Try increasing the block size and reducing the count so you still send
the same amount of data: a block size of 1M with a count of 100 gives
you the same 100 MB dataset.
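That is, something along these lines (untested; same paths as in your
script):

  # 100 x 1 MB blocks = the same 100 MB, but far fewer write() calls/RPCs
  time dd if=/dev/zero of=/mnt/lfs00/testfile-b02 bs=1M count=100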

> lfs00-get
> time dd of=testfile-b02 if=/mnt/lfs00/testfile-b02 bs=1024
> count=102400
> 102400+0 records in
> 102400+0 records out
> 104857600 bytes (105 MB) copied, 0.987265 s, 106 MB/s
> 
> real    0m0.989s
> user    0m0.040s
> sys     0m0.880s

This result is likely demonstrating readahead.
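
If you want to confirm that, you could drop the client's cache and turn
readahead off for a test, something like (assuming a 1.8 client; the
parameter name can differ on other versions):

  # show the current client readahead limit
  lctl get_param llite.*.max_read_ahead_mb
  # drop the page cache, disable readahead, then re-run the read test
  echo 3 > /proc/sys/vm/drop_caches
  lctl set_param llite.*.max_read_ahead_mb=0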

> b02:~# ./100mb-nfs.sh 
> 
> nfs00-send
> time dd if=/dev/zero of=/mnt/nfs00/testfile-b02 bs=1024 count=102400
> 102400+0 records in
> 102400+0 records out
> 104857600 bytes (105 MB) copied, 1.05942 s, 99.0 MB/s

Probably something in the NFS stack is coalescing the small writes into
larger writes before sending them over the wire to the server.
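
You can check what transfer size the client actually negotiated with
something like:

  # rsize/wsize show how large the coalesced reads/writes on the wire can be
  nfsstat -m
  # or, if nfsstat is not installed:
  grep nfs /proc/mounts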

> nfs00-get
> time dd of=testfile-b02 if=/mnt/nfs00/testfile-b02 bs=1024
> count=102400
> 102400+0 records in
> 102400+0 records out
> 104857600 bytes (105 MB) copied, 0.576351 s, 182 MB/s

What is /dev/drbd2?  Is it a pair of (individual) disks or an array of
some sort?  If individual disks, you have to agree that 182 MB/s is
unrealistic, yes?  It is also more than gigabit Ethernet can deliver
(roughly 118 MB/s at best), so what you are really measuring here is
the client's cache: the file was just written by that same client, so
it is still in memory.
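
To see what the OST's backing device can really deliver, you could
measure it directly on the server (read-only, and only a rough sketch;
iflag=direct keeps the page cache out of it):

  # raw sequential read from the drbd device, bypassing the cache
  dd if=/dev/drbd2 of=/dev/null bs=1M count=1024 iflag=direct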

Try increasing your dataset size so that it exceeds the ability of the
cache to help out.  Probably a few tens of GB will do it.
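Something along these lines, for instance (untested; the file name is
just an example, and adjust the count so the file is well beyond the
RAM of both client and server; conv=fdatasync makes dd include the
final flush in its timing):

  # ~20 GB write, timed including the flush to the servers
  time dd if=/dev/zero of=/mnt/lfs00/bigfile bs=1M count=20480 conv=fdatasync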

b.
