[lustre-discuss] Lustre 2.9 performance issues

Vicker, Darby (JSC-EG311) darby.vicker-1 at nasa.gov
Sun Apr 30 21:24:21 PDT 2017


This worked great.  We implemented it on Friday and the timings of the dd test on our 2.9/ZFS LFS have dropped to under a second.  Thanks a lot.  The risk of both the client and the OSS crashing within a few seconds of a sync is low enough for us, given the performance gain.

The commit you pointed to didn't apply cleanly to the 2.9 source.  Please let me know if you want us to upload an updated patch to that LU (or post it to this list).  
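
For anyone who wants to repeat the comparison, the dd test quoted further down can be wrapped in a small timing helper.  This is only a sketch: the function name fsync_bench and its arguments are made up for illustration, it assumes GNU dd (for conv=fsync), and it only has whole-second resolution, which is enough to tell a sub-second run from a 20 to 60 second one.

```shell
#!/bin/sh
# fsync_bench DIR COUNT  (hypothetical helper, not part of Lustre or ZFS)
# Writes COUNT 1 KiB files into DIR, forcing an fsync after each write,
# and reports the total elapsed wall-clock time in whole seconds.
fsync_bench() {
    dir=${1:-.}       # directory on the filesystem under test
    count=${2:-100}   # number of small files to write
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        # Same dd invocation as the original test; abort on the first failure.
        dd if=/dev/zero of="$dir/dd.dat.$i" bs=1k count=1 conv=fsync \
            > /dev/null 2>&1 || return 1
        i=$((i + 1))
    done
    end=$(date +%s)
    echo "$count fsync writes in $((end - start)) s"
}
```

Running it once in a directory on each filesystem (for example "fsync_bench /path/to/testdir 100", where the path is a placeholder) gives numbers that can be compared directly before and after changing the sync behaviour.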

-----Original Message-----
From: "Dilger, Andreas" <andreas.dilger at intel.com>
Date: Thursday, April 27, 2017 at 6:21 PM
To: "Bass, Ned" <bass6 at llnl.gov>, Darby Vicker <darby.vicker-1 at nasa.gov>
Cc: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] Lustre 2.9 performance issues

On Apr 25, 2017, at 13:11, Bass, Ned <bass6 at llnl.gov> wrote:
> 
> Hi Darby,
> 
>> -----Original Message-----
>> 
>> for i in $(seq 0 99) ; do
>>   dd if=/dev/zero of=dd.dat.$i bs=1k count=1 conv=fsync > /dev/null 2>&1
>> done
>> 
>> The timing of this ranges from 0.1 to 1 sec on our old LFS but ranges from 20
>> to 60 sec on our newer 2.9 LFS.  
> 
> Because Lustre does not yet use the ZFS Intent Log (ZIL), it implements fsync() by
> waiting for an entire transaction group to get written out. This can incur long
> delays on a busy filesystem as the transaction groups become quite large. Work
> on implementing ZIL support is being tracked in LU-4009 but this feature is not
> expected to make it into the upcoming 2.10 release.

There is also a patch that was developed in the past to test this:
https://review.whamcloud.com/7761 "LU-4009 osd-zfs: Add tunables to disable sync"
which allows the servers to skip waiting for the ZFS TXG commit on each sync.

That may be an acceptable workaround in the meantime.  Essentially, clients would
_start_ a sync on the server, but would not wait for completion before returning
to the application.  Both the client and the OSS would need to crash within a few
seconds of the sync for it to be lost.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation
