[lustre-discuss] Lustre 2.9 performance issues

Dilger, Andreas andreas.dilger at intel.com
Thu Apr 27 16:21:14 PDT 2017


On Apr 25, 2017, at 13:11, Bass, Ned <bass6 at llnl.gov> wrote:
> 
> Hi Darby,
> 
>> -----Original Message-----
>> 
>> for i in $(seq 0 99) ; do
>>   dd if=/dev/zero of=dd.dat.$i bs=1k count=1 conv=fsync > /dev/null 2>&1
>> done
>> 
>> The timing of this ranges from 0.1 to 1 sec on our old LFS but ranges from 20
>> to 60 sec on our newer 2.9 LFS.  
> 
> Because Lustre does not yet use the ZFS Intent Log (ZIL), it implements fsync() by
> waiting for an entire transaction group to get written out. This can incur long
> delays on a busy filesystem as the transaction groups become quite large. Work
> on implementing ZIL support is being tracked in LU-4009 but this feature is not
> expected to make it into the upcoming 2.10 release.

There is also the patch that was developed in the past to test this:
https://review.whamcloud.com/7761 "LU-4009 osd-zfs: Add tunables to disable sync"
which makes it possible to stop the servers from waiting for a ZFS TXG commit
on each sync.

That may be an acceptable workaround in the meantime.  Essentially, clients would
_start_ a sync on the server, but would not wait for completion before returning
to the application.  Both the client and the OSS would need to crash within a few
seconds of the sync for the data to be lost.
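
To judge whether the workaround helps, it may be more informative to time each
fsync'd write separately rather than the whole loop.  A rough sketch based on
the loop quoted above, assuming bash with GNU dd and GNU date (for %N); the
function name fsync_bench and its directory argument are made up for
illustration and are not part of any Lustre tooling:

```shell
#!/bin/bash
# fsync_bench DIR [COUNT]
# Hypothetical helper: writes COUNT 1 KiB files into DIR, forcing an fsync
# on each one, and prints the elapsed wall-clock time per file in ms.
fsync_bench() {
    local dir=$1 count=${2:-100}
    local i start end
    for i in $(seq 0 $((count - 1))); do
        start=$(date +%s%N)    # nanoseconds since the epoch (GNU date)
        dd if=/dev/zero of="$dir/dd.dat.$i" bs=1k count=1 conv=fsync \
            > /dev/null 2>&1
        end=$(date +%s%N)
        echo "$i $(( (end - start) / 1000000 )) ms"
    done
}
```

Running something like "fsync_bench /path/to/lustre/testdir 100" (the path is
a placeholder) before and after changing the setting should show whether the
per-file latency tracks the TXG commit interval.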

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation








