[Lustre-discuss] Performances and fsync()

Peter Grandi pg_lus at lus.for.sabi.co.UK
Sun Dec 6 11:32:02 PST 2009


[ ... ]

>> 1.6.7.1.  I am using the new ADIO Lustre driver and saw that
>> performance is very low. The reason is that the write
>> bandwidth is calculated after a call to fsync().

This is storage systems FAQ #1: committed IOP performance is not
the same as streaming buffered performance.
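
As an illustration (not from the thread), a tiny C test along
these lines makes the distinction visible; the path and size are
arbitrary:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        char buf[4096];
        memset(buf, 'x', sizeof(buf));

        /* path and size are arbitrary, purely illustrative */
        int fd = open("/tmp/fsync-test", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        double t0 = now();
        if (write(fd, buf, sizeof(buf)) < 0) perror("write"); /* page cache only */
        double t1 = now();
        if (fsync(fd) < 0) perror("fsync");     /* waits for stable storage */
        double t2 = now();

        printf("buffered write: %f s   fsync: %f s\n", t1 - t0, t2 - t1);
        close(fd);
        return 0;
    }

The buffered write() usually returns in microseconds because it only
dirties the page cache; the fsync() does not return until the data
and metadata are committed, which on rotating media costs seeks.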

>> After investigating, I saw that even when the file is empty,
>> the fsync takes 10 ms.

That's pretty obvious:

[ ... ]

> [ ... ] 10ms to do a seek on a disk, so this looks like about
> a single seek for each RPC.

>> If there is more than one process, the fsync calls seem to
>> be serialized.  The time is 80 ms for 8 processes: [ ... ]

>> salloc -n 8 -N 1 mpirun time-fsync -f /mnt/romio/FILE
>> filename=/mnt/romio/FILE
>> First sync (proc 0): 0.005534
>> 03: sync : 0.019168 (err=0)
>> 07: sync : 0.028794 (err=0)
>> 01: sync : 0.038586 (err=0)
>> 05: sync : 0.048467 (err=0)
>> 02: sync : 0.058380 (err=0)
>> 00: sync : 0.068205 (err=0)
>> 04: sync : 0.078027 (err=0)
>> 06: sync : 0.087960 (err=0)

> Very strange.

Not necessarily -- IIRC Lustre metadata are not "striped". The
file is empty, and nothing is written to it, so there should be no
traffic to the OSTs, only to the currently active MDT.
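
The original 'time-fsync' program is not shown; presumably each
MPI rank just opens the shared (empty) file and times its own
fsync(), roughly like the sketch below (a reconstruction, not the
poster's actual code):

    #include <fcntl.h>
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* shared file name is illustrative; nothing is ever written to it */
        const char *path = (argc > 1) ? argv[1] : "/mnt/romio/FILE";
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); MPI_Abort(MPI_COMM_WORLD, 1); }

        MPI_Barrier(MPI_COMM_WORLD);          /* start all ranks together */
        double t0 = MPI_Wtime();
        int err = fsync(fd);                  /* empty file: metadata-only commit */
        double t1 = MPI_Wtime();

        printf("%02d: sync : %f (err=%d)\n", rank, t1 - t0, err);
        close(fd);
        MPI_Finalize();
        return 0;
    }

If the MDT commits those syncs one at a time, the per-rank times
grow by roughly one seek each, which is what the output above shows.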

>> The same program on an NFS file gives less than 5
>> microseconds for the same fsync() calls on 8 processes:
>> [ ... ]

The NFS mount options are likely "wrong", i.e. those fsync()s are
probably not actually reaching stable storage. Note also that you
are using 'fsync' and not 'fdatasync', and perhaps 'noatime' is
set differently between Lustre and NFS.

>> 2) Is it possible to configure something to make this fsync()
>> run better?

Well, a beginner text on storage systems and file systems and
transactions would be a start, so at least there would be a
basic understanding of why 10ms per 'fsync' is probably right
and 5us per 8 'fsync's is probably wrong.
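
Rough arithmetic, assuming an ordinary 7200RPM disk behind the
MDT (my assumption, not a figure from the thread):

    average seek + rotational latency  ~  8-12ms  ->  ~100 committed ops/s
    8 fsync()s at ~5us each            ~  40us    ->  ~200,000 "commits"/s

The second figure is simply impossible on rotating media without a
battery-backed cache, so those NFS fsync()s almost certainly never
reached stable storage.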

> Some filesystems (e.g. Reiser4) have the dubious optimization
> of disabling fsync() altogether, because it slows down
> applications too much, but if applications are calling fsync()
> it is generally for a good reason (though, I admit, not
> always).

More broadly, the problem usually is that applications don't even
issue 'fsync'; very few people seem to have read any beginner
text on storage systems, file systems and transactions, or to
understand why it matters and when. This is one aspect of the
"userspace sucks" issue. Some links that AndreasD probably knows
well (the pattern they keep coming back to is sketched after the
list):

    http://sandeen.net/wordpress/?p=34
    http://sandeen.net/wordpress/?p=42
    http://mjg59.livejournal.com/108257.html
    http://tribulaciones.org/2009/03/is-ext4-unsafe/
    http://lwn.net/SubscriberLink/322823/e6979f02e5a73feb/
    http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
    http://loupgaroublond.blogspot.com/2009/03/anecdote-about-why-doing-wrong-thing-is.html
    https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45
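
That pattern, as a minimal C sketch (the file names are purely
illustrative, not from any of the pages above):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* write to a temporary file, commit it, then atomically replace the
     * old version; file names here are purely illustrative */
    int save_file(const char *path, const char *tmp, const char *data)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;

        if (write(fd, data, strlen(data)) < 0) { close(fd); return -1; }
        if (fsync(fd) < 0) { close(fd); return -1; } /* data on stable storage */
        close(fd);

        /* without the fsync() above, a crash here can leave a zero-length
         * file on delayed-allocation filesystems */
        return rename(tmp, path);
    }

    int main(void)
    {
        return save_file("config.txt", "config.txt.tmp", "key=value\n") ? 1 : 0;
    }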

> As for legitimately optimizing this, there are a few options.

Perhaps the best option is to use low latency storage media.
That's the only really good way to get IOP/s up in a sustained
way. IIRC AndreasD has repeatedly recommended using good SSDs
if one wants fast MDTs (but writing an SSD page can be slow).
Battery-backed RAM also seems important.

[ ... ]

> Secondly, in Lustre 1.8 with Version Based Recovery, it would be
> possible for the MDS and OSS to determine if the file being
> fsync'd has any uncommitted changes, and if not then not do
> anything at all.

I suspect that there is an important difference here between
'fsync' and 'fdatasync'.
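
For illustration only (path and sizes are arbitrary, and not part
of the thread): fdatasync() is allowed to skip committing metadata
such as mtime that is not needed to read the data back, while
fsync() must commit it all, so the two can differ measurably:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        char buf[4096];
        memset(buf, 'y', sizeof(buf));

        int fd = open("/tmp/sync-test", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, buf, sizeof(buf)) < 0) perror("write");
        double t0 = now();
        fdatasync(fd);              /* data (and size), but not e.g. mtime */
        double t1 = now();

        if (write(fd, buf, sizeof(buf)) < 0) perror("write");
        double t2 = now();
        fsync(fd);                  /* data plus all metadata */
        double t3 = now();

        printf("fdatasync: %f s   fsync: %f s\n", t1 - t0, t3 - t2);
        close(fd);
        return 0;
    }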

[ ... ]


