[Lustre-discuss] Performances and fsync()
Andreas Dilger
adilger at sun.com
Sat Dec 5 17:14:37 PST 2009
On 2009-12-04, at 03:24, pascal.deveze at bull.net wrote:
> I am using b_eff_io to measure performance of ROMIO over Lustre
> version
> 1.6.7.1. I am using the new ADIO Lustre Driver and saw that
> performance is very low. The reason is that the write
> bandwidth is calculated after a call to fsync().
>
> After investigation, I saw that even when the file is empty, the
> fsync takes 10 ms. If there is more than one process, the fsync calls
> seem to be serialized. The time is 80 ms for 8 processes:
>
> salloc -n 8 -N 1 mpirun time-fsync -f /mnt/romio/FILE
> filename=/mnt/romio/FILE
> First sync (proc 0): 0.005534
> 03: sync : 0.019168 (err=0)
> 07: sync : 0.028794 (err=0)
> 01: sync : 0.038586 (err=0)
> 05: sync : 0.048467 (err=0)
> 02: sync : 0.058380 (err=0)
> 00: sync : 0.068205 (err=0)
> 04: sync : 0.078027 (err=0)
> 06: sync : 0.087960 (err=0)
Very strange.
> 1) Is this behaviour normal for Lustre ?
Not AFAIK. For proper data consistency, Lustre not only flushes
the cache for the file descriptor (which is empty in this case), but
also sends a SYNC RPC to the MDS to ensure that the metadata for
this file is persistent on disk. From reading the code, a regular
sys_fsync() _should_ only cause an MDS SYNC RPC, while sys_fdatasync()
will also cause an OSS SYNC RPC for each stripe. That said, I'm not
100% sure the kernel has this right until the very latest kernels
(i.e. 2.6.32).
I'm not sure of the exact semantics of fsync() on NFS; it may be
essentially a no-op when there is no dirty data in cache, because the
writes themselves are always synchronous and there is no need to do
anything on the server.
The Lustre RPCs _should_ all be happening in parallel, from looking at
your program below, but it is possible that they are not arriving
_quite_ at the same time on the server, and this is forcing an extra
transaction commit for each RPC. The times are about right - 10ms to
do a seek on a disk, so this looks like about a single seek for each
RPC.
> 2) Is it possible to configure something to make this fsync() run
> better ?
Some filesystems (e.g. Reiser4) have the dubious optimization of
disabling fsync() altogether, because it slows down applications too
much; but if applications are calling fsync() it is generally for a
good reason (though, I admit, not always).
As for legitimately optimizing this, there are a few options. First,
the RPCs and the corresponding file operations on the servers should
happen in parallel, and I'm not sure why at least most of them are not
being aggregated into the same transaction. Getting debug logs from
the servers and looking into why they are not grouped into a single
transaction should identify what is causing the serialization.
Secondly, in Lustre 1.8 with Version Based Recovery, it would be
possible for the MDS and OSS to determine whether the file being
fsync'd has any uncommitted changes and, if not, do nothing at all.
With an fsync() (as opposed to a filesystem-wide "sync"), the client
should send the FID or object ID to the server to identify the file
being fsync'd. With VBR there is a version stored on each inode that
contains the transaction number in which it was last modified. If the
inode version is older than the filesystem's last_committed
transaction number, then it is already on stable storage and nothing
needs to be done. However, that is just my 5-minute investigation and
there may be some hole in that logic.
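The check could be sketched roughly as below; vbr_version,
last_committed, and fsync_needs_commit are hypothetical names for
illustration, not the actual Lustre symbols:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical server-side state: each inode records the transaction
 * number (VBR version) in which it was last modified, and the
 * filesystem tracks the last transaction committed to stable storage. */
struct inode_info {
    uint64_t vbr_version;   /* transno of last modification */
};

/* Sketch of the proposed fast path for fsync() on the server: if the
 * inode's last modification is already committed, there is nothing to
 * flush and the SYNC RPC can return without forcing a disk commit. */
static bool fsync_needs_commit(const struct inode_info *inode,
                               uint64_t last_committed)
{
    return inode->vbr_version > last_committed;
}
```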
> ================ source of time_fsync.c ==============================
> #include "mpi.h"
> #include <string.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
>
> int main(int argc, char **argv)
> {
> double t1;
> char *opt_filename = NULL;
> int mynod, fd, err;
> int ch; /* getopt() returns int; char cannot reliably hold -1/EOF */
>
> MPI_Init(&argc,&argv);
> MPI_Comm_rank(MPI_COMM_WORLD, &mynod);
>
> while ((ch = getopt(argc, argv, "f:")) != -1) {
> switch(ch) {
> case 'f':
> opt_filename = strdup(optarg);
> if (mynod == 0)
> printf("filename=%s \n", opt_filename);
> break;
> }
> }
>
> // Proc 0 opens/creates the file
> if (mynod == 0) {
> fd = open(opt_filename, O_RDWR | O_CREAT, 0666);
>
> t1 = MPI_Wtime();
> fsync(fd);
> printf("First sync (proc 0): %.6f\n", MPI_Wtime()-t1);
>
> close(fd);
> }
>
> MPI_Barrier(MPI_COMM_WORLD);
> fd = open(opt_filename, O_RDWR);
> MPI_Barrier(MPI_COMM_WORLD);
>
> t1 = MPI_Wtime();
> err=fsync(fd);
> printf("%.2d: sync : %.6f (err=%d)\n", mynod, MPI_Wtime()-t1, err);
>
> close(fd);
> MPI_Finalize();
>
> return 0;
> }
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.