[Lustre-discuss] Performances and fsync()

Sat Dec 5 17:14:37 PST 2009

On 2009-12-04, at 03:24, pascal.deveze at bull.net wrote:
> I am using b_eff_io to measure the performance of ROMIO over Lustre
> version 1.6.7.1.  I am using the new ADIO Lustre driver and saw that
> performance is very low.  The reason is that the write bandwidth is
> calculated after a call to fsync().
>
> After investigation, I saw that even when the file is empty, the
> fsync takes 10 ms. If there is more than one process, the fsync calls
> seem to be serialized.  The time is 80 ms for 8 processes:
>
> salloc -n 8 -N 1 mpirun time-fsync -f /mnt/romio/FILE
> filename=/mnt/romio/FILE
> First sync (proc 0): 0.005534
> 03: sync : 0.019168 (err=0)
> 07: sync : 0.028794 (err=0)
> 01: sync : 0.038586 (err=0)
> 05: sync : 0.048467 (err=0)
> 02: sync : 0.058380 (err=0)
> 00: sync : 0.068205 (err=0)
> 04: sync : 0.078027 (err=0)
> 06: sync : 0.087960 (err=0)

Very strange.

> 1) Is this behaviour normal for Lustre ?

Not AFAIK.  For proper data consistency, Lustre is not only flushing
the cache for the file descriptor (which is empty in this case), but
is also sending a SYNC RPC to the MDS to ensure that the metadata for
this file is persistent on disk.  From reading the code, a regular
sys_fsync() _should_ only cause an MDS SYNC RPC, while sys_fdatasync()
will also cause an OSS SYNC RPC for each stripe.  That said, I'm not
100% sure the kernel got this right before the very latest kernels
(i.e. 2.6.32).
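
One way to check whether the two calls really behave differently on a
given client is to time them back to back on the same descriptor.  A
minimal sketch, assuming a POSIX system; now_sec(), time_sync_call()
and compare_sync_calls() are hypothetical helper names, not part of
Lustre or of the test program quoted below:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Wall-clock seconds from a monotonic clock. */
double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/*
 * Time a single sync call on fd.  A non-zero use_fdatasync selects
 * fdatasync() instead of fsync().  Returns the elapsed seconds, or
 * -1.0 if the call itself failed.
 */
double time_sync_call(int fd, int use_fdatasync)
{
	double t0 = now_sec();
	int rc = use_fdatasync ? fdatasync(fd) : fsync(fd);
	double dt = now_sec() - t0;

	return rc == 0 ? dt : -1.0;
}

/* Open (or create) path, time both calls once, and print the results. */
int compare_sync_calls(const char *path)
{
	int fd = open(path, O_RDWR | O_CREAT, 0666);

	if (fd < 0)
		return -1;
	printf("fsync:     %.6f s\n", time_sync_call(fd, 0));
	printf("fdatasync: %.6f s\n", time_sync_call(fd, 1));
	close(fd);
	return 0;
}
```

Calling compare_sync_calls() on a file in the Lustre mount and on a
local file gives a rough idea of how much of the time is RPC latency.
A single sample is noisy, so repeat it a few times before drawing
conclusions.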

I'm not sure of the exact semantics of fsync() in NFS, whether it is  
essentially a no-op when there is no dirty data in cache, because the  
writes themselves are always synchronous and there is no need to do  
anything on the server.

The Lustre RPCs _should_ all be happening in parallel, from looking at  
your program below, but it is possible that they are not arriving  
_quite_ at the same time on the server, and this is forcing an extra  
transaction commit for each RPC.  The times are about right - 10ms to  
do a seek on a disk, so this looks like about a single seek for each  
RPC.
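
To put a number on "about a single seek for each RPC", the per-rank
times can be sorted and the mean gap between consecutive completions
computed.  A small sketch; mean_completion_gap() is a hypothetical
helper name, and it sorts the caller's array in place:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/*
 * Sort the completion times (seconds) and return the mean gap between
 * consecutive completions.  Returns 0.0 for fewer than two samples.
 */
double mean_completion_gap(double *t, size_t n)
{
	if (n < 2)
		return 0.0;
	qsort(t, n, sizeof(*t), cmp_double);
	return (t[n - 1] - t[0]) / (double)(n - 1);
}
```

For the eight times in the run above (0.019168 up to 0.087960) this
gives (0.087960 - 0.019168) / 7, or about 9.8 ms per call, which is
consistent with the calls completing one after another rather than
together.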

> 2) Is it possible to configure something to make this fsync() run
> better ?

Some filesystems (e.g. Reiser4) have the dubious optimization of
disabling fsync() altogether, because it slows down applications too
much, but if applications are calling fsync() it is generally for a
good reason (though, I admit, not always).

As for legitimately optimizing this, there are a few options.  The
RPCs, and the corresponding file operations on the servers, should
happen in parallel, and I'm not sure why at least most of them are not
being aggregated into the same transaction.  Getting debug logs from
the servers and looking into why they are not grouped into a single
transaction should identify what is causing the serialization.

Secondly, in Lustre 1.8 with Version Based Recovery, it would be  
possible for the MDS and OSS to determine if the file being fsync'd  
has any uncommitted changes and, if not, to do nothing at all.
With an fsync() (as opposed to a filesystem-wide "sync"), the client
should send the FID or object ID to the server to identify the file
being fsync'd.  With VBR there is a version stored on each inode that
contains the transaction number in which it was last modified.  If the  
inode version is older than the filesystem's last_committed   
transaction number, then it is already on stable storage and nothing  
needs to be done.  However, that is just my 5-minute investigation and  
there may be some hole in that logic.
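
The check itself would be tiny.  A sketch of the idea only, not actual
Lustre code: fsync_needs_commit() and its parameter names are
hypothetical, and treating a version equal to last_committed as
already committed is my assumption:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * inode_version:  transaction number in which the inode was last
 *                 modified (as stored with VBR).
 * last_committed: highest transaction number known to be on stable
 *                 storage for this filesystem.
 *
 * Returns true if the server still has to force a commit for this
 * file, false if the fsync can be answered without any disk I/O.
 */
bool fsync_needs_commit(uint64_t inode_version, uint64_t last_committed)
{
	return inode_version > last_committed;
}
```

With this, an fsync() on an unmodified file (like the empty file in
the test below) would return without forcing a transaction commit.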

> ================ source of time_fsync.c ==============================
> #include "mpi.h"
> #include <string.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
>
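> int main(int argc, char **argv)
> {
> 	double t1;
> 	char *opt_filename = NULL;
> 	int mynod, fd, err;
> 	int ch;	/* getopt() returns int; a char may never compare equal to -1 */
>
> 	MPI_Init(&argc,&argv);
> 	MPI_Comm_rank(MPI_COMM_WORLD, &mynod);
>
> 	while ((ch = getopt( argc, argv, "f:" )) != -1) {
> 		switch(ch) {
> 		case 'f':
> 			opt_filename = strdup(optarg);
> 			if (mynod == 0)
> 				printf("filename=%s \n", opt_filename);
> 			break;
> 		}
> 	}
>
> 	// Without -f there is nothing to open, so bail out cleanly
> 	if (opt_filename == NULL) {
> 		if (mynod == 0)
> 			fprintf(stderr, "usage: %s -f filename\n", argv[0]);
> 		MPI_Finalize();
> 		return 1;
> 	}
>
> 	// Proc 0 opens/creates the file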
> 	if (mynod == 0) {
> 		fd = open(opt_filename, O_RDWR | O_CREAT, 0666);
>
> 		t1 = MPI_Wtime();
> 		fsync(fd);
> 		printf("First sync (proc 0): %.6f\n", MPI_Wtime()-t1);
>
> 		close(fd);
> 	}
>
> 	MPI_Barrier(MPI_COMM_WORLD);
> 	fd = open(opt_filename, O_RDWR);
> 	MPI_Barrier(MPI_COMM_WORLD);
>
> 	t1 = MPI_Wtime();
> 	err=fsync(fd);
> 	printf("%.2d: sync : %.6f (err=%d)\n", mynod, MPI_Wtime()-t1, err);
>
> 	close(fd);
> 	MPI_Finalize();
>
> 	return 0;
> }

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.