[lustre-discuss] Writing to a single big file is slower

Wed Oct 10 14:21:15 PDT 2018

On Oct 10, 2018, at 15:01, Kal Alfizah <kalfizah at outlook.com> wrote:
> 
> Hello,
> 
> I am running IOR on a single node against a Lustre filesystem, and I notice that writing to a single big file is slower. I would have expected writing to many small files to be slower. Any ideas why this is? And is there a Lustre setting that can fix it? This is lustre-2.10.4.
> 
> # mpirun -np 32  /temp/ior/bin/ior -a POSIX -C -v -w -k -F -i 1 -t 1m -b 8G -o /mnt/lustrefs/begdon/test2/eachfile-256G
> ...
> Max Write: 4495.70 MiB/sec (4714.08 MB/sec)
> ...
> 
> # mpirun -np 32 /temp/ior/bin/ior -a POSIX -C -v -w -k -i 1 -t 1m -b 8G -o /mnt/lustrefs/begdon/test2/bigfile-256G
> ...
> Max Write: 1331.97 MiB/sec (1396.67 MB/sec)
> ...

Kal,
this is because the file-per-process run has no contention between client threads: each file gets a single LDLM lock, and the client writes a contiguous stream of data to that one file.  Creating only 32 files has no noticeable overhead, so file creation is not a factor in the performance here.  Once there are many thousands or millions of files, the file creation overhead will become more significant.

In the shared-single-file run the threads have to contend with each other, so there is LDLM locking overhead between threads/nodes.  If all of the threads are on a single client, this can also potentially prevent RPCs from being formed properly when there is too much dirty data on the client and that data is spread around the file.
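
To see why the dirty data ends up spread around the file, it may help to sketch where each rank writes in the shared-file run.  The snippet below assumes IOR's default segmented layout with a single segment, in which rank r writes its block starting at offset r * block size; that layout is an assumption about the run above, not something shown in the output, and rank_offset_gib is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: starting offset of each rank's block in the shared file,
# assuming IOR's default segmented layout with one segment
# (rank r starts at r * block size).
BLOCK_GIB=8     # -b 8G from the ior command line
NPROCS=32       # -np 32 from the mpirun command line

# Hypothetical helper: print the starting offset (in GiB) for a given rank.
rank_offset_gib() {
    echo $(( $1 * BLOCK_GIB ))
}

rank=0
while [ "$rank" -lt "$NPROCS" ]; do
    echo "rank $rank starts at $(rank_offset_gib "$rank") GiB"
    rank=$(( rank + 1 ))
done
```

Under that assumption, a single client has 32 separate write streams 8 GiB apart in the same file, all sharing the client's dirty-data cache.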

That said, I wouldn't expect the difference to be so large.  Is it possible that the shared-single-file case is only using a single OST stripe for the output?  With Lustre 2.10+ you can create a progressive file layout (PFL) that will distribute the IO across more OSTs as the file gets larger.  Something like:

client$ lfs setstripe -E1G -c1 -E16G -c4 -E-1 -c-1 /mnt/lustrefs/begdon/test2

which will use a single stripe below 1GB, 4 stripes from 1GB up to 16GB, and all available OSTs beyond 16GB (you can change these values arbitrarily per file or directory).  That will ensure that small (< 1GB) files do not have much overhead, but very large files can use the full IO bandwidth.
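
As a rough illustration of how the layout above maps file offsets to stripe counts, here is a small sketch.  It is plain shell arithmetic and does not query Lustre.  NUM_OSTS is an assumed value standing in for "all OSTs" (-c-1), the sizes are taken as powers of two (1G = 2^30 bytes), and pfl_stripe_count is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: which component of the example layout
#   -E1G -c1 -E16G -c4 -E-1 -c-1
# covers a given byte offset in the file.
NUM_OSTS=${NUM_OSTS:-8}          # assumption: replace with the real OST count
GIB=$(( 1024 * 1024 * 1024 ))

# Hypothetical helper: print the stripe count that applies at a byte offset.
pfl_stripe_count() {
    if [ "$1" -lt "$GIB" ]; then
        echo 1                   # first component: [0, 1G)
    elif [ "$1" -lt $(( 16 * GIB )) ]; then
        echo 4                   # second component: [1G, 16G)
    else
        echo "$NUM_OSTS"         # last component: [16G, EOF), all OSTs
    fi
}

pfl_stripe_count 0
pfl_stripe_count $(( 8 * GIB ))
pfl_stripe_count $(( 200 * GIB ))
```

For a 256 GiB shared file, most of the data would fall in the last component and so be spread over every OST.  Running lfs getstripe on the file shows the layout actually in use.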

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud