[lustre-discuss] Writing to a single big file is slower

Patrick Farrell paf at cray.com
Wed Oct 10 19:13:13 PDT 2018


Note also that there are various sources of contention when many processes are writing to one file.  Files have various attributes that must be kept in sync while being written (size, etc.), and keeping them up to date naturally generates some contention.

So in general, file-per-process workloads will give superior write performance to shared-file workloads.  There are various things you can do to improve this, but the basic fact remains.
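
For example (assuming your IOR build has MPI-IO support), MPI-IO collective buffering can aggregate the per-rank writes before they reach the filesystem, which cuts lock contention on the shared file:

# mpirun -np 32 /temp/ior/bin/ior -a MPIIO -c -w -k -i 1 -t 1m -b 8G -o /mnt/lustrefs/begdon/test2/bigfile-256G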

________________________________
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Kal Alfizah <kalfizah at outlook.com>
Sent: Wednesday, October 10, 2018 5:26:11 PM
To: Andreas Dilger
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] Writing to a single big file is slower


Thank you Andreas, appreciate your kind and clear explanation.


I set up three directories, test1, test2, and test3, each with a different stripe layout: test1 with -S 0 -c 1, test2 with -S 0 -c -1, and test3 with -S 16M -c -1.  The results were about the same for all three.  I haven't tried PFL yet.
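
That is, roughly (directory paths assumed to match the IOR runs below):

client$ lfs setstripe -S 0 -c 1 /mnt/lustrefs/begdon/test1
client$ lfs setstripe -S 0 -c -1 /mnt/lustrefs/begdon/test2
client$ lfs setstripe -S 16M -c -1 /mnt/lustrefs/begdon/test3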


I will just use file-per-process output for now.


Cheers,

Kal

________________________________
From: Andreas Dilger <adilger at whamcloud.com>
Sent: Wednesday, October 10, 2018 2:21:15 PM
To: Kal Alfizah
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] Writing to a single big file is slower

On Oct 10, 2018, at 15:01, Kal Alfizah <kalfizah at outlook.com> wrote:
>
> Hello,
>
> I'm running IOR from a single node against a Lustre filesystem, and I notice that writing to a single big file is slower.  I would have thought that writing to many small files would be slower.  Any idea why this is, and whether there is a Lustre setting that can fix it?  This is lustre-2.10.4.
>
> # mpirun -np 32  /temp/ior/bin/ior -a POSIX -C -v -w -k -F -i 1 -t 1m -b 8G -o /mnt/lustrefs/begdon/test2/eachfile-256G
> ...
> Max Write: 4495.70 MiB/sec (4714.08 MB/sec)
> ...
>
> # mpirun -np 32 /temp/ior/bin/ior -a POSIX -C -v -w -k -i 1 -t 1m -b 8G -o /mnt/lustrefs/begdon/test2/bigfile-256G
> ...
> Max Write: 1331.97 MiB/sec (1396.67 MB/sec)
> ...

Kal,
this is because the file-per-process output does not have any contention between client threads, which means each file gets a single LDLM lock and the client writes a contiguous stream of data to the one file.  Creating only 32 files has no noticeable overhead, so file creation is not a factor in the performance.  Once there are many thousands or millions of files, the creation overhead will become more significant.
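
You can see this directly by watching the extent lock counts on the client while the job runs (the exact namespace entries vary per filesystem and OST):

client$ lctl get_param ldlm.namespaces.*.lock_count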

The shared-single-file output has to contend between threads, so there is LDLM locking overhead between threads/nodes.  If all of the threads are on a single client, it can also cause issues where RPCs are not formed properly when there is too much dirty data on the client and that data is spread around the file.
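
If too much dirty data per OSC is the issue, the client-side limit can be inspected and raised (the value here is only illustrative):

client$ lctl get_param osc.*.max_dirty_mb
client$ lctl set_param osc.*.max_dirty_mb=512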

That said, I wouldn't expect the difference to be so large.  Is it possible that the shared-single-file case is only using a single OST stripe for the output?  With Lustre 2.10+ you can create a progressive file layout (PFL) that distributes the IO across more OSTs as the file grows.  Something like:

client$ lfs setstripe -E1G -c1 -E16G -c4 -E-1 -c-1 /mnt/lustrefs/begdon/test2

which will use a single stripe below 1GB, 4 stripes up to 16GB, and full striping beyond 16GB (you can change these values arbitrarily per file or directory).  That ensures that small (< 1GB) files do not have much overhead, while very large files can use the full IO bandwidth.
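
You can confirm the layout that new files in the directory will inherit with:

client$ lfs getstripe -d /mnt/lustrefs/begdon/test2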

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud
