[lustre-discuss] fio and lustre performance

John Bauer bauerj at iodoctors.com
Fri Aug 26 07:26:53 PDT 2022


I reran the job, changing only --ioengine=posixaio to 
--ioengine=libaio.  Since I don't intercept the functions in libaio, I 
don't see much of the application's calls except for the open and close, 
but the plot of osc cache usage versus time gives a good feel for what 
is going on.  The first plot below is the osc cache usage when using 
libaio.  The second plot below is the osc cache usage when using the 
pthread aio handler (paio).  I should point out that the pthread aio 
handler I'm using is not the one in librt.so, but rather one that I 
wrote.  (My apologies for having pasted the wrong URL for plot 3 in the 
previous email; it is the second plot in this email.)  Note that the 
libaio run took 50 seconds, and the pthread aio handler run took 28 
seconds.
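
For reference, the only change from the command line quoted in full 
below was the ioengine; a sketch of the libaio variant, assuming 
everything else stays the same:

   fio --randrepeat=1 \
       --ioengine=libaio \
       --buffered=1 \
       --gtod_reduce=1 \
       --name=test \
       --filename=${fileName} \
       --bs=1M \
       --iodepth=64 \
       --size=40G \
       --readwrite=randwrite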

This time, for completeness, I ran the job using --ioengine=posixaio, 
but using the aio calls in librt.so.  The third plot below is the osc 
cache usage for this run.  The I/O phase here took 48 seconds.  The 
fourth plot below is the file position activity for this run.  Note that 
again there are pauses that seem to stop all progress.  Perhaps, since 
the job is running considerably slower, the write-behind is keeping up 
with the dirty data being generated, and that is why we see fewer pauses 
than in the initial run.  I will investigate the max_dirty_mb and 
max_rpcs_in_flight angle and send an update later.
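
A sketch of the client-side checks I have in mind, using lctl and the 
parameter names Andreas mentions below:

   # per-OSC dirty cache limit and RPC concurrency
   lctl get_param osc.*.max_dirty_mb
   lctl get_param osc.*.max_rpcs_in_flight

   # client-wide cached-data limit
   lctl get_param llite.*.max_cached_mb

   # if the RPC concurrency turns out to be low, it can be raised,
   # e.g. to the "at least 64" Andreas suggests:
   lctl set_param osc.*.max_rpcs_in_flight=64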

I have no complaints about the performance in general when using my 
pthread aio handler.  I am mostly curious about the source of the 
pauses.  They always show up, but not in any regular pattern. Sometimes 
the long pauses have very few aio requests between them, sometimes not.

John

libaio.osc_cached : 
https://www.dropbox.com/s/7z8taqr2u9g139l/libaio.osc_cached.png?dl=0

paio.osc_cached : 
https://www.dropbox.com/s/l6bq8rpz6ij7vdo/paio.osc_cached.png?dl=0

librt.osc_cached : 
https://www.dropbox.com/s/4lh0dz9b286ci2l/librt.osc_cached.png?dl=0

librt.fpa : https://www.dropbox.com/s/5sy67ol3gdl49rg/librt.fpa.png?dl=0

On 8/25/22 18:37, Andreas Dilger wrote:
> No comment on the actual performance issue, but we normally test fio 
> using the libaio interface (which is handled in the kernel) instead of 
> posixaio (which is handled by threads in userspace, AFAIK), and also 
> use DirectIO to avoid memory copies (OK if there are enough IO 
> requests in flight).  It should be a relatively easy change to see if 
> that improves the behaviour.
>
> Other things to check - osc.*.max_dirty_mb and llite.*.max_cached_mb 
> are not hitting limits and throttling IO until the data is flushed, 
> and osc.*.max_rpcs_in_flight across the OSCs are *at least* 64 to keep 
> up with the input generation.
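
(For reference, a minimal sketch of the libaio + DirectIO variant 
described here, assuming the command line quoted further below is 
otherwise unchanged; --direct=1 takes the place of --buffered=1 since 
direct I/O bypasses the page cache:

   fio --randrepeat=1 \
       --ioengine=libaio \
       --direct=1 \
       --gtod_reduce=1 \
       --name=test \
       --filename=${fileName} \
       --bs=1M \
       --iodepth=64 \
       --size=40G \
       --readwrite=randwrite
)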
>
> When you consider that Lustre (distributed coherent persistent network 
> filesystem) is "only half" of the performance (28s vs 13s) of a local 
> ephemeral RAM-based filesystem, it isn't doing too badly...
>
> Cheers, Andreas
>
>> On Aug 25, 2022, at 11:29, John Bauer <bauerj at iodoctors.com> wrote:
>>
>> Hi all,
>>
>> I'm trying to figure out an odd behavior when running an fio 
>> ( https://git.kernel.dk/cgit/fio/ ) benchmark on a Lustre file system.
>>
>> fio --randrepeat=1  \
>>    --ioengine=posixaio  \
>>    --buffered=1  \
>>    --gtod_reduce=1  \
>>    --name=test  \
>>    --filename=${fileName}  \
>>    --bs=1M  \
>>    --iodepth=64  \
>>    --size=40G  \
>>    --readwrite=randwrite
>>
>> In short, the application queues 40,000 random aio_write64(nbyte=1M) 
>> requests to a maximum depth of 64, doing aio_suspend64 followed by 
>> aio_write to keep 64 outstanding aio requests.  My I/O library 
>> processes the aio requests with 4 pthreads that remove requests from 
>> the queue and issue the I/Os as pwrite64()s.  The odd behavior is the 
>> intermittent pauses that can be seen in the first plot below.  The 
>> X-axis is wall clock time, in seconds, and the left Y-axis is file 
>> position.  The horizontal blue lines indicate the amount of time each 
>> of the pwrite64 calls is active and where in the file the I/O is 
>> occurring.
>> The right Y-axis is the cumulative cpu time for both the process and 
>> the kernel during the run.  There is minimal user cpu time for either 
>> the process or the kernel.  The cumulative system cpu time 
>> attributable to the process (the red line) runs at a slope of ~4 
>> system cpu seconds per wall clock second, which makes sense since 
>> there are 4 pthreads at work in the user process.  The cumulative 
>> system cpu time for the kernel as a whole (the green line) runs at 
>> ~12 system cpu seconds per wall clock second.  Note that during the 
>> pauses the system cpu accumulation drops to near zero (zero slope).
>>
>> This is run on a dedicated Ivy Bridge node with 40 cores (Intel(R) 
>> Xeon(R) CPU E5-2680 v2 @ 2.80GHz).
>>
>> The node has 64G of memory.
>>
>> The file is striped as a single-component PFL, 8x1M.  Lustre version 
>> 2.12.8 ddn12.
>>
>> Does anyone have any ideas what is causing the pauses? Is there 
>> something else I could be looking at in the /proc file system to gain 
>> insight?
>>
>> For comparison, the 2nd plot below is from the same job run on /tmp. 
>> Note that there are some pwrite64() calls that take a long time, but 
>> a single pwrite64() taking a long time does not stop all the other 
>> pwrite64() calls active during the same time period.  Elapsed time 
>> for /tmp is 13 seconds; Lustre is 28 seconds.  Both are essentially 
>> memory resident.
>>
>> Just for completeness, I have added a 3rd plot, which shows the 
>> amount of memory each of the OSC clients is consuming over the course 
>> of the Lustre run.  Nothing unusual there.  The memory consumption 
>> rate slows down during the pauses, as one would expect.
>>
>> I don't think the instrumentation is the issue, as there is not much 
>> more instrumentation occurring in the Lustre run than in the /tmp 
>> run, and the instrumentation for each run amounts to less than 6MB 
>> in total.
>>
>> John
>>
>> In case the images got stripped, here are some URLs to Dropbox:
>>
>> plot1 : https://www.dropbox.com/s/ey217o053gdyse5/plot1.png?dl=0
>>
>> plot2 : https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0
>>
>> plot3 : https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0
>>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud