[lustre-discuss] fio and lustre performance

John Bauer bauerj at iodoctors.com
Thu Aug 25 10:29:03 PDT 2022


Hi all,

I'm trying to figure out an odd behavior when running an fio 
( https://git.kernel.dk/cgit/fio/ ) benchmark on a Lustre file system.

fio  --randrepeat=1  \
     --ioengine=posixaio  \
     --buffered=1  \
     --gtod_reduce=1  \
     --name=test  \
     --filename=${fileName}  \
     --bs=1M  \
     --iodepth=64  \
     --size=40G  \
     --readwrite=randwrite

In short, the application queues 40,000 random aio_write64(nbyte=1M) 
requests to a maximum depth of 64, doing aio_suspend64 followed by 
aio_write to keep 64 outstanding aio requests.  My I/O library that 
processes the aio requests does so with 4 pthreads removing aio 
requests from the queue and doing the I/Os as pwrite64()s.  The odd 
behavior is the intermittent pauses that can be seen in the first plot 
below.  The X-axis is wall clock time, in seconds, and the left Y-axis 
is file position.  The horizontal blue lines indicate how long each 
pwrite64() is active and where in the file the I/O is occurring.  The 
right Y-axis is the cumulative cpu times for both the process and the 
kernel during the run.  There is minimal user cpu time for either the 
process or the kernel.  The cumulative system cpu time attributable to 
the process ( the red line ) runs at a slope of ~4 system cpu seconds 
per wall clock second, which makes sense since there are 4 pthreads at 
work in the user process.  The cumulative system cpu time for the 
kernel as a whole ( the green line ) runs at ~12 system cpu seconds 
per wall clock second.  Note that during the pauses the system cpu 
accumulation drops to near zero ( zero slope ).

This is run on a dedicated Ivy Bridge node with 40 cores ( Intel(R) 
Xeon(R) CPU E5-2680 v2 @ 2.80GHz ).

The node has 64G of memory.

The file is striped with a single-component PFL layout, 8x1M.  The 
Lustre version is 2.12.8 ddn12.

Does anyone have any ideas what is causing the pauses?  Is there 
something else I could be looking at in the /proc file system to gain 
insight?

For comparison, the 2nd plot below is from the same run on /tmp.  Note 
that some pwrite64()s take a long time, but a single slow pwrite64() 
does not stall all the other pwrite64()s active during the same time 
period.  Elapsed time for /tmp is 13 seconds; Lustre is 28 seconds.  
Both are essentially memory resident.

Just for completeness I have added a 3rd plot which is the amount of 
memory each of the OSC clients is consuming over the course of the 
Lustre run.  Nothing unusual there.  The memory consumption rate slows 
down during the pauses as one would expect.

I don't think the instrumentation is the issue, as there is not much 
more instrumentation data generated in the Lustre run than in the /tmp 
run, and both are less than 6MB in total.

John

In case the images got stripped here are some URLs to dropbox

plot1 : https://www.dropbox.com/s/ey217o053gdyse5/plot1.png?dl=0

plot2 : https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0

plot3 : https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0






