<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>I reran the job, changing only --ioengine=posixaio to
  --ioengine=libaio. Since I don't intercept the functions in
  libaio, I don't see much in the way of application calls except
  for the open and close, but the plot of OSC cache usage versus
  time gives a good feel for what is going on. The first plot
  below is the OSC cache usage when using libaio. The second plot
  below is the OSC cache usage when using the pthread aio handler
  (paio). I should point out that the pthread aio handler I'm
  using is not the one in librt.so, but rather one that I wrote.
  (My apologies for having pasted the wrong URL for plot 3 in the
  previous email; it is the second plot in this one.) Note that
  the libaio run took 50 seconds, and the pthread aio handler run
  took 28 seconds.</p>
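<p>For reference, the rerun differed from the original job only in
  the ioengine; a minimal sketch of the two invocations (the
  filename is a placeholder) is:</p>
<pre>
# original run: POSIX AIO, intercepted by my I/O library (4 pthreads issuing pwrite64)
fio --randrepeat=1 --ioengine=posixaio --buffered=1 --gtod_reduce=1 \
    --name=test --filename=${fileName} --bs=1M --iodepth=64 \
    --size=40G --readwrite=randwrite

# rerun: kernel-handled AIO via libaio, everything else unchanged
fio --randrepeat=1 --ioengine=libaio --buffered=1 --gtod_reduce=1 \
    --name=test --filename=${fileName} --bs=1M --iodepth=64 \
    --size=40G --readwrite=randwrite
</pre>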
<p>This time, for completeness, I ran the job using
  --ioengine=posixaio, but using the aio calls in librt.so. The
  third plot below is the OSC cache usage for this run. The I/O
  phase here took 48 seconds. The fourth plot below is the file
  position activity for this run. Note that again there are the
  pauses that seem to stop all progress. Perhaps, since this job
  is running considerably slower, the write-behind is keeping up
  with the dirty data being generated, which is why we see fewer
  pauses than in the initial run. I will investigate the
  max_dirty_mb and max_rpcs_in_flight angle and send an update
  later.</p>
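<p>For that follow-up, something along these lines (just a sketch
  using the standard tunables Andreas mentions below; the set_param
  value is only an example) should show whether the client is being
  throttled on dirty data or RPC concurrency:</p>
<pre>
# per-OSC dirty-data limit and RPC concurrency (Andreas suggests at least 64 in flight)
lctl get_param osc.*.max_dirty_mb
lctl get_param osc.*.max_rpcs_in_flight

# client-wide cached-page limit
lctl get_param llite.*.max_cached_mb

# example only: raise the in-flight RPC count on every OSC (requires root)
lctl set_param osc.*.max_rpcs_in_flight=64
</pre>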
<p>I have no complaints about the performance in general when
  using my pthread aio handler. I am mostly curious about the
  source of the pauses. They always show up, but not in any
  regular pattern: sometimes the long pauses have very few aio
  requests between them, sometimes not.</p>
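<p>To try to correlate the pauses with cache behavior, a crude
  sampling loop like this could run alongside the job (a sketch; it
  assumes the osc_cached_mb and cur_dirty_bytes parameters are
  available on this 2.12 client):</p>
<pre>
# sample per-OSC cached and dirty data once a second while the fio job runs
while sleep 1; do
    echo "=== $(date +%T) ==="
    lctl get_param osc.*.osc_cached_mb
    lctl get_param osc.*.cur_dirty_bytes
done > osc_cache_trace.txt
</pre>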
<p>John</p>
<p>libaio.osc_cached :
<a class="moz-txt-link-freetext" href="https://www.dropbox.com/s/7z8taqr2u9g139l/libaio.osc_cached.png?dl=0">https://www.dropbox.com/s/7z8taqr2u9g139l/libaio.osc_cached.png?dl=0</a></p>
<p>paio.osc_cached :
<a class="moz-txt-link-freetext" href="https://www.dropbox.com/s/l6bq8rpz6ij7vdo/paio.osc_cached.png?dl=0">https://www.dropbox.com/s/l6bq8rpz6ij7vdo/paio.osc_cached.png?dl=0</a></p>
<p>librt.osc_cached :
<a class="moz-txt-link-freetext" href="https://www.dropbox.com/s/4lh0dz9b286ci2l/librt.osc_cached.png?dl=0">https://www.dropbox.com/s/4lh0dz9b286ci2l/librt.osc_cached.png?dl=0</a></p>
<p>librt.fpa :
<a class="moz-txt-link-freetext" href="https://www.dropbox.com/s/5sy67ol3gdl49rg/librt.fpa.png?dl=0">https://www.dropbox.com/s/5sy67ol3gdl49rg/librt.fpa.png?dl=0</a></p>
<div class="moz-cite-prefix">On 8/25/22 18:37, Andreas Dilger wrote:<br>
</div>
<blockquote type="cite"
cite="mid:D8D6837D-0F4E-4C17-8F07-F0B202E38664@ddn.com">
      No comment on the actual performance issue, but we normally test
      fio using the libaio interface (which is handled in the kernel)
      instead of posixaio (which is handled by threads in userspace,
      AFAIK), and also use DirectIO to avoid memory copies (OK if there
      are enough IO requests in flight). It should be a relatively easy
      change to see if that improves the behaviour.
<div class=""><br class="">
</div>
<div class="">Other things to check - osc.*.max_dirty_mb and
llite.*.max_cached_mb are not hitting limits and throttling IO
until the data is flushed, and osc.*.max_rpcs_in_flight across
the OSCs are *at least* 64 to keep up with the input generation.</div>
<div class=""><br class="">
</div>
<div class="">When you consider that Lustre (distributed coherent
persistent network filesystem) is "only half" of the performance
(28s vs 13s) of a local ephemeral RAM-based filesystem, it isn't
doing too badly...</div>
<div class=""><br class="">
</div>
<div class="">Cheers, Andreas<br class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 25, 2022, at 11:29, John Bauer <<a
href="mailto:bauerj@iodoctors.com"
class="moz-txt-link-freetext" moz-do-not-send="true">bauerj@iodoctors.com</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">Hi all,<br class="">
<br class="">
              I'm trying to figure out an odd behavior when running an
              fio (<a href="https://git.kernel.dk/cgit/fio/"
                class="moz-txt-link-freetext" moz-do-not-send="true">https://git.kernel.dk/cgit/fio/</a>)
              benchmark on a Lustre file system.<br class="">
<br class="">
              fio --randrepeat=1 \<br class="">
--ioengine=posixaio \<br class="">
--buffered=1 \<br class="">
--gtod_reduce=1 \<br class="">
--name=test \<br class="">
--filename=${fileName} \<br class="">
--bs=1M \<br class="">
--iodepth=64 \<br class="">
--size=40G \<br class="">
--readwrite=randwrite<br class="">
<br class="">
              In short, the application queues 40,000 random
              aio_write64(nbyte=1M) requests to a maximum depth of 64,
              doing aio_suspend64 followed by aio_write64 to keep 64
              aio requests outstanding. My I/O library that processes
              the aio requests does so with 4 pthreads removing aio
              requests from the queue and doing the I/Os as
              pwrite64()s. The odd behavior is the intermittent
              pauses that can be seen in the first plot below. The
              X-axis is wall clock time, in seconds, and the left
              Y-axis is file position. The horizontal blue lines
              indicate the amount of time each of the pwrite64 calls is
              active and where in the file the I/O is occurring. The
              right Y-axis is the cumulative cpu times for both the
              process and the kernel during the run. There is minimal
              user cpu time, for either the process or the kernel. The
              cumulative system cpu time attributable to the process
              (the red line) runs at a slope of ~4 system cpu seconds
              per wall clock second. That makes sense, since there are 4
              pthreads at work in the user process. The cumulative
              system cpu time for the kernel as a whole (the green
              line) is ~12 system cpu seconds per wall clock second.
              Note that during the pauses the system cpu accumulation
              drops to near zero (zero slope).<br class="">
<br class="">
              This is run on a dedicated ivybridge node with 40 cores
              (Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz).<br class="">
              <br class="">
              The node has 64G of memory.<br class="">
              <br class="">
              The file is striped as a single-component PFL, 8x1M. The
              Lustre version is 2.12.8 ddn12.<br class="">
<br class="">
              Does anyone have any ideas about what is causing the
              pauses? Is there something else I could be looking at in
              the /proc file system to gain insight?<br class="">
              <br class="">
              For comparison, the 2nd plot below is the same job run on
              /tmp. Note that there are some pwrite64() calls that take
              a long time, but a single pwrite64() taking a long time
              does not stop all the other pwrite64() calls active
              during the same time period. Elapsed time for /tmp is 13
              seconds; for Lustre it is 28 seconds. Both are
              essentially memory resident.<br class="">
<br class="">
              Just for completeness, I have added a 3rd plot, which is
              the amount of memory each of the OSC clients is consuming
              over the course of the Lustre run. Nothing unusual there.
              The memory consumption rate slows down during the pauses,
              as one would expect.<br class="">
              <br class="">
              I don't think the instrumentation is the issue, as there
              is not much more instrumentation occurring in the Lustre
              run than in the /tmp run, and it amounts to less than 6MB
              in total for each.<br class="">
<br class="">
John<br class="">
<br class="">
In case the images got stripped here are some URLs to
dropbox<br class="">
<br class="">
plot1 : <a
href="https://www.dropbox.com/s/ey217o053gdyse5/plot1.png?dl=0"
class="moz-txt-link-freetext" moz-do-not-send="true">
https://www.dropbox.com/s/ey217o053gdyse5/plot1.png?dl=0</a><br class="">
<br class="">
plot2 : <a
href="https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0"
class="moz-txt-link-freetext" moz-do-not-send="true">
https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0</a><br class="">
<br class="">
plot3 : <a
href="https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0"
class="moz-txt-link-freetext" moz-do-not-send="true">
https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0</a><br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
</div>
</div>
</blockquote>
</div>
<br class="">
<div class="">
<div dir="auto" class="">
<div dir="auto" class="">
<div dir="auto" class="">
<div dir="auto" class="">
<div dir="auto" class="">
<div dir="auto" class="">
<div>Cheers, Andreas</div>
<div>--</div>
<div>Andreas Dilger</div>
<div>Lustre Principal Architect</div>
<div>Whamcloud</div>
<div><br class="">
</div>
<div><br class="">
</div>
<div><br class="">
</div>
</div>
</div>
</div>
</div>
</div>
<br class="Apple-interchange-newline">
</div>
<br class="Apple-interchange-newline">
<br class="Apple-interchange-newline">
</div>
<br class="">
</div>
</blockquote>
</body>
</html>