[lustre-discuss] varying sequential read performance.

John Bent johnbent at gmail.com
Thu Apr 5 10:33:47 PDT 2018


"I suspect that this OSC is using an OSS that is under heavier load."

If you want to confirm this, it seems like you could create files with
striping parameters such that you have a single file on each OSS.  Well, I
know you can make stripe=1 so it's only on one OSS, but can you
control/query *which* OSS the stripe is on?  Assuming you can, then you
just benchmark read performance for each file (i.e. each OSS) and you can
discover more explicitly whether you have a slow OSS.
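
A rough sketch of that idea, in case it helps.  As far as I know,
`lfs setstripe -c 1 -i <index>` places a one-stripe file on a chosen OST
index and `lfs getstripe` reports where a file landed.  Note that the index
names an OST, not an OSS, so mapping OSTs to servers is a separate step
(something like `lctl get_param osc.*.ost_conn_uuid` on the client should
show it).  The directory, file names, and sizes below are made up for
illustration, and the script only prints the commands unless DRY_RUN=0.

```shell
#!/bin/bash
# Sketch: one single-stripe file per OST index, then a sequential read of each.
# TESTDIR, the file naming, BS, and COUNT are illustrative assumptions.
TESTDIR=${TESTDIR:-/mnt/lustre/ost_probe}   # hypothetical directory on Lustre
DRY_RUN=${DRY_RUN:-1}                       # 1 = only print the commands
BS=${BS:-4M}
COUNT=${COUNT:-1024}                        # 1024 x 4M = 4 GiB per file

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

probe_ost() {
    idx=$1
    f="$TESTDIR/probe.ost$idx"
    # Create an empty file with one stripe on OST index $idx.
    run lfs setstripe -c 1 -i "$idx" "$f"
    # Fill the file; notrunc so dd writes into the file created above.
    run dd if=/dev/zero of="$f" bs="$BS" count="$COUNT" conv=notrunc
    # Report the layout so the placement can be checked.
    run lfs getstripe "$f"
    # Sequential read; dd prints the transfer rate when it finishes.
    run dd if="$f" of=/dev/null bs="$BS"
}

# Probe the OST indices given on the command line, e.g.: ./probe.sh 0 1 2 3
for idx in "$@"; do
    probe_ost "$idx"
done
```

Reading a file right after writing it will mostly measure the client page
cache, so for real numbers the client cache would need to be dropped (or the
reads done from another client) between the write and the read.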

On Thu, Apr 5, 2018 at 9:31 AM, John Bauer <bauerj at iodoctors.com> wrote:

> Rick,
>
> Thanks for reply.  Also thanks to Patrick Farrell for making me rethink
> this.
>
> I am coming to believe that it is an OSS issue.  Every time I run this
> job, the first pass of dd is slow, which I now attribute to all the OSSs
> needing to initially read the data in from disk to OSS cache.  If a
> subsequent pass of dd gets back to the data soon enough, presumably while
> it is still in the OSS cache, I observe good performance.
> If not, performance goes back to the initial rates.
>
> Inspecting the individual wait times for each of the dd reads for one of
> the poorly performing dd passes, and correlating them to the OSC that
> fulfills each individual dd read, I see that 75% of the wait time is from
> a single OSC.  I suspect that this OSC is using an OSS that is under
> heavier load.
>
> I don't have access to the OSS so I can't report on the Lustre settings.  I
> think the client side max cached is 50% of memory.
>
> After speaking with Doug Petesch of Cray, I thought I would look into NUMA
> effects on this job.  I now also monitor the contents of
> */sys/devices/system/node/node?/meminfo*
> and ran the job with *numactl --cpunodebind=0*
> Interestingly enough, I now sometimes get dd transfer rates of 2.2GiB/s.
> Plotting the .../node?/meminfo[FilePages] value versus time for the 2
> cpunodes shows that the
> data is now mostly placed on node0.  Unfortunately, the variable rates
> still remain, as one would expect if it is an OSS caching issue, but even
> the slow passes are faster than before.
>
> Resulting plot with all *numactl --cpunodebind=0*
>
>
> Resulting plots with *numactl --cpunodebind=x* where x alternates between
> 0 and 1 for each subsequent dd pass.  And indeed, the file pages migrate
> between cpunodes.
>
>
>
> On 4/5/2018 9:22 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
>
> John,
>
> I had a couple of thoughts (though not sure if they are directly relevant to your performance issue):
>
> 1) Do you know what caching settings are applied on the lustre servers?  This could have an impact on performance, especially if your tests are being run while others are doing IO on the system.
>
> 2) It looks like there is a parameter called llite.<fsname>.max_cached_mb that controls how much client side data is cached.  According to the manual, the default value is 3/4 of the host’s RAM (which would be 48GB in your case).  I don’t know why the cache seems to be used unevenly between your 4 OSTs, but it might explain why the cache for some OSTs decreases when it increases for others.
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>
>  On Apr 2, 2018, at 8:06 PM, John Bauer <bauerj at iodoctors.com> wrote:
>
> I am running dd 10 times consecutively to read a 64GB file ( stripeCount=4 stripeSize=4M ) on a Lustre client (version 2.10.3) that has 64GB of memory.
> The client node was dedicated.
>
> for pass in 1 2 3 4 5 6 7 8 9 10
> do
>    dd of=/dev/null if=${file} count=128000 bs=512K
> done
>
> Instrumentation of the I/O from dd reveals varying performance.  In the plot below, the bottom frame has wall time
> on the X axis, and file position of the dd reads on the Y axis, with a dot plotted at the wall time and starting file position of every read.
> The slopes of the lines indicate the data transfer rate, which varies from 475MB/s to 1.5GB/s.  The last 2 passes have sharp breaks
> in the performance, one with increasing performance, and one with decreasing performance.
>
> The top frame indicates the amount of memory used by each of the file's 4 OSCs over the course of the 10 dd runs.  Nothing terribly odd here except that
> one of the OSCs eventually has its entire stripe ( 16GB ) cached and then never gives any up.
>
> I should mention that the file system has 320 OSTs.  I found LU-6370 which eventually started discussing LRU management issues on systems with high
> numbers of OSTs leading to reduced RPC sizes.
>
> Any explanations for the varying performance?
> Thanks,
> John
>
> <johbmffmkkegkbkh.png>
> --
> I/O Doctors, LLC
> 507-766-0378
> bauerj at iodoctors.com
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bcmgmmocefcclfeb.png
Type: image/png
Size: 26211 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20180405/05cf96bc/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dmnkhfdbfaoanfoi.png
Type: image/png
Size: 32829 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20180405/05cf96bc/attachment-0003.png>