[lustre-discuss] varying sequential read performance.

Thu Apr 5 09:19:05 PDT 2018

> On Apr 5, 2018, at 11:31 AM, John Bauer <bauerj at iodoctors.com> wrote:
> 
> I don't have access to the OSS, so I can't report on the Lustre settings.  I think the client-side max cached is 50% of memory.

Looking at your cache graph, that looks about right.

> After speaking with Doug Petesch of Cray, I thought I would look into NUMA effects on this job.  I now also monitor the contents of
> /sys/devices/system/node/node?/meminfo 
> and ran the job with numactl --cpunodebind=0
> Interestingly enough, I now sometimes get dd transfer rates of 2.2GiB/s.  Plotting the .../node?/meminfo[FilePages] value versus time for the 2 cpunodes shows that the
> data is now mostly placed on node0.  Unfortunately, the variable rates still remain, as one would expect if it is an OSS caching issue, but even the poor-performing runs are now faster than before.

Have you ever looked at the Linux vm.zone_reclaim_mode parameter?  I have seen some slowdowns on Lustre servers before when this parameter was set to the default value (which I think is “1”), but those issues largely went away when I changed it to “0”.  Since this parameter causes the system to prefer allocating memory in the same zone as the process, I was seeing memory usage in one zone nearly full while there was still lots of free memory in other zones.  So processes would block waiting for the kernel to free up memory in their zone even though there was free memory elsewhere.  I wonder if that parameter could also have an effect on the client side.  Would the single-threaded nature of “dd” cause all the cached data to land in one zone?  If so, is the IO blocking while the client tries to clear up space in that zone to cache new incoming data?
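
A rough way to check whether one node fills up while another still has free memory is to compare MemFree and FilePages across the per-node meminfo files mentioned above. The following is only a minimal Python sketch, not something taken from the original job: the helper names (parse_node_meminfo, read_all_nodes) are made up for illustration, and it assumes the usual "Node N Field: value kB" line layout of those files.

```python
import glob
import re

# Matches lines such as "Node 0 FilePages:   123456 kB".
_LINE = re.compile(r"^Node\s+(\d+)\s+([^:]+):\s+(\d+)")


def parse_node_meminfo(text):
    """Return {field_name: value} for the text of one node's meminfo file.

    Values are the integers as reported by the kernel (kB for most fields).
    """
    fields = {}
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            fields[match.group(2)] = int(match.group(3))
    return fields


def read_all_nodes(pattern="/sys/devices/system/node/node[0-9]*/meminfo"):
    """Return {path: parsed_fields} for every NUMA node found under sysfs."""
    result = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            result[path] = parse_node_meminfo(handle.read())
    return result


if __name__ == "__main__":
    # Print free memory and page-cache usage per node for a quick comparison.
    for path, fields in read_all_nodes().items():
        print(path,
              "MemFree:", fields.get("MemFree"), "kB",
              "FilePages:", fields.get("FilePages"), "kB")
```

Running this in a loop while the dd is in progress would show whether FilePages grows on one node only and whether MemFree on that node approaches zero while the other node stays mostly free.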

If you have the ability to modify the vm.zone_reclaim_mode parameter (or have an admin do it for you), then it might be worth looking at.
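
Reading the current value normally needs no special privileges; only changing it does. Below is a minimal Python sketch for checking it on a client, assuming the standard /proc/sys/vm/zone_reclaim_mode path and the bit meanings given in the kernel's vm sysctl documentation (1 = zone reclaim on, 2 = zone reclaim writes dirty pages out, 4 = zone reclaim swaps pages). The function names are hypothetical.

```python
from pathlib import Path

# Standard procfs location of the sysctl (assumed to exist on NUMA-aware kernels).
ZONE_RECLAIM_PATH = Path("/proc/sys/vm/zone_reclaim_mode")

# The value is a bitmask; these labels follow the kernel's vm sysctl documentation.
_BITS = [
    (1, "zone reclaim on"),
    (2, "zone reclaim writes dirty pages out"),
    (4, "zone reclaim swaps pages"),
]


def describe_zone_reclaim_mode(value):
    """Return the list of behaviours enabled by a zone_reclaim_mode bitmask."""
    return [label for bit, label in _BITS if value & bit]


def read_zone_reclaim_mode(path=ZONE_RECLAIM_PATH):
    """Return the current setting as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except OSError:
        return None


if __name__ == "__main__":
    mode = read_zone_reclaim_mode()
    if mode is None:
        print("zone_reclaim_mode is not available on this system")
    else:
        flags = describe_zone_reclaim_mode(mode) or ["off"]
        print(f"vm.zone_reclaim_mode = {mode} ({'; '.join(flags)})")
```

If the value turns out to be non-zero, an admin could set it to 0 with "sysctl -w vm.zone_reclaim_mode=0" and the dd test could be repeated.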

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu