[lustre-devel] OSC cache page retention issue
John Bauer
bauerj at iodoctors.com
Fri Apr 26 08:26:12 PDT 2024
My first posting here. I hope it proves to be interesting.
I have an issue that I had posted to lustre-discuss some time back, and
Andreas Dilger recommended I post it to lustre-devel. I have a job that
does I/O to 3 files on a Lustre file system. The first file, SCRATCH,
is written and read in the first 50 seconds of the job, then not touched
again until rtc=2950 (rtc is real time clock, in seconds). The second
file, SCR300, is written and read throughout the job. I attach a couple
of images below to illustrate the activity. In each image, the bottom
frame shows how the file is being accessed, reads and writes, over
time: the Y axis indicates the part of the file being accessed and the
X axis the time when the access occurred. The top frame in each
image is the amount of page cache used by each OSC that the given file
is striped on (4x1M), versus time. The third file is trivial (26MB
in size, 41MB transferred) and will not be mentioned further.
What is of interest is that the 4 OSCs associated with SCRATCH do not
give up their cache page memory until rtc=2918, despite the file not
being accessed from rtc=50 until rtc=2950, and especially in light of
the intense memory pressure created by the reading and writing of the
SCR300 file.
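For anyone wanting to reproduce the top-frame numbers: a minimal sketch of how the client-side cache usage can be sampled, assuming the usual one-"key: value"-per-line layout of `lctl get_param llite.*.max_cached_mb` output (the sample text and device name below are hypothetical, not taken from my run):

```python
import re

def parse_used_mb(text):
    """Extract the used_mb value from `lctl get_param llite.*.max_cached_mb`
    output, which prints one "key: value" pair per line."""
    m = re.search(r"^used_mb:\s*(\d+)", text, re.MULTILINE)
    return int(m.group(1)) if m else None

# Hypothetical sample output; in practice this text would come from
# running: lctl get_param llite.*.max_cached_mb
sample = """llite.scratch-ffff8808.max_cached_mb=
users: 5
max_cached_mb: 31879
used_mb: 22000
unused_mb: 9879
reclaim_count: 0
"""

print(parse_used_mb(sample))  # → 22000
```

Polling that value once a second for the life of the job is enough to reconstruct the cache-usage-versus-time curves shown in the images.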
Thinking this was virtual-memory related, I decided to plot all of the
values in /proc/vmstat versus time while the job was running, to see if
something of interest was occurring at rtc=2918. It turns out that the
only thing of interest is that slabs_scanned is constant until
rtc=2918, then spikes, as shown in the 3rd image below.
For completeness, I have also included a plot of several /proc/meminfo
values versus time (the 4th image below).
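The per-counter time series can be collected with a small poller along these lines; a sketch, assuming only the standard "name value" per-line format of /proc/vmstat (/proc/meminfo would additionally need its trailing ':' and 'kB' unit handled):

```python
import time

def read_counters(path="/proc/vmstat"):
    """Parse a 'name value' per-line file such as /proc/vmstat
    into a dict mapping counter name to integer value."""
    counters = {}
    with open(path) as f:
        for line in f:
            name, value = line.split()[:2]
            counters[name] = int(value)
    return counters

def sample(key="slabs_scanned", interval=1.0, duration=10.0):
    """Record (elapsed_seconds, counter_value) pairs for one counter."""
    series, start = [], time.time()
    while time.time() - start < duration:
        series.append((time.time() - start, read_counters()[key]))
        time.sleep(interval)
    return series
```

Running `sample()` alongside the job and plotting the resulting pairs is all that was needed to spot the slabs_scanned spike at rtc=2918.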
This was run on a dedicated 64GB node. It can be seen in the meminfo
plot that Lustre is using, at maximum, 50% of system memory. The 4
OSCs of the SCRATCH file are holding an aggregate 22GB of the 32GB
maximum used by Lustre, leaving the SCR300 file only 10GB of memory
for repeated forward and backward reading of a 70GB region of the
104GB file. SCR300 could really use the 22GB being held for SCRATCH.
Lustre version is 2.14.0_ddn98.
Single process, single host (2 NUMA nodes).
For completeness I have included a plot of memory info for each of the 2
NUMA nodes (5th image below). It can clearly be seen that memory on
both NUMA nodes is active.
I should point out that this is not a new problem, nor unique to the
system I am currently running it on. I first observed it back in 2015
on an internal system at Cray Inc. A plot of the SCRATCH file position
activity is shown in the 6th image below. That was the actual
MSC.Nastran job running, with 13150 seconds of I/O wait in an
18213-second run. In that job, the 4 OSCs for the SCRATCH file never
released their memory, despite the file not being accessed for 18000
seconds (5 hours).
Any light that could be cast upon this subject is welcomed.
John
The SCRATCH file activity and OSC memory usage
https://www.dropbox.com/scl/fi/mg55d6xv6hanmsebn4kvq/floorpan_SCRATCH.png?rlkey=rjkczx394n9mg54xtos87s0wd&st=zx56loc2&dl=0
The SCR300 file activity and OSC memory usage
https://www.dropbox.com/scl/fi/e8s4ku7zb2h6ndcwpsr3g/floorpan_SCR300.png?rlkey=elxv51uqjxv7pqwj6w3o4tmew&st=1yapxmen&dl=0
The slabs_scanned from /proc/vmstat versus time.
https://www.dropbox.com/scl/fi/qwg2uml7hpi204cazfynw/floorpan_slabs_scanned.png?rlkey=t4bhpobhwprhtlecbfsz1sy09&st=mjepc6uw&dl=0
Info from /proc/meminfo plotted versus time.
https://www.dropbox.com/scl/fi/ddd99mrt26hto4fp81ayn/floorpan_meminfo.png?rlkey=401jpennt2u3343yjp2llaiqj&st=09qaagpk&dl=0
Memory info for the 2 numa nodes versus time.
https://www.dropbox.com/scl/fi/g9auqkhzhhmp11m08hstl/floorpan_nodemem.png?rlkey=bzbmjkvb6b1lgrgj4h9xxvbny&st=ihhu4c94&dl=0
This is from the original job run back in 2015, internally at Cray Inc.
This was a 128GB system.
https://www.dropbox.com/scl/fi/5bwyqxjp8gi65c7qvnmcn/floorpan_orig.png?rlkey=j629zrdnuwm4lwiqv9wh6mxrt&st=4bo6c1jy&dl=0
The SCR300 file for the original job back in 2015
https://www.dropbox.com/scl/fi/sr969igg0c1hvzn0kaq47/floorpan_orig_SCR300.png?rlkey=q4rg34s78ytxal4dx4k3b26rd&st=o2kzxa2z&dl=0