[lustre-discuss] Lustre caching and NUMA nodes

John Bauer bauerj at iodoctors.com
Tue Dec 5 22:30:16 PST 2023


Andreas,

Thanks for the reply.

Client version is 2.14.0_ddn98. Here is the *write_RPCs_in_flight* plot, 
with a snapshot taken every 50ms.  The max for any of the samples for any 
of the OSCs was 1.  No RPCs were in flight while the OSCs were dumping 
memory.  The number following the OSC name in the legend is the sum of 
*write_RPCs_in_flight* over all the intervals.  To be honest, I have 
never really looked at the RPCs-in-flight numbers.  I'm running as a 
lowly user, so I don't have access to any of the server data, which means 
I have nothing on osd-ldiskfs.*.brw_stats.
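
In case it helps anyone look at the same counters, here is a rough Python 
sketch of how the per-OSC values can be polled every 50ms with 
"lctl get_param".  The "write RPCs in flight" and "pending write pages" 
field names follow the usual osc.*.rpc_stats layout, but the exact output 
format can differ between Lustre versions, so treat the parsing as an 
assumption rather than something portable.

    #!/usr/bin/env python3
    # Poll "write RPCs in flight" and "pending write pages" for every OSC
    # every 50ms.  Field names follow the classic osc.*.rpc_stats layout
    # and may differ between Lustre versions -- a sketch, not a tool.
    import re
    import subprocess
    import time

    INTERVAL = 0.05  # 50ms, matching the snapshot rate used for the plot
    FIELDS = ("write RPCs in flight", "pending write pages")

    def sample():
        out = subprocess.run(["lctl", "get_param", "osc.*.rpc_stats"],
                             capture_output=True, text=True,
                             check=True).stdout
        stats, osc = {}, None
        for line in out.splitlines():
            m = re.match(r"osc\.(.+)\.rpc_stats=", line)
            if m:
                # Start of a new OSC's stats block.
                osc = m.group(1)
                stats[osc] = {}
            elif osc and ":" in line:
                key, _, val = line.partition(":")
                if key.strip() in FIELDS:
                    stats[osc][key.strip()] = int(val.split()[0])
        return stats

    if __name__ == "__main__":
        while True:
            now = time.time()
            for osc, vals in sample().items():
                print(f"{now:.3f} {osc} {vals}")
            time.sleep(INTERVAL)

Depending on the kernel and Lustre version these stats may live under 
/proc/fs/lustre or debugfs, so the lctl call may need different 
permissions on some clients.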

I should also point out that the backing storage on the servers is SSD, 
so I would think committing to storage on the server side should be 
pretty quick.

I'm trying to get a handle on how the Linux buffer cache works. 
Everything I find on the web is pretty old; here's an article from 2012: 
https://lwn.net/Articles/495543/

Can someone point me to something more current, and perhaps Lustre related?
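
In the meantime, for watching where the page cache actually lands, here 
is a rough sketch that polls /sys/devices/system/node/node*/meminfo (a 
standard kernel interface) and prints per-node MemFree, FilePages, Dirty, 
and Writeback.  The choice of fields and the 50ms interval are just there 
to line up with the sampling above; nothing here is Lustre-specific.

    #!/usr/bin/env python3
    # Poll per-NUMA-node memory counters from sysfs to see which node the
    # page cache (FilePages), dirty pages, and free memory sit on over time.
    import glob
    import re
    import time

    FIELDS = ("MemFree", "FilePages", "Dirty", "Writeback")

    def node_meminfo():
        per_node = {}
        for path in sorted(glob.glob("/sys/devices/system/node/node*/meminfo")):
            node = re.search(r"node(\d+)", path).group(1)
            vals = {}
            with open(path) as f:
                for line in f:
                    # Lines look like: "Node 0 MemFree:  12345678 kB"
                    parts = line.split()
                    key = parts[2].rstrip(":")
                    if key in FIELDS:
                        vals[key] = int(parts[3])   # value is in kB
            per_node[node] = vals
        return per_node

    if __name__ == "__main__":
        while True:
            now = time.time()
            for node, vals in node_meminfo().items():
                cols = " ".join(f"{k}={vals.get(k, 0)}kB" for k in FIELDS)
                print(f"{now:.3f} node{node} {cols}")
            time.sleep(0.05)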

As for images, I think the list server strips them.  In previous 
postings, when I included images, the copy the list server broadcast 
back out had the images stripped.  I'll include the images and also a 
link to the image on Dropbox.

Thanks again,

John

https://www.dropbox.com/scl/fi/fgmz4wazr6it9q2aeo0mb/write_RPCs_in_flight.png?rlkey=d3ri2w2n7isggvn05se4j3a6b&dl=0


On 12/5/23 22:33, Andreas Dilger wrote:
>
> On Dec 4, 2023, at 15:06, John Bauer <bauerj at iodoctors.com> wrote:
>>
>> I have an OSC caching question.  I am running a dd process which 
>> writes an 8GB file.  The file is on lustre, striped 8x1M. This is run 
>> on a system that has 2 NUMA nodes (cpu sockets). All the data is 
>> apparently stored on one NUMA node (node1 in the plot below) until 
>> node1 runs out of free memory.  Then it appears that dd comes to a 
>> stop (no more writes complete) until lustre dumps the data from the 
>> node1.  Then dd continues writing, but now the data is stored on the 
>> second NUMA node, node0.  Why does lustre go to the trouble of 
>> dumping node1 and then not use node1's memory, when there was always 
>> plenty of free memory on node0?
>>
>> I'll forego the explanation of the plot.  Hopefully it is clear 
>> enough.  If someone has questions about what the plot is depicting, 
>> please ask.
>>
>> https://www.dropbox.com/scl/fi/pijgnnlb8iilkptbeekaz/dd.png?rlkey=3abonv5tx8w5w5m08bn24qb7x&dl=0
>
> Hi John,
> thanks for your detailed analysis.  It would be good to include the 
> client kernel and Lustre version in this case, as the page cache 
> behaviour can vary dramatically between different versions.
>
> The allocation of the page cache pages may actually be out of the 
> control of Lustre, since they are typically being allocated by the 
> kernel VM affine to the core where the process that is doing the IO is 
> running.  It may be that the "dd" is rescheduled to run on node0 
> during the IO, since the ptlrpcd threads will be busy processing all 
> of the RPCs during this time, and then dd will start allocating pages 
> from node0.
>
> That said, it isn't clear why the client doesn't start flushing the 
> dirty data from cache earlier.  Is it actually sending the data to the 
> OSTs, but then waiting for the OSTs to reply that the data has been 
> committed to the storage before dropping the cache?
>
> It would be interesting to plot the 
> osc.*.rpc_stats::write_rpcs_in_flight and ::pending_write_pages to see 
> if the data is already in flight.  The osd-ldiskfs.*.brw_stats on the 
> server would also be useful to graph over the same period, if possible.
>
> It *does* look like the "node1 dirty" is kept at a low value for the 
> entire run, so it at least appears that RPCs are being sent, but there 
> is no page reclaim triggered until memory is getting low.  Doing page 
> reclaim is really the kernel's job, but it seems possible that the 
> Lustre client may not be suitably notifying the kernel about the dirty 
> pages and kicking it in the butt earlier to clean up the pages.
>
> PS: my preference would be to just attach the image to the email 
> instead of hosting it externally, since it is only 55 KB.  Is this 
> blocked by the list server?
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud
>
[Attachments from the original message, preserved by the list archive:]
FHzButkpRzFh7gBX.png (11967 bytes): http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20231206/c9d8e47a/attachment-0002.png
mH0VmbLkCsaCb2QI.png (54659 bytes): http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20231206/c9d8e47a/attachment-0003.png

