[lustre-discuss] Huge amounts of reads caused by shared library access

Laifer, Roland (SCC) roland.laifer at kit.edu
Thu Sep 12 03:41:45 PDT 2024


Dear Lustre admins,

I wanted to share an issue which we have been seeing for about two 
years. Maybe the issue also exists at your site, or you can provide 
hints on how it can be alleviated.

The issue is that we have huge amounts of read operations on servers 
which seem to be caused by shared libraries stored on Lustre. Apparently 
the Lustre client cache does not work here as expected for many 
different applications. Note that we have installed most software 
packages on Lustre; if you don't do that, you might not be affected.

Of course we reported the issue to DDN support a long time ago. 
They found an issue which might be causing it, see 
https://jira.whamcloud.com/browse/LU-17463. But the patch has been 
under development for many months and I'm not sure if it will really 
fix it.

Some more details:

The affected system has nearly 1000 nodes, is used by more than 1000 
active users and there are many small jobs which share the same node. 
The Lustre version on clients and servers is 2.12.9 with patches from 
DDN. The issue is currently causing multiple GB/s of throughput and 
more than 100 K IOPS on the affected file system.

With Lustre jobstats we saw that some jobs were generating hundreds of 
millions of read operations. Other similar jobs did not have the issue, 
i.e. the problem is not easily reproducible. We have a complicated 
reproducer which works in most cases, even on our test system.
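
For readers who want to look for similar jobs at their own site, here 
is a minimal sketch of how jobs could be ranked by read operations from 
jobstats output. It is not the procedure used above. It assumes the 
YAML-like layout that job_stats files commonly have, where the 
"samples" value of read_bytes is the number of read calls. The sample 
text, job IDs and the function names read_ops_per_job and top_readers 
are hypothetical.

```python
import re

# Hypothetical sample in the YAML-like layout commonly seen in job_stats
# files; the job IDs and counters are made up for illustration.
SAMPLE = """\
job_stats:
- job_id:          4711
  snapshot_time:   1726137705
  read_bytes:      { samples: 250000000, unit: bytes, min: 4096, max: 4096, sum: 1024000000000 }
  write_bytes:     { samples: 12, unit: bytes, min: 4096, max: 1048576, sum: 6291456 }
- job_id:          4712
  snapshot_time:   1726137705
  read_bytes:      { samples: 1500, unit: bytes, min: 4096, max: 1048576, sum: 734003200 }
  write_bytes:     { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 }
"""

JOB_RE = re.compile(r"^- job_id:\s+(\S+)")
READ_RE = re.compile(r"^\s+read_bytes:\s+\{\s*samples:\s*(\d+)")


def read_ops_per_job(text):
    """Return {job_id: read operation count}, summed over all entries.

    Summing means the output of several targets can simply be
    concatenated before it is passed in.
    """
    ops = {}
    job = None
    for line in text.splitlines():
        m = JOB_RE.match(line)
        if m:
            job = m.group(1)
            continue
        m = READ_RE.match(line)
        if m and job is not None:
            ops[job] = ops.get(job, 0) + int(m.group(1))
    return ops


def top_readers(text, n=10):
    """Return the n jobs with the most read operations, largest first."""
    ops = read_ops_per_job(text)
    return sorted(ops.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    for job, count in top_readers(SAMPLE):
        print(f"{job:>12} {count:>15} read ops")
```

On a real system the text would come from the job_stats files on the 
servers instead of SAMPLE. The exact field layout may differ between 
Lustre versions, so the regular expressions may need adjusting.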

Several users reported that they were only using software on the 
affected file system. The command "lctl get_param 
llite.<fs_name>*.stats" showed huge amounts of page_fault entries and 
there were indeed many page faults for shared libraries stored on the 
affected file system.
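
As an illustration of the check described above, the following sketch 
extracts the page_fault counter per mounted file system from the text 
printed by "lctl get_param llite.*.stats". It assumes that each stats 
block starts with a "<parameter name>=" line and that counters are 
printed as "name count samples [unit]". The sample output, the file 
system names lfs1 and lfs2 and the function name page_faults_per_mount 
are hypothetical.

```python
# Hypothetical sample of "lctl get_param llite.*.stats" output; the
# file system names and all counter values are made up.
SAMPLE = """\
llite.lfs1-ffff8f1234567000.stats=
snapshot_time             1726137705.123456 secs.usecs
read_bytes                8123456 samples [bytes] 1 4194304 33285996544
open                      90210 samples [regs]
page_fault                412345678 samples [regs]
llite.lfs2-ffff8f7654321000.stats=
snapshot_time             1726137705.654321 secs.usecs
open                      1200 samples [regs]
page_fault                3400 samples [regs]
"""

SUFFIX = ".stats="


def page_faults_per_mount(text):
    """Return {llite instance name: page_fault count}.

    Instances that report no page_fault line are left out.
    """
    counts = {}
    mount = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith(SUFFIX):
            # Start of a new stats block, e.g. "llite.lfs1-<id>.stats="
            mount = line[: -len(SUFFIX)]
            continue
        fields = line.split()
        if mount is not None and len(fields) >= 2 and fields[0] == "page_fault":
            counts[mount] = int(fields[1])
    return counts


if __name__ == "__main__":
    for mount, faults in page_faults_per_mount(SAMPLE).items():
        print(f"{mount}: {faults} page faults")
```

The counters are cumulative, so comparing two snapshots taken some 
time apart gives the page fault rate for each file system.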

We also had discussions with another site where Lustre is provided by 
another vendor, and they are seeing the same issue.

Regards,
   Roland