<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Peter,</p>
<p>A delayed reply to one more of your questions, "What makes you
think 'lustre' is doing that?" I had to make another run and
gather OSC stats on all of the Lustre file systems mounted on the
host where I run dd. <br>
</p>
<p>This host mounts 12 Lustre file systems comprising 507 OSTs in
total. While dd was running, I recorded the amount of cached data
associated with each of the 507 OSCs; that is shown in the bottom
frame of the image below. Note in the top frame that the host
always had about 5GB of free memory and 50GB of cached data. I
believe this has to be a Lustre issue, because the Linux page
cache has no knowledge that a page is a Lustre page. How is it
that every OSC, on all 12 file systems on the host, has its
cached data dropped to 0, while the other 50GB of cached data on
the host remains? It is as though drop_caches were being run
against only the Lustre file systems, yet my googling around
finds no feature in drop_caches that allows file-system-specific
dropping. Is there some tunable that makes Lustre pages more
likely to be evicted than other cached data?<br>
</p>
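<p>For anyone who wants to gather the same data, a loop along
these lines will do it. I am using the osc.*.osc_cached_mb
parameter here, which may be named differently on older clients,
so treat this as a sketch rather than the literal script behind
the plot.<br>
</p>
<pre>#!/bin/bash
# Sample, once per second, the cached data held by every OSC on this
# client; the timestamps let the samples line up with the dd run.
# NOTE: osc.*.osc_cached_mb is assumed here. If your client version does
# not expose it, llite.*.max_cached_mb shows per-filesystem cache usage.
while true; do
    date +%s.%N
    lctl get_param osc.*.osc_cached_mb 2>/dev/null
    sleep 1
done</pre>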
<p>Another subtle point of interest: dd's writing resumes, as
reflected in the growth of cached data on the 8 OSCs it writes
through, before all of the other OSCs have finished dumping. This
is most visible around 2.1 seconds into the run. Also different
this time is that the dumping happened 3 times over the course of
the 10-second run, instead of just once as in the previous run I
referenced, costing this dd run 1.2 seconds.<br>
</p>
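<p>For completeness, a run of the same shape can be reproduced
with something like the commands below. The path, block size, and
count are illustrative placeholders, not the literal commands from
my runs; the essentials are only the 8x1M striping and the 8GB
write.<br>
</p>
<pre># Create a file striped across 8 OSTs with a 1MiB stripe size,
# then write 8GiB into it. /mnt/lustre/testfile is a placeholder path.
lfs setstripe -c 8 -S 1M /mnt/lustre/testfile
dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=8192</pre>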
<p>John<br>
</p>
<img moz-do-not-send="false"
src="cid:part1.t8425yzB.JCdc4PN8@iodoctors.com"
title="https://www.dropbox.com/scl/fi/augs88r7lcdfd6wb7nwrf/pfe27_allOSC_cached.png?rlkey=ynaw60yknwmfjavfy5gxsuk76&dl=0"
alt="" width="746" height="536">
<p><br>
</p>
<div class="moz-cite-prefix">On 12/6/23 14:24,
<a class="moz-txt-link-abbreviated" href="mailto:lustre-discuss-request@lists.lustre.org">lustre-discuss-request@lists.lustre.org</a> wrote:<br>
</div>
<blockquote type="cite"
cite="mid:mailman.29114.1701894246.137413.lustre-discuss-lustre.org@lists.lustre.org">
<pre class="moz-quote-pre" wrap="">Send lustre-discuss mailing list submissions to
<a class="moz-txt-link-abbreviated" href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.org</a>
To subscribe or unsubscribe via the World Wide Web, visit
<a class="moz-txt-link-freetext" href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a>
or, via email, send a message with subject or body 'help' to
<a class="moz-txt-link-abbreviated" href="mailto:lustre-discuss-request@lists.lustre.org">lustre-discuss-request@lists.lustre.org</a>
You can reach the person managing the list at
<a class="moz-txt-link-abbreviated" href="mailto:lustre-discuss-owner@lists.lustre.org">lustre-discuss-owner@lists.lustre.org</a>
When replying, please edit your Subject line so it is more specific
than "Re: Contents of lustre-discuss digest..."
Today's Topics:
1. Coordinating cluster start and shutdown? (Jan Andersen)
2. Re: Lustre caching and NUMA nodes (Peter Grandi)
3. Re: Coordinating cluster start and shutdown?
(Bertschinger, Thomas Andrew Hjorth)
4. Lustre server still try to recover the lnet reply to the
depreciated clients (Huang, Qiulan)
----------------------------------------------------------------------
Message: 1
Date: Wed, 6 Dec 2023 10:27:11 +0000
From: Jan Andersen <a class="moz-txt-link-rfc2396E" href="mailto:jan@comind.io"><jan@comind.io></a>
To: lustre <a class="moz-txt-link-rfc2396E" href="mailto:lustre-discuss@lists.lustre.org"><lustre-discuss@lists.lustre.org></a>
Subject: [lustre-discuss] Coordinating cluster start and shutdown?
Message-ID: <a class="moz-txt-link-rfc2396E" href="mailto:696fac02-df18-4fe1-967c-02c3bca425d3@comind.io"><696fac02-df18-4fe1-967c-02c3bca425d3@comind.io></a>
Content-Type: text/plain; charset=UTF-8; format=flowed
Are there any tools for coordinating the start and shutdown of lustre filesystem, so that the OSS systems don't attempt to mount disks before the MGT and MDT are online?
------------------------------
Message: 2
Date: Wed, 6 Dec 2023 12:40:54 +0000
From: <a class="moz-txt-link-abbreviated" href="mailto:pg@lustre.list.sabi.co.UK">pg@lustre.list.sabi.co.UK</a> (Peter Grandi)
To: list Lustre discussion <a class="moz-txt-link-rfc2396E" href="mailto:lustre-discuss@lists.Lustre.org"><lustre-discuss@lists.Lustre.org></a>
Subject: Re: [lustre-discuss] Lustre caching and NUMA nodes
Message-ID: <a class="moz-txt-link-rfc2396E" href="mailto:25968.27606.536270.208882@petal.ty.sabi.co.uk"><25968.27606.536270.208882@petal.ty.sabi.co.uk></a>
Content-Type: text/plain; charset=iso-8859-1
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">I have a an OSC caching question.? I am running a dd process
which writes an 8GB file.? The file is on lustre, striped
8x1M.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
How the Lustre instance servers store the data may not have a
huge influence on what happens in the client's system buffer
cache.
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">This is run on a system that has 2 NUMA nodes (? cpu sockets).
[...] Why does lustre go to the trouble of dumping node1 and
then not use node1's memory, when there was always plenty of
free memory on node0?
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
What makes you think "lustre" is doing that?
Are you aware of the values of the flusher settings such as
'dirty_bytes', 'dirty_ratio', 'dirty_expire_centisecs'?
Have you considered looking at NUMA policies e.g. as described
in 'man numactl'?
Also while you surely know better I usually try to avoid
buffering large amounts of to-be-written data in RAM (whether on
the OSC or the OSS), and to my taste 8GiB "in-flight" is large.
------------------------------
Message: 3
Date: Wed, 6 Dec 2023 16:00:38 +0000
From: "Bertschinger, Thomas Andrew Hjorth" <a class="moz-txt-link-rfc2396E" href="mailto:bertschinger@lanl.gov"><bertschinger@lanl.gov></a>
To: Jan Andersen <a class="moz-txt-link-rfc2396E" href="mailto:jan@comind.io"><jan@comind.io></a>, lustre
<a class="moz-txt-link-rfc2396E" href="mailto:lustre-discuss@lists.lustre.org"><lustre-discuss@lists.lustre.org></a>
Subject: Re: [lustre-discuss] Coordinating cluster start and shutdown?
Message-ID:
<a class="moz-txt-link-rfc2396E" href="mailto:PH8PR09MB103611A4B55E420410AE14ABAAB84A@PH8PR09MB10361.namprd09.prod.outlook.com"><PH8PR09MB103611A4B55E420410AE14ABAAB84A@PH8PR09MB10361.namprd09.prod.outlook.com></a>
Content-Type: text/plain; charset="iso-8859-1"
Hello Jan,
You can use the Pacemaker / Corosync high-availability software stack for this: specifically, ordering constraints [1] can be used.
Unfortunately, Pacemaker is probably over-the-top if you don't need HA -- its configuration is complex and difficult to get right, and it significantly complicates system administration. One downside of Pacemaker is that it is not easy to decouple the Pacemaker service from the Lustre services, meaning if you stop the Pacemaker service, it will try to stop all of the Lustre services. This might make it inappropriate for use cases that don't involve HA.
Given those downsides, if others in the community have suggestions on simpler means to accomplish this, I'd love to see other tools that can be used here (especially officially supported ones, if they exist).
[1] <a class="moz-txt-link-freetext" href="https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/html/constraints.html#specifying-the-order-in-which-resources-should-start-stop">https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/html/constraints.html#specifying-the-order-in-which-resources-should-start-stop</a>
- Thomas Bertschinger
------------------------------
Message: 4
Date: Wed, 6 Dec 2023 20:23:11 +0000
From: "Huang, Qiulan" <a class="moz-txt-link-rfc2396E" href="mailto:qhuang@bnl.gov"><qhuang@bnl.gov></a>
To: <a class="moz-txt-link-rfc2396E" href="mailto:lustre-discuss@lists.lustre.org">"lustre-discuss@lists.lustre.org"</a>
<a class="moz-txt-link-rfc2396E" href="mailto:lustre-discuss@lists.lustre.org"><lustre-discuss@lists.lustre.org></a>
Cc: "Huang, Qiulan" <a class="moz-txt-link-rfc2396E" href="mailto:qhuang@bnl.gov"><qhuang@bnl.gov></a>
Subject: [lustre-discuss] Lustre server still try to recover the lnet
reply to the depreciated clients
Message-ID:
<a class="moz-txt-link-rfc2396E" href="mailto:BLAPR09MB685012E0F741E1B98F65F8C6CE84A@BLAPR09MB6850.namprd09.prod.outlook.com"><BLAPR09MB685012E0F741E1B98F65F8C6CE84A@BLAPR09MB6850.namprd09.prod.outlook.com></a>
Content-Type: text/plain; charset="iso-8859-1"
Hello all,
We removed some clients two weeks ago, but we see that the Lustre server is still trying to handle the LNet recovery reply to those clients (the error log is posted below), and they are still listed in the exports dir.
I tried to evict the clients, but it failed with the error "no exports found":
lctl set_param mdt.*.evict_client=10.68.178.25@tcp
Do you know how to clean up the removed clients? Any suggestions would be greatly appreciated.
For example:
[root@mds2 ~]# ll /proc/fs/lustre/mdt/data-MDT0000/exports/10.67.178.25@tcp/
total 0
-r--r--r-- 1 root root 0 Dec 5 15:41 export
-r--r--r-- 1 root root 0 Dec 5 15:41 fmd_count
-r--r--r-- 1 root root 0 Dec 5 15:41 hash
-rw-r--r-- 1 root root 0 Dec 5 15:41 ldlm_stats
-r--r--r-- 1 root root 0 Dec 5 15:41 nodemap
-r--r--r-- 1 root root 0 Dec 5 15:41 open_files
-r--r--r-- 1 root root 0 Dec 5 15:41 reply_data
-rw-r--r-- 1 root root 0 Aug 14 10:58 stats
-r--r--r-- 1 root root 0 Dec 5 15:41 uuid
/var/log/messages:Dec 6 12:50:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
/var/log/messages:Dec 6 13:05:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 13:05:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
/var/log/messages:Dec 6 13:20:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 13:20:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
/var/log/messages:Dec 6 13:35:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 13:35:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
/var/log/messages:Dec 6 13:50:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 13:50:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
/var/log/messages:Dec 6 14:05:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 14:05:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
/var/log/messages:Dec 6 14:20:16 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110
/var/log/messages:Dec 6 14:20:16 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message
/var/log/messages:Dec 6 14:30:17 mds2 kernel: LNetError: 3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.176.25@tcp) recovery failed with -111
/var/log/messages:Dec 6 14:30:17 mds2 kernel: LNetError: 3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 3 previous similar messages
/var/log/messages:Dec 6 14:47:14 mds2 kernel: LNetError: 3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.176.25@tcp) recovery failed with -111
/var/log/messages:Dec 6 14:47:14 mds2 kernel: LNetError: 3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 8 previous similar messages
/var/log/messages:Dec 6 15:02:14 mds2 kernel: LNetError: 3817248:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.176.25@tcp) recovery failed with -111
Regards,
Qiulan
------------------------------
</pre>
</blockquote>
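<p>As an aside, regarding the flusher settings and NUMA policy
Peter asks about in the quoted digest above: they can be inspected
with the generic commands below. This is only a sketch of how to
look them up, not output from the host in question.<br>
</p>
<pre># Writeback (flusher) settings referenced above; the values shown are
# whatever the host has configured, not recommendations.
sysctl vm.dirty_bytes vm.dirty_ratio vm.dirty_background_ratio \
       vm.dirty_expire_centisecs

# NUMA topology and the current policy for this shell.
numactl --hardware
numactl --show</pre>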
</body>
</html>