<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Peter,</p>

    <p>I've been reading the Baeldung pages, among others, to gain some

      insight into Linux buffer cache behavior. <br>

    </p>

    <pre class="moz-quote-pre" wrap=""><a class="moz-txt-link-freetext" href="https://www.baeldung.com/linux/file-system-caching">https://www.baeldung.com/linux/file-system-caching</a> and <a class="moz-txt-link-freetext" href="https://docs.kernel.org/admin-guide/sysctl/vm.html">https://docs.kernel.org/admin-guide/sysctl/vm.html</a>

<font size="4">As can be seen in the first image below, Lustre has no trouble keeping up with the dirty pages.  Dirty pages never exceed 400MB on a 64GB system, well under 1%.  This dirty page data is drawn from /proc/meminfo while dd is running. Here are some of the vm dirty settings.

</font></pre>

    <p><font face="Courier New, Courier, monospace">vm.dirty_background_bytes

        = 0<br>

        vm.dirty_background_ratio = 10<br>

        vm.dirty_bytes = 0<br>

        vm.dirty_expire_centisecs = 3000<br>

        vm.dirty_ratio = 40<br>

        vm.dirty_writeback_centisecs = 500<br>

        vm.dirtytime_expire_seconds = 43200</font></p>
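
    <p>As a rough sanity check on those settings, here is a minimal sketch
      (an illustration only, not taken from the kernel or Lustre) of the
      thresholds the two ratios imply.  It assumes the ratios are applied
      to the full 64GB; the kernel documentation describes them as
      percentages of available memory (free plus reclaimable pages), so
      the real thresholds would be somewhat lower.  The helper name
      dirty_threshold_bytes is hypothetical.</p>

<pre>
```python
# Rough estimate of the writeback thresholds implied by the ratio settings.
# Assumption: ratios are applied to total RAM (64 GiB); the kernel actually
# uses available memory (free + reclaimable), so these are upper bounds.

GIB = 1024 ** 3
MIB = 1024 ** 2


def dirty_threshold_bytes(mem_bytes, ratio_percent, fixed_bytes=0):
    """Return a threshold in bytes; a non-zero *_bytes value overrides the ratio."""
    if fixed_bytes:
        return fixed_bytes
    return mem_bytes * ratio_percent // 100


total = 64 * GIB
background = dirty_threshold_bytes(total, 10)   # vm.dirty_background_ratio = 10
hard_limit = dirty_threshold_bytes(total, 40)   # vm.dirty_ratio = 40
observed = 400 * MIB                            # peak Dirty seen in /proc/meminfo

print(f"background threshold: {background / GIB:.1f} GiB")
print(f"hard limit:           {hard_limit / GIB:.1f} GiB")
print(f"observed peak dirty:  {observed / total:.2%} of RAM")
```
</pre>

    <p>Even under that generous assumption, the 400MB peak is far below
      the roughly 6.4GiB background threshold, which would be consistent
      with the pauses not being caused by these dirty limits.</p>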

    <p>I am not sure what to make of your following comment.  I should

      have stated that the dd command used for this was <b>dd

        if=/dev/zero of=pfl.dat bs=1M count=8000</b>.  I will

      also point out that I came across this behavior while debugging

      another problem and I was simply using dd to create a pfl striped

      file so I could check how the file was laid out on the OSTs.  Over

      the course of many runs I kept noticing the pauses in the writes

      and it strikes me that the behavior is odd in that there is

      typically a significant amount of inactive file pages and free

      memory (second image below).  I don't understand why those

      inactive file pages are not evicted, or free memory used, before

      evicting the pfl.dat pages which were just written.  What is

      driving the LRU eviction here? I should also point out that the

      cached memory is always well under the 50% limit that is

      configured as Lustre's max.<br>

    </p>

    <pre class="moz-quote-pre" wrap="">>> Also while you surely know better I usually try to avoid

>> buffering large amounts of to-be-written data in RAM (whether on

>> the OSC or the OSS), and to my taste 8GiB "in-flight" is large.</pre>

    <p><a class="moz-txt-link-freetext" href="https://www.dropbox.com/scl/fi/5seamxgscdrat1eu2t5zn/dd_swapped.png?rlkey=oyicyq2a8eeqlgohndgalisy0&dl=0">https://www.dropbox.com/scl/fi/5seamxgscdrat1eu2t5zn/dd_swapped.png?rlkey=oyicyq2a8eeqlgohndgalisy0&dl=0</a></p>

    <p><img moz-do-not-send="false"

        src="cid:part1.q044zp1f.FNC9Jylf@iodoctors.com"

        title="dd_swapped_v_rtc" alt="" width="716" height="477"></p>

    <p><br>

    </p>

    <img moz-do-not-send="false"

      src="cid:part2.dRZziu7s.d8YNfSqr@iodoctors.com"

title="https://www.dropbox.com/scl/fi/djpdmxxo7o3iz13ia62gs/dd_inactive.png?rlkey=16j7ik7hb2ohnpyn9ztyt1b23&dl=0"

      alt="" width="716" height="477"><br>
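
    <p>For anyone wanting to reproduce the plots, the sketch below shows
      one way the /proc/meminfo counters could be sampled while dd is
      running.  It is an illustration under assumptions, not the tool
      used to produce the images above: the function names and the
      choice of fields are hypothetical, and the demo text contains
      made-up numbers purely to exercise the parser.</p>

<pre>
```python
# Sketch of a /proc/meminfo sampler; field names are as they appear on Linux.
import time

FIELDS = ("MemFree", "Cached", "Dirty", "Writeback", "Inactive(file)")


def parse_meminfo(text, fields=FIELDS):
    """Return {field: kB} for the requested fields found in meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in fields:
            values[name] = int(rest.split()[0])   # first token is the value in kB
    return values


def sample(path="/proc/meminfo", interval=1.0, count=5):
    """Print one snapshot per interval; intended to run alongside dd."""
    for _ in range(count):
        with open(path) as f:
            snapshot = parse_meminfo(f.read())
        print(time.strftime("%H:%M:%S"), snapshot)
        time.sleep(interval)


# Illustrative input only (invented values), so the parser can be tried anywhere.
DEMO = """MemFree:        30000000 kB
Inactive(file):  8000000 kB
Dirty:            409600 kB
Writeback:             0 kB
"""
print(parse_meminfo(DEMO))
```
</pre>

    <p>Calling sample() on the client during the dd run would print the
      same counters once per second.</p>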

    <div class="moz-cite-prefix">On 12/6/23 14:24,

      <a class="moz-txt-link-abbreviated" href="mailto:lustre-discuss-request@lists.lustre.org">lustre-discuss-request@lists.lustre.org</a> wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:mailman.29114.1701894246.137413.lustre-discuss-lustre.org@lists.lustre.org">

      <pre class="moz-quote-pre" wrap="">

Today's Topics:

   1. Coordinating cluster start and shutdown? (Jan Andersen)

   2. Re: Lustre caching and NUMA nodes (Peter Grandi)

   3. Re: Coordinating cluster start and shutdown?

      (Bertschinger, Thomas Andrew Hjorth)

   4. Lustre server still try to recover the lnet reply to the

      depreciated clients (Huang, Qiulan)

----------------------------------------------------------------------

Message: 1

Date: Wed, 6 Dec 2023 10:27:11 +0000

From: Jan Andersen <a class="moz-txt-link-rfc2396E" href="mailto:jan@comind.io"><jan@comind.io></a>

To: lustre <a class="moz-txt-link-rfc2396E" href="mailto:lustre-discuss@lists.lustre.org"><lustre-discuss@lists.lustre.org></a>

Subject: [lustre-discuss] Coordinating cluster start and shutdown?

Message-ID: <a class="moz-txt-link-rfc2396E" href="mailto:696fac02-df18-4fe1-967c-02c3bca425d3@comind.io"><696fac02-df18-4fe1-967c-02c3bca425d3@comind.io></a>

Content-Type: text/plain; charset=UTF-8; format=flowed

Are there any tools for coordinating the start and shutdown of lustre filesystem, so that the OSS systems don't attempt to mount disks before the MGT and MDT are online?

------------------------------

Message: 2

Date: Wed, 6 Dec 2023 12:40:54 +0000

From: <a class="moz-txt-link-abbreviated" href="mailto:pg@lustre.list.sabi.co.UK">pg@lustre.list.sabi.co.UK</a> (Peter Grandi)

To: list Lustre discussion <a class="moz-txt-link-rfc2396E" href="mailto:lustre-discuss@lists.Lustre.org"><lustre-discuss@lists.Lustre.org></a>

Subject: Re: [lustre-discuss] Lustre caching and NUMA nodes

Message-ID: <a class="moz-txt-link-rfc2396E" href="mailto:25968.27606.536270.208882@petal.ty.sabi.co.uk"><25968.27606.536270.208882@petal.ty.sabi.co.uk></a>

Content-Type: text/plain; charset=iso-8859-1

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">I have an OSC caching question.  I am running a dd process

which writes an 8GB file.  The file is on lustre, striped

8x1M.

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

How the Lustre instance servers store the data may not have a

huge influence on what happens in the client's system buffer

cache.

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">This is run on a system that has 2 NUMA nodes (cpu sockets).

[...] Why does lustre go to the trouble of dumping node1 and

then not use node1's memory, when there was always plenty of

free memory on node0?

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">

What makes you think "lustre" is doing that?

Are you aware of the values of the flusher settings such as

'dirty_bytes', 'dirty_ratio', 'dirty_expire_centisecs'?

Have you considered looking at NUMA policies e.g. as described

in 'man numactl'?

Also while you surely know better I usually try to avoid

buffering large amounts of to-be-written data in RAM (whether on

the OSC or the OSS), and to my taste 8GiB "in-flight" is large.

------------------------------

Message: 3

Date: Wed, 6 Dec 2023 16:00:38 +0000

From: "Bertschinger, Thomas Andrew Hjorth" <a class="moz-txt-link-rfc2396E" href="mailto:bertschinger@lanl.gov"><bertschinger@lanl.gov></a>

To: Jan Andersen <a class="moz-txt-link-rfc2396E" href="mailto:jan@comind.io"><jan@comind.io></a>, lustre

        <a class="moz-txt-link-rfc2396E" href="mailto:lustre-discuss@lists.lustre.org"><lustre-discuss@lists.lustre.org></a>

Subject: Re: [lustre-discuss] Coordinating cluster start and shutdown?

Message-ID:

        <a class="moz-txt-link-rfc2396E" href="mailto:PH8PR09MB103611A4B55E420410AE14ABAAB84A@PH8PR09MB10361.namprd09.prod.outlook.com"><PH8PR09MB103611A4B55E420410AE14ABAAB84A@PH8PR09MB10361.namprd09.prod.outlook.com></a>

Content-Type: text/plain; charset="iso-8859-1"

Hello Jan,

You can use the Pacemaker / Corosync high-availability software stack for this: specifically, ordering constraints [1] can be used.

Unfortunately, Pacemaker is probably over-the-top if you don't need HA -- its configuration is complex and difficult to get right, and it significantly complicates system administration. One downside of Pacemaker is that it is not easy to decouple the Pacemaker service from the Lustre services, meaning if you stop the Pacemaker service, it will try to stop all of the Lustre services. This might make it inappropriate for use cases that don't involve HA.

Given those downsides, if others in the community have suggestions on simpler means to accomplish this, I'd love to see other tools that can be used here (especially officially supported ones, if they exist).

[1] <a class="moz-txt-link-freetext" href="https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/html/constraints.html#specifying-the-order-in-which-resources-should-start-stop">https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/html/constraints.html#specifying-the-order-in-which-resources-should-start-stop</a>

- Thomas Bertschinger

------------------------------

Message: 4

Date: Wed, 6 Dec 2023 20:23:11 +0000

From: "Huang, Qiulan" <a class="moz-txt-link-rfc2396E" href="mailto:qhuang@bnl.gov"><qhuang@bnl.gov></a>

To: <a class="moz-txt-link-rfc2396E" href="mailto:lustre-discuss@lists.lustre.org">"lustre-discuss@lists.lustre.org"</a>

        <a class="moz-txt-link-rfc2396E" href="mailto:lustre-discuss@lists.lustre.org"><lustre-discuss@lists.lustre.org></a>

Cc: "Huang, Qiulan" <a class="moz-txt-link-rfc2396E" href="mailto:qhuang@bnl.gov"><qhuang@bnl.gov></a>

Subject: [lustre-discuss] Lustre server still try to recover the lnet

        reply to the depreciated clients

Message-ID:

        <a class="moz-txt-link-rfc2396E" href="mailto:BLAPR09MB685012E0F741E1B98F65F8C6CE84A@BLAPR09MB6850.namprd09.prod.outlook.com"><BLAPR09MB685012E0F741E1B98F65F8C6CE84A@BLAPR09MB6850.namprd09.prod.outlook.com></a>

Content-Type: text/plain; charset="iso-8859-1"

Hello all,

We removed some clients two weeks ago, but the Lustre server is still trying to handle the LNet recovery reply to those clients (the error log is posted below). They are also still listed in the exports directory.

I tried to run the following command to evict the clients, but it failed with the error "no exports found":

lctl set_param mdt.*.evict_client=10.68.178.25@tcp

Do you know how to clean up the removed, deprecated clients? Any suggestions would be greatly appreciated.

For example:

[root@mds2 ~]# ll /proc/fs/lustre/mdt/data-MDT0000/exports/10.67.178.25@tcp/

total 0

-r--r--r-- 1 root root 0 Dec  5 15:41 export

-r--r--r-- 1 root root 0 Dec  5 15:41 fmd_count

-r--r--r-- 1 root root 0 Dec  5 15:41 hash

-rw-r--r-- 1 root root 0 Dec  5 15:41 ldlm_stats

-r--r--r-- 1 root root 0 Dec  5 15:41 nodemap

-r--r--r-- 1 root root 0 Dec  5 15:41 open_files

-r--r--r-- 1 root root 0 Dec  5 15:41 reply_data

-rw-r--r-- 1 root root 0 Aug 14 10:58 stats

-r--r--r-- 1 root root 0 Dec  5 15:41 uuid

/var/log/messages:Dec  6 12:50:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message

/var/log/messages:Dec  6 13:05:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110

/var/log/messages:Dec  6 13:05:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message

/var/log/messages:Dec  6 13:20:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110

/var/log/messages:Dec  6 13:20:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message

/var/log/messages:Dec  6 13:35:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110

/var/log/messages:Dec  6 13:35:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message

/var/log/messages:Dec  6 13:50:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110

/var/log/messages:Dec  6 13:50:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message

/var/log/messages:Dec  6 14:05:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110

/var/log/messages:Dec  6 14:05:17 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message

/var/log/messages:Dec  6 14:20:16 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.178.25@tcp) recovery failed with -110

/var/log/messages:Dec  6 14:20:16 mds2 kernel: LNetError: 11579:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 1 previous similar message

/var/log/messages:Dec  6 14:30:17 mds2 kernel: LNetError: 3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.176.25@tcp) recovery failed with -111

/var/log/messages:Dec  6 14:30:17 mds2 kernel: LNetError: 3806712:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 3 previous similar messages

/var/log/messages:Dec  6 14:47:14 mds2 kernel: LNetError: 3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.176.25@tcp) recovery failed with -111

/var/log/messages:Dec  6 14:47:14 mds2 kernel: LNetError: 3812070:0:(lib-move.c:4005:lnet_handle_recovery_reply()) Skipped 8 previous similar messages

/var/log/messages:Dec  6 15:02:14 mds2 kernel: LNetError: 3817248:0:(lib-move.c:4005:lnet_handle_recovery_reply()) peer NI (10.67.176.25@tcp) recovery failed with -111

Regards,

Qiulan

------------------------------

End of lustre-discuss Digest, Vol 213, Issue 7

**********************************************

</pre>

    </blockquote>

  </body>

</html>