<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 10/12/2018 13:33,
<a class="moz-txt-link-abbreviated" href="mailto:quentin.bouget@cea.fr">quentin.bouget@cea.fr</a> wrote:<br>
</div>
<blockquote cite="mid:5f32d4fd-c1b3-b4ed-c9be-36b022ea304a@cea.fr"
type="cite">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<div class="moz-cite-prefix">On 10/12/2018 at 12:00, Julien Rey
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:5C0E4760.4030606@univ-paris-diderot.fr">Hello, <br>
<br>
We are running Lustre
2.8.0-RC5--PRISTINE-2.6.32-573.12.1.el6_lustre.x86_64. <br>
<br>
Since Thursday we have been getting a "bad address" error when
trying to write to the Lustre volume. <br>
<br>
Looking at the logs on the MDS, we see messages like these: <br>
<br>
Dec 10 06:26:18 localhost kernel: Lustre:
9593:0:(llog_cat.c:93:llog_cat_new_log()) lustre-MDD0000: there
are no more free slots in catalog <br>
Dec 10 06:26:18 localhost kernel: Lustre:
9593:0:(llog_cat.c:93:llog_cat_new_log()) Skipped 45157 previous
similar messages <br>
Dec 10 06:26:18 localhost kernel: LustreError:
9593:0:(mdd_dir.c:887:mdd_changelog_ns_store()) lustre-MDD0000:
cannot store changelog record: type = 6, name =
'PEPFOLD-00016_bestene1-mc-SC-min-grompp.log', t =
[0x20000a58f:0x858e:0x0], p = [0x20000a57d:0x17fd9:0x0]: rc =
-28 <br>
Dec 10 06:26:18 localhost kernel: LustreError:
9593:0:(mdd_dir.c:887:mdd_changelog_ns_store()) Skipped 45157
previous similar messages <br>
<br>
<br>
I saw here that this issue was supposed to be solved in 2.8.0: <br>
<a moz-do-not-send="true" class="moz-txt-link-freetext"
href="https://jira.whamcloud.com/browse/LU-6556">https://jira.whamcloud.com/browse/LU-6556</a>
<br>
<br>
Could someone help us resolve this situation? <br>
<br>
Thanks. <br>
<br>
</blockquote>
<p>Hello,</p>
<p>The log messages don't point to a "bad address" issue but
rather to a "no space left on device" one ("rc = -28" -->
-ENOSPC).</p>
<p>You most likely have, at some point, registered a changelog
user on your MDS, and that user is not consuming changelogs.</p>
<p>You can check this by running:</p>
<pre>[mds0]# lctl get_param mdd.*.changelog_users
mdd.lustre-MDT0000.changelog_users=
current index: 3
ID index
cl1 0
</pre>
<p>The most important thing to look for is the distance between
"current index" and the index for "cl1", "cl2", ...<br>
I expect that, for at least one changelog user, that distance is
2^32 (the maximum number of changelog records).<br>
Note that changelog indexes wrap around (0, 1, 2, ...,
4294967295, 0, 1, ...).</p>
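<p>To make that concrete, the distance is just the difference
modulo 2^32. A minimal shell sketch (the `changelog_distance`
helper and the sample values below are hypothetical, not lctl
output):</p>

```shell
#!/bin/bash
# Distance between the catalog's "current index" and a changelog
# user's index, taken modulo 2^32 because indexes wrap around.
changelog_distance() {
    local current=$1 user=$2
    echo $(( (current - user) & 0xFFFFFFFF ))
}

# Hypothetical values in the shape of the get_param output above:
changelog_distance 4160462682 21020582   # prints 4139442100
# Wrap-around case: "current index" has wrapped past the user's index.
changelog_distance 5 4294967294          # prints 7
```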
<p>If I am right, then you can either deregister the changelog
user:</p>
<pre>[mds0]# lctl --device lustre-MDT0000 changelog_deregister cl1
</pre>
<p>or acknowledge the records:</p>
<pre>[client]# lfs changelog_clear lustre-MDT0000 cl1 0
</pre>
<p>(clearing with index 0 is a shortcut for "acknowledge every
changelog record")</p>
<p>Both those options may take a while.</p>
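<p>Either way, you can keep an eye on progress by re-reading the
changelog user table and watching the user's index climb toward
"current index" (a sketch; adjust the interval to taste):</p>

```shell
# Re-read the changelog user table every 30 seconds; with the
# "acknowledge the records" option, cl1's index should advance
# toward "current index" as records are consumed.
watch -n 30 'lctl get_param mdd.*.changelog_users'
```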
<p>There is a third option that might yield faster results, but it
is also much more dangerous to use (you might want to check with
your support first):<br>
</p>
<pre>[mds0]# umount /dev/mdt0
[mds0]# mount -t ldiskfs /dev/mdt0 /mnt/lustre-mdt0
[mds0]# rm /mnt/lustre-mdt0/changelog_catalog
[mds0]# rm /mnt/lustre-mdt0/changelog_users
[mds0]# umount /dev/mdt0
[mds0]# mount -t lustre /dev/mdt0 <...> # remount the mdt where it was
</pre>
<p><b>I cannot guarantee this will not trash your filesystem. Use
at your own risk.<br>
</b></p>
<p>---</p>
<p>In recent versions (2.12, maybe even 2.10), Lustre comes with a
built-in garbage collector for slow/inactive changelog users.</p>
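<p>If you do upgrade, it may be worth checking whether that garbage
collector is exposed and enabled. The exact parameter name below is
an assumption on my part (it may differ between releases), so list
what your version actually provides first:</p>

```shell
# List the changelog-related tunables your release actually has.
[mds0]# lctl get_param -N mdd.*.changelog*
# Assumed name on 2.12-era releases (verify against the list above):
[mds0]# lctl get_param mdd.*.changelog_gc
```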
<p>Regards,<br>
Quentin Bouget<br>
</p>
</blockquote>
<br>
Hello Quentin,<br>
<br>
Many thanks for your quick reply.<br>
<br>
This is what I got when I issued the command you suggested:<br>
<br>
<pre>[root@lustre-mds]# lctl get_param mdd.*.changelog_users
mdd.lustre-MDT0000.changelog_users=
current index: 4160462682
ID index
cl1 21020582
</pre>
<br>
I then issued the following command:<br>
<pre>
[root@lustre-mds]# lctl --device lustre-MDT0000 changelog_deregister cl1</pre>
<br>
It's been running for almost 20 hours now. Do you have an estimate
of how long it could take?<br>
<br>
Best,<br>
<pre class="moz-signature" cols="72">--
Julien REY
Plate-forme RPBS
Molécules Thérapeutiques In Silico (MTi)
Université Paris Diderot - Paris VII
tel : 01 57 27 83 95 </pre>
</body>
</html>