[lustre-discuss] no more free slots in catalog

Tue Dec 11 01:28:09 PST 2018

On 10/12/2018 13:33, quentin.bouget at cea.fr wrote:
> On 10/12/2018 at 12:00, Julien Rey wrote:
>> Hello,
>>
>> We are running lustre 
>> 2.8.0-RC5--PRISTINE-2.6.32-573.12.1.el6_lustre.x86_64.
>>
>> Since Thursday we have been getting a "bad address" error when trying 
>> to write to the Lustre volume.
>>
>> Looking at the logs on the MDS, we are getting messages like these:
>>
>> Dec 10 06:26:18 localhost kernel: Lustre: 
>> 9593:0:(llog_cat.c:93:llog_cat_new_log()) lustre-MDD0000: there are 
>> no more free slots in catalog
>> Dec 10 06:26:18 localhost kernel: Lustre: 
>> 9593:0:(llog_cat.c:93:llog_cat_new_log()) Skipped 45157 previous 
>> similar messages
>> Dec 10 06:26:18 localhost kernel: LustreError: 
>> 9593:0:(mdd_dir.c:887:mdd_changelog_ns_store()) lustre-MDD0000: 
>> cannot store changelog record: type = 6, name = 
>> 'PEPFOLD-00016_bestene1-mc-SC-min-grompp.log', t = 
>> [0x20000a58f:0x858e:0x0], p = [0x20000a57d:0x17fd9:0x0]: rc = -28
>> Dec 10 06:26:18 localhost kernel: LustreError: 
>> 9593:0:(mdd_dir.c:887:mdd_changelog_ns_store()) Skipped 45157 
>> previous similar messages
>>
>>
>> I saw here that this issue was supposed to be solved in 2.8.0:
>> https://jira.whamcloud.com/browse/LU-6556
>>
>> Could someone help us resolve this situation?
>>
>> Thanks.
>>
> Hello,
>
> The log messages don't point at a "bad address" issue but rather at a 
> "no space left on device" one ("rc = -28" --> -ENOSPC).
>
> You most likely have, at some point, registered a changelog user on 
> your mds and that user is not consuming changelogs.
>
> You can check this by running:
>
> [mds0]# lctl get_param mdd.*.changelog_users
> mdd.lustre-MDT0000.changelog_users=
> current index: 3
> ID    index
> cl1   0
>
> The most important thing to look for is the distance between "current 
> index" and the index for "cl1", "cl2", ...
> I expect that, for at least one changelog user, that distance is 2^32 
> (the maximum number of changelog records).
> Note that changelog indexes wrap around (0, 1, 2, ..., 4294967295, 0, 
> 1, ...).
>
> If I am right, then you can either deregister the changelog user:
>
> [mds0]# lctl --device lustre-MDT0000 changelog_deregister cl1
>
> or acknowledge the records:
>
> [client]# lfs changelog_clear lustre-MDT0000 cl1 0
>
> (clearing with index 0 is a shortcut for "acknowledge every changelog 
> record")
>
> Both those options may take a while.
>
> There is a third one that might yield faster results, but it is also 
> much more dangerous to use (you might want to check with your support 
> first):
>
> [mds0]# umount /dev/mdt0
> [mds0]# mount -t ldiskfs /dev/mdt0 /mnt/lustre-mdt0
> [mds0]# rm /mnt/lustre-mdt0/changelog_catalog
> [mds0]# rm /mnt/lustre-mdt0/changelog_users
> [mds0]# umount /dev/mdt0
> [mds0]# mount -t lustre /dev/mdt0 <...> # remount the mdt where it was
>
> *I cannot guarantee this will not trash your filesystem. Use at your 
> own risk.*
>
> ---
>
> In recent versions (2.12, maybe even 2.10), Lustre comes with a 
> built-in garbage collector for slow/inactive changelog users.
>
> Regards,
> Quentin Bouget
>

Hello Quentin,

Many thanks for your quick reply.

This is what I got when I issued the command you suggested:

[root at lustre-mds]# lctl get_param mdd.*.changelog_users
mdd.lustre-MDT0000.changelog_users=
current index: 4160462682
ID    index
cl1   21020582
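
For reference, the distance between "current index" and each user's index can be computed from that output with a few lines of Python. This is only a rough sketch: the function name changelog_backlog is made up for illustration, the parsing assumes the output layout shown above, and the 2^32 wrap-around is taken from Quentin's description rather than checked against the Lustre source.

```python
import re

# Assumption (from the thread): changelog indexes wrap around after 4294967295.
INDEX_SPACE = 2 ** 32


def changelog_backlog(output):
    """Return {user_id: number of unconsumed records} parsed from the
    text printed by `lctl get_param mdd.*.changelog_users`.

    Hypothetical helper; the layout is assumed to match the sample below.
    """
    current = None
    backlog = {}
    for line in output.splitlines():
        line = line.strip()
        match = re.match(r"current index:\s*(\d+)", line)
        if match:
            current = int(match.group(1))
            continue
        match = re.match(r"(cl\d+)\s+(\d+)", line)
        if match and current is not None:
            # The modulo keeps the distance correct across a wrap-around.
            user_index = int(match.group(2))
            backlog[match.group(1)] = (current - user_index) % INDEX_SPACE
    return backlog


sample = """\
mdd.lustre-MDT0000.changelog_users=
current index: 4160462682
ID    index
cl1   21020582
"""

for user, pending in changelog_backlog(sample).items():
    share = pending / INDEX_SPACE
    print(f"{user}: {pending} pending records ({share:.1%} of the index space)")
```

With the values above, this reports 4139442100 pending records for cl1, which is most of the 2^32 index space.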

I then issued the following command:

[root at lustre-mds]# lctl --device lustre-MDT0000 changelog_deregister cl1

It has been running for almost 20 hours now. Do you have an estimate of 
how long it could take?

Best,

-- 
Julien REY

Plate-forme RPBS
Molécules Thérapeutiques In Silico (MTi)
Université Paris Diderot - Paris VII
tel : 01 57 27 83 95
