[lustre-discuss] no more free slots in catalog

Mon Dec 10 04:33:44 PST 2018

On 10/12/2018 at 12:00, Julien Rey wrote:
> Hello,
>
> We are running lustre 
> 2.8.0-RC5--PRISTINE-2.6.32-573.12.1.el6_lustre.x86_64.
>
> Since Thursday we have been getting a "bad address" error when trying to 
> write to the Lustre volume.
>
> Looking at the logs on the MDS, we see messages like these:
>
> Dec 10 06:26:18 localhost kernel: Lustre: 
> 9593:0:(llog_cat.c:93:llog_cat_new_log()) lustre-MDD0000: there are no 
> more free slots in catalog
> Dec 10 06:26:18 localhost kernel: Lustre: 
> 9593:0:(llog_cat.c:93:llog_cat_new_log()) Skipped 45157 previous 
> similar messages
> Dec 10 06:26:18 localhost kernel: LustreError: 
> 9593:0:(mdd_dir.c:887:mdd_changelog_ns_store()) lustre-MDD0000: cannot 
> store changelog record: type = 6, name = 
> 'PEPFOLD-00016_bestene1-mc-SC-min-grompp.log', t = 
> [0x20000a58f:0x858e:0x0], p = [0x20000a57d:0x17fd9:0x0]: rc = -28
> Dec 10 06:26:18 localhost kernel: LustreError: 
> 9593:0:(mdd_dir.c:887:mdd_changelog_ns_store()) Skipped 45157 previous 
> similar messages
>
>
> I saw here that this issue was supposed to be solved in 2.8.0:
> https://jira.whamcloud.com/browse/LU-6556
>
> Could someone help us resolve this situation?
>
> Thanks.
>
Hello,

The log messages don't point at a "bad address" issue but rather at a 
"no space left on device" one ("rc = -28" --> -ENOSPC).
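
To make the mapping explicit: kernel code reports failures as negated errno 
values, so "rc = -28" is errno 28. Here is a minimal sketch of that lookup, 
using a hypothetical helper (errno_name is my own name, not a Lustre tool) 
that only knows the two Linux errno values relevant to this thread:

```shell
# Hypothetical helper, for illustration only: translate a (possibly
# negated) Linux errno value into its symbolic name.
# Only the two values discussed in this thread are covered.
errno_name() {
    case "${1#-}" in    # strip a leading "-" so -28 and 28 both match
        14) echo "EFAULT (Bad address)" ;;
        28) echo "ENOSPC (No space left on device)" ;;
        *)  echo "unknown" ;;
    esac
}

errno_name -28
```

The "bad address" text seen on the client corresponds to EFAULT, which is a 
different error code from the one logged on the MDS.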

You most likely have, at some point, registered a changelog user on your 
mds and that user is not consuming changelogs.

You can check this by running:

[mds0]# lctl get_param mdd.*.changelog_users
mdd.lustre-MDT0000.changelog_users=
current index: 3
ID    index
cl1   0

The most important thing to look for is the distance between "current 
index" and the index for "cl1", "cl2", ...
I expect that, for at least one changelog user, that distance is 2^32 (the 
maximum number of changelog records).
Note that changelog indexes wrap around (0, 1, 2, ..., 4294967295, 0, 1, 
...).
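
As a rough sketch of that comparison, the distance can be computed modulo 
2^32 so that wrapped indexes are handled. The two functions below are 
hypothetical helpers (the names are mine), and they assume the output layout 
shown above: a "current index: N" line followed by one "clN index" line per 
user.

```shell
# Hypothetical helper: number of records between a user's index and the
# current index, assuming indexes wrap around at 2^32 as described above.
changelog_backlog() {
    # $1 = current index, $2 = the user's index
    echo $(( ($1 - $2 + 4294967296) % 4294967296 ))
}

# Hypothetical helper: read changelog_users output on stdin and print
# "<user id> <backlog>" for every registered user.
changelog_backlogs() {
    current=""
    while read -r first second third rest; do
        case "$first" in
            current)  current=$third ;;    # "current index: 3"
            cl[0-9]*) echo "$first $(changelog_backlog "$current" "$second")" ;;
        esac
    done
}
```

It could be fed directly from the command above:

[mds0]# lctl get_param mdd.*.changelog_users | changelog_backlogs

Because of the modulo, a distance of exactly 2^32 comes out as 0, so a 
result of 0 on an MDS that is logging the errors above deserves a closer 
look rather than being read as "nothing pending".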

If I am right, then you can either deregister the changelog user:

[mds0]# lctl --device lustre-MDT0000 changelog_deregister cl1

or acknowledge the records:

[client]# lfs changelog_clear lustre-MDT0000 cl1 0

(clearing with index 0 is a shortcut for "acknowledge every changelog 
record")

Both those options may take a while.

There is a third option that might yield a faster result, but it is also much 
more dangerous to use (you might want to check with your support first):

[mds0]# umount /dev/mdt0
[mds0]# mount -t ldiskfs /dev/mdt0 /mnt/lustre-mdt0
[mds0]# rm /mnt/lustre-mdt0/changelog_catalog
[mds0]# rm /mnt/lustre-mdt0/changelog_users
[mds0]# umount /dev/mdt0
[mds0]# mount -t lustre /dev/mdt0 <...> # remount the mdt where it was

*I cannot guarantee this will not trash your filesystem. Use at your own 
risk.*

---

In recent versions (2.12, maybe even 2.10), Lustre comes with a built-in 
garbage collector for slow/inactive changelog users.

Regards,
Quentin Bouget
