[lustre-discuss] changelog catalog

Dilger, Andreas andreas.dilger at intel.com
Tue May 2 08:25:20 PDT 2017


On May 2, 2017, at 06:49, H.J. Zilverberg <h.j.zilverberg at rug.nl> wrote:
> 
> Hello all,
> 
> We are experiencing some problems with the changelog catalog.
> We had this enabled for Robinhood, but due to circumstances we stopped
> Robinhood and forgot to disable the changelog.
> At first this didn't cause problems, but after a month or two users were
> unable to write/delete files. In the logs we got:
> 
> kernel: LustreError: 27410:0:(llog_cat.c:82:llog_cat_new_log()) no free
> catalog slots for log...
> 
> Investigating this issue showed us that there were quite a few records
> in the changelog.
> 
> [root@pg-mds02 log]# lctl get_param mdd.pghome01-MDT0000.changelog_users
> mdd.pghome01-MDT0000.changelog_users=current index: 4758154916
> ID    index
> cl1   609095732
> 
> Which looks like a 32-bit number issue.

It is entirely possible that you have done 4.1B filesystem operations in
a few months, and all of the ChangeLog IDs are 64-bit values, so an index
above 2^32 is not by itself a sign of overflow...
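
As a rough check of that arithmetic, here is a minimal Python sketch that
parses the changelog_users output quoted above and computes how many records
each registered user has not yet cleared. The function name and the parsing
rules are illustrative assumptions based only on the output shown in this
thread; this is not part of any Lustre tool.

```python
import re

# Output copied from the "lctl get_param ... changelog_users" call quoted above.
SAMPLE = """\
mdd.pghome01-MDT0000.changelog_users=current index: 4758154916
ID    index
cl1   609095732
"""

def changelog_backlog(text):
    """Return (current_index, {user_id: records_not_yet_cleared}).

    Hypothetical helper: assumes user lines look like "cl<N>  <index>".
    """
    match = re.search(r"current index:\s*(\d+)", text)
    if match is None:
        raise ValueError("no 'current index' line found")
    current = int(match.group(1))

    backlog = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and re.fullmatch(r"cl\d+", fields[0]) and fields[1].isdigit():
            backlog[fields[0]] = current - int(fields[1])
    return current, backlog

current, backlog = changelog_backlog(SAMPLE)
print(current > 2**32 - 1)  # True: the index is already past the 32-bit range
print(backlog)              # {'cl1': 4149059184}
```

The difference of roughly 4.1 billion is the number of records written after
cl1 last cleared its log, which matches the figure mentioned above.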

> De-registering the user didn't help, the process was hogging one cpu and
> after it ran for 2 days the filesystem was still acting strange.
> When creating a new file you would get a bad address error back, but the
> file was created. Editing the file after that did work.
> 
> So we decided to kill it, reboot the servers, fsck the file systems and
> mount it all again. This worked without a problem.
> To test if the changelog catalog was cleared, we decided to register a
> changelog catalog user again and this time the current index matched the
> user, which is what we expected. Unfortunately when we deregistered the
> user again, the process went back to hogging one cpu and managed to
> crash the server after a day.
> 
> In short we now have a working file system but are a little concerned
> about the leftovers from the changelog catalog.
> We think that there are still loads of uncleared records that don't
> really affect the system now, but could become an issue when we want to
> use the changelog catalog again.
> Is there any way to find out how many records are left?
> Is it possible to remove these records manually?
> We are running Lustre 2.5.3-RC1

There have been quite a few fixes for ChangeLogs since 2.5.  I'd suggest
upgrading to a more recent release.

That said, if the ChangeLogs are disabled but not cleaned up completely,
then at worst they are consuming space on the MDT and a few thousand inodes.
You can check the free space on the MDT with "lfs df", and in most cases the
MDT has enough free blocks to handle this, so it probably isn't an urgent issue
to upgrade and fix this.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation








