[lustre-discuss] changelog catalog

Wed May 10 00:41:23 PDT 2017

On 02-05-17 17:25, Dilger, Andreas wrote:
> On May 2, 2017, at 06:49, H.J. Zilverberg <h.j.zilverberg at rug.nl> wrote:
>> Hello all,
>>
>> We are experiencing some problems with the changelog catalog.
>> We had this enabled for Robinhood, but due to circumstances we stopped
>> Robinhood and forgot to disable the changelog.
>> At first this didn't cause problems, but after a month or two users were
>> unable to write/delete files. In the logs we got:
>>
>> kernel: LustreError: 27410:0:(llog_cat.c:82:llog_cat_new_log()) no free
>> catalog slots for log...
>>
>> Investigating this issue showed us that there were quite a few records
>> in the changelog.
>>
>> [root@pg-mds02 log]# lctl get_param mdd.pghome01-MDT0000.changelog_users
>> mdd.pghome01-MDT0000.changelog_users=current index: 4758154916
>> ID    index
>> cl1   609095732
>>
>> This looks like a 32-bit number issue.
> It is entirely possible that you have done 4.1B filesystem operations in
> a few months, and all of the ChangeLog IDs are 64-bit values...
>
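For reference, the arithmetic behind the two indices quoted above can be
sketched as follows. This is only an illustration: the helper name
changelog_backlog is made up here, the sample text is copied from the
output above, and it assumes that the difference between the current index
and a user's index is the number of records still pending for that user.

```python
import re

# Output of "lctl get_param mdd.pghome01-MDT0000.changelog_users" as quoted
# in this thread.
SAMPLE = """\
mdd.pghome01-MDT0000.changelog_users=current index: 4758154916
ID    index
cl1   609095732
"""


def changelog_backlog(text):
    """Return (current_index, {user_id: pending_records}).

    Hypothetical helper: assumes pending = current index - user index.
    """
    current = int(re.search(r"current index:\s*(\d+)", text).group(1))
    users = {
        m.group(1): int(m.group(2))
        for m in re.finditer(r"^(cl\d+)\s+(\d+)\s*$", text, re.MULTILINE)
    }
    return current, {uid: current - idx for uid, idx in users.items()}


current, backlog = changelog_backlog(SAMPLE)
print(backlog)          # {'cl1': 4149059184}
print(current > 2**32)  # True
```

The difference of about 4.1 billion matches the figure mentioned above, and
the current index is already larger than 2**32, which is consistent with the
indices being 64-bit counters rather than a wrapped 32-bit value.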
>> De-registering the user didn't help, the process was hogging one cpu and
>> after it ran for 2 days the filesystem was still acting strange.
>> When creating a new file you would get a bad address error back, but the
>> file was created. Editing the file after that did work.
>>
>> So we decided to kill it, reboot the servers, fsck the file systems and
>> mount it all again. This worked without a problem.
>> To test if the changelog catalog was cleared, we decided to register a
>> changelog catalog user again and this time the current index matched the
>> user, which is what we expected. Unfortunately when we deregistered the
>> user again, the process went back to hogging one cpu and managed to
>> crash the server after a day.
>>
>> In short we now have a working file system but are a little concerned
>> about the leftovers from the changelog catalog.
>> We think that there are still loads of uncleared records that don't
>> really affect the system now, but could become an issue when we want to
>> use the changelog catalog again.
>> Is there any way to find out how many records are left?
>> Is it possible to remove these records manually?
>> We are running Lustre 2.5.3-RC1
> There have been quite a few fixes for ChangeLogs since 2.5.  I'd suggest
> upgrading to a more recent release.

Upgrading is always an issue with an HPC cluster.

> That said, if the ChangeLogs are disabled but not cleaned up completely,
> then at worst they are consuming space on the MDT and a few thousand inodes.
> You can check the free space on the MDT with "lfs df", and in most cases the
> MDT has enough free blocks to handle this, so it probably isn't an urgent issue
> to upgrade and fix this.
That sounds hopeful. We will stretch it to the next upgrade of the cluster.

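A rough sketch of how the "lfs df" check could be scripted to keep an eye
on the MDT until that upgrade. The function name mdt_usage is made up, and
the sample text below is not real output from our system: the numbers and
the /home mount point are placeholders that only follow the usual column
layout (UUID, total, used, available, use%, mount point). The same parsing
should work for "lfs df -i", which reports inodes instead of blocks.

```python
# Placeholder text in the usual "lfs df" layout; values are NOT real.
SAMPLE = """\
UUID                   1K-blocks        Used   Available Use% Mounted on
pghome01-MDT0000_UUID    1000000      250000      750000  25% /home[MDT:0]
pghome01-OST0000_UUID    8000000     2000000     6000000  25% /home[OST:0]
"""


def mdt_usage(lfs_df_text):
    """Return [(target, used_fraction)] for the MDT lines only."""
    rows = []
    for line in lfs_df_text.splitlines():
        fields = line.split()
        # Data lines have at least six columns; the header and summary
        # lines do not start with a target name containing "MDT".
        if len(fields) >= 6 and "MDT" in fields[0]:
            total, used = int(fields[1]), int(fields[2])
            rows.append((fields[0], used / total))
    return rows


for target, fraction in mdt_usage(SAMPLE):
    print(f"{target}: {fraction:.0%} used")
```

In practice the text would come from running "lfs df" (or "lfs df -i") on a
client and feeding its output to the function.
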
> Cheers, Andreas
> --

Thanks,
Henk-Jan
