Hi Alexander, 

Before I received this reply, I deregistered the cl1 user. It took a very long time, and I am not sure whether it finished successfully, since the server crashed once the next morning.
Then I moved the old changelog_catalog file aside and created a zero-length changelog_user file instead.
This is what I got from the old changelog_catalog file. 
    # ls -l /tmp/changelog.dmp
    -rw-r--r-- 1 root root 4153280 Dec  6 06:54 /tmp/changelog.dmp
    # llog_reader changelog.dmp |grep "type=1064553b" |wc -l 
This number is smaller than 64768; I am not sure whether that is related to the unfinished deregistration.

The first record number is 1 and the last record number is 64767. I think there may be some skipped record numbers:
    # llog_reader changelog.dmp |grep "type=1064553b" |head -n 1 
    rec #1 type=1064553b len=64
    # llog_reader changelog.dmp |grep "type=1064553b" |tail -n 1 
    rec #64767 type=1064553b len=64
    # llog_reader changelog.dmp |grep "^rec" | grep -v "type=1064553b"  
This last command returns 0 lines.
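
One way to confirm whether record numbers were actually skipped is to list the missing indices directly. The sketch below is only an illustration: the function name find_llog_gaps is made up here, and it assumes that every catalog record line looks like the "rec #N type=1064553b len=64" lines shown above.

```shell
# Print every record index that is missing between the first and the
# last catalog record. Reads llog_reader output on stdin.
# Assumption: catalog record lines start with "rec #<index> ".
find_llog_gaps() {
    grep "type=1064553b" |
    sed -n 's/^rec #\([0-9][0-9]*\) .*/\1/p' |
    sort -n |
    awk 'NR > 1 && $1 > prev + 1 {
             for (i = prev + 1; i < $1; i++) print i
         }
         { prev = $1 }'
}

# Usage against the dump from above:
#   llog_reader changelog.dmp | find_llog_gaps | head
```

If this prints nothing, the record numbers between the first and last record are contiguous.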

By the way, are the llog files you mentioned virtual or real? If they are real, where are they located? Do I need to clean them up manually?

Here are 4 questions for which we cannot find answers in LU-1586:
1.       According to Andres's reply, there should be some unconsumed changelog files on our MDT, and these files have taken all the space (file quotas?) that Lustre gives to changelogs. With Lustre 2.1, these files are under the OBJECTS directory and can be listed in ldiskfs mode. In our case, with Lustre 2.5.3, no OBJECTS directory can be found. How can we monitor the situation before the unconsumed changelogs take up all the disk space?

The changelog is based on one catalog file and a set of plain llog files. The catalog stores a limited number of records, about 64768. A catalog record is 64 bytes, and each record holds information about one plain llog file. A plain llog file stores records about IO operations; it holds about 64768 records, and their size varies. So the changelog can store about 64768^2 IO operations, and it occupies filesystem space. The error "no free catalog slots" happens when the changelog catalog does not have a slot to store a record about a new plain llog. Either all slots are filled, or the internal changelog markers have become inconsistent and the internal logic does not work.
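
The numbers above can be worked out with a few lines of shell arithmetic. This is only a sketch: the variable names are made up for illustration, both 64768 figures are the approximate values quoted above, and it needs a shell with 64-bit arithmetic (for example bash on a 64-bit host).

```shell
# Figures taken from the explanation above; both counts are approximate.
catalog_slots=64768        # records the catalog can hold
records_per_plain=64768    # records one plain llog file can hold
catalog_rec_size=64        # bytes per catalog record

# Upper bound on changelog records before the catalog runs out of slots.
max_records=$((catalog_slots * records_per_plain))    # 4194893824

# Space taken by the catalog records alone, excluding any file header.
catalog_bytes=$((catalog_slots * catalog_rec_size))   # 4145152

echo "max changelog records: $max_records"
echo "catalog record bytes:  $catalog_bytes"
```

For comparison, the 4153280-byte catalog dump listed earlier in this thread is within a few KB of the second figure.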

To get closer to the root cause, you need to dump the changelog catalog and check its bitmap. Are there free slots? Something like:

debugfs -R "dump changelog_catalog changelog_catalog.dmp" /dev/md55 &&
used=`llog_reader changelog_catalog.dmp | grep "type=1064553b" | wc -l` 
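
The check above can be extended to report free slots as well. This is a sketch, not a Lustre tool: the helper name catalog_slot_usage is hypothetical, and the default total of 64768 is the approximate slot count quoted above, so the free count is an estimate.

```shell
# Summarize catalog slot usage from llog_reader output on stdin.
# $1 (optional): total number of slots; defaults to the approximate
# figure of 64768 mentioned above.
catalog_slot_usage() {
    total_slots=${1:-64768}
    used=$(grep -c "type=1064553b")
    echo "used=$used free=$((total_slots - used))"
}

# Usage:
#   llog_reader changelog_catalog.dmp | catalog_slot_usage
```

A free count at or near zero would match the "all slots are filled" case.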
2.       Why are there so many unconsumed changelogs? Could it be related to our frequent remounts of the MDT (abort_recovery mode)?

The umount operation creates a half-empty plain llog file, and changelog_clear cannot remove it even if all slots are freed. Only a new mount can remove that file. It could be related or not.

3.   When we remount the MDT, robinhood is still running. Why can robinhood not consume those old changelogs after the MDT service has recovered?
4.   Why is there a huge difference between the current index (4199610352) and the cl1 index (49035933)?

Thank you for your time and help!



Alexander Boyko
