[lustre-discuss] No free catalog slots for log ( Lustre 2.5.3 & Robinhood 2.5.3 )

Tue Dec 1 23:11:55 PST 2015

Hi all, 

We ran into a “no free catalog slots for log” problem yesterday. Users got a “Bad address” error when they tried to delete or create a file.

Here are some console logs from the MDS:
Dec  1 23:14:41  kernel: LustreError: 23658:0:(llog_cat.c:82:llog_cat_new_log()) no free catalog slots for log...
Dec  1 23:14:42 kernel: LustreError: 23635:0:(llog_cat.c:82:llog_cat_new_log()) no free catalog slots for log...
Dec  1 23:14:42  kernel: LustreError: 23635:0:(llog_cat.c:82:llog_cat_new_log()) Skipped 3029 previous similar messages
Dec 1 23:14:42   kernel: LustreError: 23316:0:(mdd_dir.c:783:mdd_changelog_ns_store()) changelog failed: rc=-28, op6 jobOptions_sim_digam_10.txt.bosslog c[0x200010768:0x2118:0x0] p[0x200012a20:0x186c8:0x0]
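
A side note on the rc=-28 in the last message: kernel return codes are negated errno values, and 28 is ENOSPC ("No space left on device") on Linux, which seems consistent with the "no free catalog slots" messages. Below is a minimal Python sketch for decoding such values from a log line. The helper name decode_rc and the regular expression are illustrative assumptions, not part of Lustre:

```python
import errno
import os
import re

def decode_rc(line):
    """Return (errno name, description) for the first 'rc=-N' in a log line, or None."""
    m = re.search(r"rc=-(\d+)", line)
    if m is None:
        return None
    code = int(m.group(1))
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)

line = "(mdd_dir.c:783:mdd_changelog_ns_store()) changelog failed: rc=-28, op6"
print(decode_rc(line))
```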

We solved the problem by deregistering the cl1 user, as suggested in this thread:
https://jira.hpdd.intel.com/browse/LU-1586
 # lctl --device besfs-MDT0000 changelog_deregister cl1 
The command has been running for 230:41.21 minutes and has not finished yet. The good news is that the MDS service returned to normal right after we executed the command. To avoid a recurrence before we understand the cause, we unmasked all the changelog operations and stopped Robinhood.

We are running Lustre 2.5.3 and Robinhood 2.5.3. Currently, there are 80 million files. MDT usage is 65% of capacity and 19% of inodes. The changelog_catalog file is only about 4 MB:
-rw-r--r--  1 root root  4153280 Jul 21 15:18 changelog_catalog 
And the changelog index of the cl1 user is:
 lctl get_param mdd.besfs-MDT0000.changelog_users
mdd.besfs-MDT0000.changelog_users=current index: 4199610352
ID    index
cl1   49035933 
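
To make the gap concrete, here is a minimal Python sketch that parses the output above and reports how far each changelog user lags behind the current index. The function name changelog_backlog is hypothetical, and the parsing assumes the output format is exactly as pasted above:

```python
import re

def changelog_backlog(output):
    """Map each changelog user ID to (current index - user index)."""
    m = re.search(r"current index:\s*(\d+)", output)
    if m is None:
        raise ValueError("no 'current index' found")
    current = int(m.group(1))
    backlog = {}
    for line in output.splitlines():
        # User lines look like "cl1   49035933".
        u = re.match(r"\s*(cl\d+)\s+(\d+)\s*$", line)
        if u:
            backlog[u.group(1)] = current - int(u.group(2))
    return backlog

sample = """mdd.besfs-MDT0000.changelog_users=current index: 4199610352
ID    index
cl1   49035933
"""
print(changelog_backlog(sample))  # {'cl1': 4150574419}
```

With our numbers, cl1 is about 4.15 billion records behind the current index.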

Here are 4 questions whose answers we could not find in LU-1586:
1. According to Andres’s reply, there should be some unconsumed changelog files on our MDT, and these files have used up all the space (file quotas?) that Lustre gives to changelogs. With Lustre 2.1, these files are under the OBJECTS directory and can be listed when the MDT is mounted as ldiskfs. In our case, with Lustre 2.5.3, there is no OBJECTS directory. How can we monitor the situation before the unconsumed changelogs take up all the disk space?
2. Why are there so many unconsumed changelogs? Could it be related to our frequent remounts of the MDT (in abort_recovery mode)?
3. Robinhood is still running when we remount the MDT. Why can't Robinhood consume those old changelogs after the MDT service has recovered?
4. Why is there such a huge difference between the current index (4199610352) and the cl1 index (49035933)?

Thank you for your time and help!

Wang,Lu

====================================================================
Computing center,the Institute of High Energy Physics, CAS, China
Wang, Lu ( 汪 璐 )                       Tel: (+86) 10 8823 6087
P.O. Box 918-7                           Fax: (+86) 10 8823 6839
Beijing 100049  P.R. China               Email: Lu.Wang at ihep.ac.cn
====================================================================