[lustre-discuss] Is there a way to have faster lustre file system checker (lfsck)?

Yong, Fan fan.yong at intel.com
Tue May 1 19:49:39 PDT 2018


Inline comments.

--
Cheers,
Nasf


> -----Original Message-----
> From: lustre-discuss [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf
> Of 代栋
> Sent: Wednesday, May 2, 2018 5:36 AM
> To: lustre-discuss at lists.lustre.org
> Subject: [lustre-discuss] Is there a way to have faster lustre file system checker
> (lfsck)?
> 
> Hi all,
> 
> I am still new to Lustre, so please let me know if I should send this message to
> devel-list.
> 
> This week, I tried to run LFSCK on a very small cluster configuration (1 MDT and
> 3 OSTs).  In this Lustre file system, I used about 300K inodes.  It took about 80
> minutes to finish one LFSCK run.  More importantly, while LFSCK is running, CPU
> utilization on both the MDT and the OSTs is 100%, all taken by the lfsck thread.

Which version of Lustre are you running, and what LFSCK command line did you use?


> I understand that lfsck operates in online mode, so it is slow.  But I am
> wondering whether there is any way to accelerate it, especially if I am allowed to
> run it offline, for example during weekly maintenance.

That slowness is abnormal, and it is not caused by LFSCK running online. LFSCK cannot be run in offline mode.
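
For scale, the figures quoted above (about 300K inodes in about 80 minutes) work out to a rate of only tens of inodes per second. A quick check of that arithmetic, using only the numbers from the original message (the variable names are just for illustration):

```python
# Figures taken from the original message; both are approximate.
inodes_used = 300_000      # "about 300K inodes"
run_minutes = 80           # "about 80 mins"

# Average scan rate over the whole LFSCK run, in inodes per second.
inodes_per_second = inodes_used / (run_minutes * 60)

print(f"{inodes_per_second:.1f} inodes/s")
```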


> 
> After checking the lfsck kernel logs, I noticed that in the phase-2 scanning on
> the OSTs, there is a 30-second interval between queries to the MDTs.  I am
> wondering whether there is any reason for this 30-second interval, and whether
> lfsck on the OSTs would be faster if we removed it.

Normally, the master engine on the MDT notifies the LFSCK engine on the OST when the first-phase scan is done. But we cannot guarantee that the LFSCK engine on the MDT stays alive for the whole LFSCK run (it may go away because of some failure, network trouble, a node crash, and so on). So during the second-phase scan, if the LFSCK engine on the OST has not received the notification from the MDT, it has to query the LFSCK status on the MDT periodically. If the MDT finishes its first-phase scan before the OST does, there is no such query. In any case, this query is not the reason your LFSCK is slow.
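
The notify-or-query behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not the actual Lustre kernel code: the function name, the status strings, and the `query_mdt_status` callback are all hypothetical, and the 30-second default is simply the interval reported in the question.

```python
import threading

QUERY_INTERVAL_SEC = 30  # interval reported in the question; assumed here


def wait_for_phase1_done(notified, query_mdt_status,
                         interval=QUERY_INTERVAL_SEC, max_queries=None):
    """Model of the OST-side wait for the MDT to finish phase 1.

    notified         -- threading.Event set when the MDT's notification arrives
    query_mdt_status -- hypothetical callback returning the MDT's LFSCK status
    Returns (how, queries): how the OST learned the result, and how many
    status queries it had to send.
    """
    queries = 0
    while max_queries is None or queries < max_queries:
        # Normal path: the MDT's notification arrives, so no query is needed.
        if notified.wait(timeout=interval):
            return "notified", queries
        # Fallback path: no notification within the interval (MDT failure,
        # network trouble, ...), so ask the MDT for its status directly.
        queries += 1
        if query_mdt_status() == "phase1_done":
            return "queried", queries
    return "gave_up", queries


# Case 1: the MDT already finished phase 1 and notified the OST.
done = threading.Event()
done.set()
print(wait_for_phase1_done(done, lambda: "scanning", interval=0.01))

# Case 2: the notification never arrives; the OST learns the result by querying.
answers = iter(["scanning", "scanning", "phase1_done"])
print(wait_for_phase1_done(threading.Event(), lambda: next(answers), interval=0.01))
```

In this model the interval only adds latency in the fallback case, while the OST is idle and waiting for the MDT. It does not slow down the scanning itself.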


> 
> Thanks,
> - Dong
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

