[lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

Marion Hakanson hakansom at ohsu.edu
Fri Oct 19 07:42:34 PDT 2018


Thanks for the feedback.  You're both confirming what we've learned so far: we had to unmount all the clients (which required rebooting most of them) and then reboot all the storage servers to get things unstuck, at least until the problem recurred.
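For the record, the client-side part of that looked roughly like the following (the mount point is just a placeholder for whatever each client actually uses):

    # on each client: force the unmount, discarding anything queued for replay
    umount -f /mnt/lustre     # placeholder path; -f can take a while, be patient
    # clients that would not unmount cleanly got rebooted instead

Only once all the clients were detached did rebooting the OSSes and the MDS actually get the filesystem responsive again.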

I tried abort_recovery on the clients last night, before rebooting the MDS, but that did not help.  It could well be that I'm not using it correctly:

- Look up the MDT in the "lctl dl" device list.
- Run "lctl abort_recovery $mdt" on all clients.
- Reboot the MDS.

The MDS still reported recovering all 259 clients at boot time.
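If I'm reading the manual right, the documented form of abort_recovery runs on the MDS itself, against the local MDT device, rather than from the clients.  Something like the following sketch (the device number and name here are made-up examples):

    # on the MDS: find the local MDT device number in the device list
    lctl dl
    #   e.g. " 9 UP mdt testfs-MDT0000 testfs-MDT0000_UUID 263"   (example output only)

    # abort recovery on that device
    lctl --device 9 abort_recovery

So it may simply be that running it from the clients, the way I did, never touches the server-side recovery state.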

BTW, we have a separate MGS from the MDS.  Could it be that we need to reboot both the MDS and the MGS to clear things?

Thanks and regards,

Marion


> On Oct 19, 2018, at 07:28, Peter Bortas <bortas at gmail.com> wrote:
> 
> That should fix it, but I'd like to advocate for using abort_recovery.
> Compared to unmounting thousands of clients, abort_recovery is a quick
> operation that takes only a few minutes. I wouldn't say it gets used a
> lot, but I've done it on NSC's live environment six times since 2016,
> solving the deadlocks each time.
> 
> Regards,
> -- 
> Peter Bortas
> Swedish National Supercomputer Centre
> 
>> On Fri, Oct 19, 2018 at 3:04 PM Patrick Farrell <paf at cray.com> wrote:
>> 
>> 
>> Marion,
>> 
>> You note the deadlock recurs on server reboot, so you’re really stuck.  This is most likely due to recovery, where operations from the clients are replayed.
>> 
>> If you’re fine with letting any pending I/O fail in order to get the system back up, I would suggest a client-side action: unmount (-f, and be patient) and/or shut down all of your clients.  That will discard the things the clients are trying to replay (causing pending I/O to fail).  Then shut down your servers and start them up again.  With no clients, there’s (almost) nothing to replay, and you probably won’t hit the issue on startup.  (There’s also the abort_recovery option covered in the manual, but I personally think this is easier.)
>> 
>> There’s no guarantee this avoids your deadlock happening again, but it’s highly likely it’ll at least get you running.
>> 
>> If you need to save your pending I/O, you’ll have to install patched software with a fix for this (sounds like WC has identified the bug) and then reboot.
>> 
>> Good luck!
>> - Patrick
>> ________________________________
>> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Marion Hakanson <hakansom at ohsu.edu>
>> Sent: Friday, October 19, 2018 1:32:10 AM
>> To: lustre-discuss at lists.lustre.org
>> Subject: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5
>> 
>> This issue is really kicking our behinds:
>> https://jira.whamcloud.com/browse/LU-11465
>> 
>> While we're waiting for the issue to get some attention from the Lustre developers, are there suggestions on how we can recover our cluster from this kind of deadlocked, stuck-threads-on-the-MDS (or OSS) situation?  Rebooting the storage servers does not clear the hang-up: upon reboot the MDS quickly ends up with roughly as many D-state threads as we have clients.  It seems to me that there is some state stashed away in the filesystem which restores the deadlock as soon as the MDS comes back up.
>> 
>> Thanks and regards,
>> 
>> Marion
>> 
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

