[lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

Patrick Farrell paf at cray.com
Fri Oct 19 06:02:14 PDT 2018


Marion,

You note the deadlock recurs on server reboot, so you're really stuck.  This is most likely due to recovery, during which operations from the clients are replayed.

If you're fine with letting any pending I/O fail in order to get the system back up, I would suggest a client-side action: unmount (-f, and be patient) and/or shut down all of your clients.  That discards whatever the clients are trying to replay (causing the pending I/O to fail).  Then shut down your servers and start them up again.  With no clients, there's (almost) nothing to replay, and you probably won't hit the issue on startup.  (There's also the abort_recovery option covered in the manual, but I personally think this is easier.)  A rough sketch of that sequence is below.
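For concreteness, roughly what that looks like (the mount points and device names here are placeholders; substitute your own):

    # On every client: force-unmount so there is nothing left to replay
    umount -f /mnt/lustre

    # Then restart the server targets as usual, e.g. on the MDS:
    umount /mnt/mdt
    mount -t lustre /dev/<mdt_device> /mnt/mdt

    # The abort_recovery alternative mentioned above: either abort a
    # recovery already in progress ('lctl dl' shows the device index)...
    lctl --device <devno> abort_recovery
    # ...or mount the target with recovery skipped entirely:
    mount -t lustre -o abort_recov /dev/<mdt_device> /mnt/mdt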

There's no guarantee this will prevent the deadlock from recurring, but it's very likely to at least get you running again.

If you need to preserve your pending I/O, you'll have to install patched software containing a fix for this (it sounds like Whamcloud has identified the bug) and then reboot.

Good luck!
- Patrick
________________________________
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Marion Hakanson <hakansom at ohsu.edu>
Sent: Friday, October 19, 2018 1:32:10 AM
To: lustre-discuss at lists.lustre.org
Subject: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

This issue is really kicking our behinds:
https://jira.whamcloud.com/browse/LU-11465

While we're waiting for the issue to get some attention from Lustre developers, are there suggestions on how we can recover our cluster from this kind of deadlocked, stuck-threads-on-the-MDS (or OSS) situation?  Rebooting the storage servers does not clear the hang-up: upon reboot, the MDS quickly ends up with the same set of D-state threads (roughly one per client).  It seems as though some state stashed away in the filesystem restores the deadlock as soon as the MDS comes up.

Thanks and regards,

Marion


