[lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

Mohr Jr, Richard Frank (Rick Mohr) rmohr at utk.edu
Fri Oct 19 09:41:49 PDT 2018


> On Oct 19, 2018, at 10:42 AM, Marion Hakanson <hakansom at ohsu.edu> wrote:
> 
> Thanks for the feedback.  You're both confirming what we've learned so far, that we had to unmount all the clients (which required rebooting most of them), then reboot all the storage servers, to get things unstuck until the problem recurred.
> 
> I tried abort_recovery on the clients last night, before rebooting the MDS, but that did not help.  Could well be I'm not using it right:

Aborting recovery is a server-side action, not something that is done on the client.  As mentioned by Peter, you can abort recovery on a single target after it is mounted by using "lctl --device <DEV> abort_recover".  But you can also just skip over the recovery step when the target is mounted by adding the "-o abort_recov" option to the mount command.  For example,

mount -t lustre -o abort_recov /dev/my/mdt /mnt/lustre/mdt0

And similarly for OSTs.  So you should be able to just unmount your MDT/OST on the running file system and then remount with the abort_recov option.  From a client perspective, the Lustre client will get evicted but should automatically reconnect.
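
To make that concrete, here is a rough sketch of the sequence on an OSS (the device path, mount point, and device index below are placeholders for illustration, not values from any particular system):

# Unmount the OST, then remount it with recovery skipped
umount /mnt/lustre/ost0
mount -t lustre -o abort_recov /dev/my/ost /mnt/lustre/ost0

# Or, if the target is already mounted, abort recovery in place:
# look up the target's device index with "lctl dl", then
lctl --device 5 abort_recover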

Some applications can ride through a client eviction without causing issues, some cannot.  I think it depends largely on how the application does its IO and whether there is any IO in flight when the eviction occurs.  I have had to do this a few times on a running cluster, and in my experience, we have had good luck with most of the applications continuing without issues.  Sometimes there are a few jobs that abort, but overall this is better than having to stop all jobs and remount Lustre on all the compute nodes.
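
If you want to double-check that clients reconnected cleanly after the eviction, one thing you can look at on a client is the import state for each target, e.g.:

# state should return to FULL once the client has reconnected
lctl get_param mdc.*.import | grep state
lctl get_param osc.*.import | grep state

(Exact parameter layout varies a bit across Lustre versions, so treat this as a starting point rather than gospel.)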

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu


