[lustre-discuss] "ls" hangs for certain files AND lfsck_namespace gets stuck in scanning-phase1, same position

Sternberg, Michael G. sternberg at anl.gov
Mon Jun 17 14:48:50 PDT 2019

Nasf, all,

Upon post-recovery analysis, it appears that the "ls" hangs on my system were caused by inconsistencies with MDT striping (DNE); see some details below.

I'm not sure whether I should file this in Jira, but I wanted to document it here at least.

- Nasf mentioned agent inodes for cross-MDT objects. Could it be that such hangs are caused by cross-MDT objects being *missing*?

In my case, secondary trouble arose because LFSCK got stuck and could not resolve the issue.

As for how this might have happened on my system: I did have a couple of unexpected *HA double failures* for MDS nodes a few months ago, which may have interfered with Lustre recovery. The double failures were due to how NetworkManager in RHEL7 would tear down the entire network stack on a server when just *one* of its network interfaces went offline, as happens to the host-to-host heartbeat link on the active HA node when its peer reboots.  {Solution: (a) disable NetworkManager control for the heartbeat interfaces, and (b) use Corosync RRP, i.e., secondary heartbeat links; see the sketch below.}
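A minimal sketch of those two workarounds on RHEL7 (interface name, addresses, and file paths are placeholders, not my actual configuration):

	# (a) take the heartbeat interface (hypothetical name "hb0") out of
	#     NetworkManager's control, e.g. in
	#     /etc/sysconfig/network-scripts/ifcfg-hb0:
	NM_CONTROLLED=no

	# (b) corosync.conf: enable the redundant ring protocol and add a second
	#     heartbeat ring (bindnetaddr values are placeholders):
	totem {
	    rrp_mode: passive
	    interface {
	        ringnumber: 0
	        bindnetaddr: 10.0.0.0
	    }
	    interface {
	        ringnumber: 1
	        bindnetaddr: 10.1.0.0
	    }
	}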

- To clarify, I did *not* do an MDT file-level restore; I only mentioned that as an item from the operations manual, from which I gathered which kinds of Lustre-internal files were expendable (like LFSCK and OI files), and extrapolated a bit from there. If useful, I could do additional analysis.

- Possibly related:  https://jira.whamcloud.com/browse/LU-11584 , which also mentioned "ls -l" hanging.

Nasf, thank you for clarifying agent inodes. While my "rm" of all of them from the ldiskfs may have removed a few more than strictly necessary, the fact that *all* of my MDT-striped dirs were implicated (see below), plus one outlier, made their removal a promising avenue of attack.
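For the record, a rough sketch of how such zero-permission entries can be located on an MDT mounted as ldiskfs (device name and mount point are placeholders; I would mount read-only first for inspection):

	mount -t ldiskfs -o ro /dev/mapper/mdt0 /mnt/mdt0_ldiskfs
	# agent inodes show up with mode 0000 and zero owner/group:
	find /mnt/mdt0_ldiskfs/ROOT -perm 0000 -ls
	umount /mnt/mdt0_ldiskfs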

Best wishes,

From post-recovery analysis:

Prior to my intervention at the MDT ldiskfs level (removal of all 0-permission files), I had determined on a client the dirs for which "ls -l" would hang by running a number of du(1) processes in parallel and seeing which ones got stuck:

	ls -d /home/[A-Z]*/* /home/[a-z]* | xargs -n1 -P12 du

(The glob is two-level to account for voluminous project and software-build directories separately from the equally plentiful user homes.) After the last input had been taken up by a du(1) process and indeed finished (as monitored by ps(8) and/or lsof(8)), I found stuck du processes for:


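(For reference, one way to flag processes stuck in uninterruptible sleep, roughly what I was watching for with ps(8); a sketch, not the exact invocation I used:)

	# list processes in state D (uninterruptible sleep), e.g. the stuck du's,
	# along with the kernel function they are blocked in:
	ps -eo pid,stat,wchan:30,args | awk 'NR == 1 || $2 ~ /^D/'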
Now, I had used MDT striping only sparingly upon initially populating the file system. Strikingly, from my notes, the dirs that I had created with an MDT stripe count of 2 were a strict subset of the ones that failed, in fact the majority:

	lfs mkdir -i 1 -c 2 /home/mcXXX
	lfs mkdir -i 1 -c 2 /home/krXXX
	lfs mkdir -i 1 -c 2 /home/SOFT/spXXX
	lfs mkdir -i 1 -c 2 /home/SHARE/g-fXXX
	lfs mkdir -i 1 -c 2 /home/SHARE/g-cXXX
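
(For completeness, the resulting MDT striping can be verified from a client with lfs getdirstripe; the exact output format varies by Lustre version:)

	lfs getdirstripe /home/SOFT/spXXX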

> On 2019-06-16, at 09:00 , Yong, Fan <fan.yong at intel.com> wrote:
> Hi Michael,
> The inode with zero permission and zero owner/group is NOT equal to corruption; instead, it is quite possibly a Lustre agent inode for a cross-MDT object. In your case, for the file ".bash_history", its name entry exists on MDT0 (with a local agent inode), while the object itself resides on MDT1. The permission and owner/group information for an agent inode are always zero. On the other hand, its time bits are valid. That also indicates a Lustre backend agent inode, not a corrupted one.
> Usually, if there is data corruption for some inode, then the output may look like:
> ?????  1 ??? ???            xxx ??? .bash_history
> You resolved the stuck issue by removing these 'trouble' agent inodes, but that may not be the root cause (and may cause data loss), although I do not know what the root cause is. Anyway, if you have restored the system from a file-level backup, then you may have lost the clues to the root cause.
