[lustre-discuss] Inaccessible directory

Vicker, Darby (JSC-EG311) darby.vicker-1 at nasa.gov
Wed Aug 31 09:54:16 PDT 2016


Hello,

We’ve run into a problem where an entire directory on our lustre file system has become inaccessible.  

# mount | grep lustre2
192.52.98.142@tcp:/hpfs2eg3 on /lustre2 type lustre (rw,flock)
# ls -l /lustre2/mirrors/cpas
ls: cannot access /lustre2/mirrors/cpas: Stale file handle
# ls -l /lustre2/mirrors/
ls: cannot access /lustre2/mirrors/cpas: Stale file handle
4 drwxr-s--- 5 root     g27122      4096 Dec 23  2014 cev-repo/
? d????????? ? ?        ?              ?            ? cpas/
4 drwxrwxr-x 5 root     eg3         4096 Aug 21  2014 sls/
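
In case it helps with the diagnosis, here is what I was planning to gather next.  These are pulled from the manual and the list archives, so correct me if there is something more useful to collect (the debug log file name below is just a placeholder).

From a client, get the FID of the parent directory (cpas itself just returns ESTALE):

# lfs path2fid /lustre2/mirrors

Clear the debug log, reproduce the error, then dump the client-side debug log:

# lctl clear
# ls -l /lustre2/mirrors/cpas
# lctl dk /tmp/lustre-client-debug.log

And on the MDS, check dmesg/syslog for LustreError messages around the time of access.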



Fortunately, we have a backup of this directory from about a week ago.  However, I would like to figure out how this happened so we can prevent any further damage.  I’m not sure whether we’re dealing with corruption in the LFS, damage to the underlying RAID, or something else, and I’d appreciate some help figuring this out.  Here’s some info on our lustre servers:

CentOS 6.4 2.6.32-358.23.2.el6_lustre.x86_64
Lustre 2.4.3 (I know - we need to upgrade...)
Hardware RAID 10 MDT (Adaptec 6805Q – SSDs)
19 OSSs - hardware RAID 6 OSTs (Adaptec 6445)
1 27TB OST per OSS, ldiskfs
Dual-homed via Ethernet and IB

Most Ethernet clients (~50 total) are CentOS 7 using lustre-client-2.8.0-3.10.0_327.13.1.el7.x86_64.x86_64.  Our compute nodes (~400 total) connect over IB and are still CentOS 6 using lustre-client-2.7.0-2.6.32_358.14.1.el6.x86_64.x86_64.


The lustre server hardware is about 4 years old now.  All RAID arrays are reported as healthy.  I searched JIRA and the mailing lists and couldn’t find anything related.  This sounded close at first:

https://jira.hpdd.intel.com/browse/LU-3550

But, as shown above, the issue shows up on a native lustre client, not an NFS export.  We are exporting this LFS via SMB but I don’t think that’s related.  

I think the next step is to run an e2fsck, but we haven’t done that yet and I would appreciate advice on stepping through it.
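
For what it’s worth, the rough sequence I had in mind is below, pieced together from the manual, so the MDT mount point and device names are only placeholders.  Please correct me if the order is wrong or if an LFSCK pass would be a better first step.

On the MDS, with the MDT unmounted, do a read-only pass first using the lustre-patched e2fsprogs (-n reports problems without changing anything):

# umount /mnt/mdt
# e2fsck -fn /dev/mdtdev

Then, only after reviewing that output, re-run with -fy (or -fp) to let it repair what it found.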

Thanks,
Darby






