[lustre-discuss] [WARNING: ATTACHMENT UNSCANNED] MDS kernel panics - 2.5.3 - waking for gap in transno

Mon Sep 14 13:10:18 PDT 2015

Hmm.  Lfsck is not a bad idea, but it's unlikely to fix this as it does not actually address quota issues.

The VBR message is not entirely unexpected; I believe it's related to replay after a crash, though I don't know how serious it is.

Two thoughts:
One, did you run e2fsck more than once?  Sometimes one pass fixes things and thereby reveals further issues.  A good rule of thumb is to run it repeatedly until you get a 'clean' run (or hit some sort of unfixable pathological case).
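
A minimal sketch of that rule of thumb, in case it helps. The function name run_until_clean, the pass limit and the device path are placeholders, not anything from your setup. It relies only on the documented e2fsck exit status (0 = clean, 1 or 2 = errors were corrected, 4 and up = errors left uncorrected or the run itself failed). Run it with the MDT unmounted and with the Lustre-patched e2fsprogs.

```shell
# Repeat a check command until it exits 0 (a clean pass).
# Usage: run_until_clean MAX_PASSES COMMAND [ARGS...]
run_until_clean() {
    local max pass rc
    max=$1
    shift
    pass=1
    while [ "$pass" -le "$max" ]; do
        rc=0
        "$@" || rc=$?
        echo "pass $pass: exit status $rc"
        if [ "$rc" -eq 0 ]; then
            return 0        # clean run, nothing left to fix
        elif [ "$rc" -ge 4 ]; then
            return "$rc"    # uncorrected errors or an operational failure
        fi
        # rc is 1, 2 or 3: something was fixed, so go around again
        pass=$((pass + 1))
    done
    return 1                # still finding things to fix after MAX_PASSES
}

# Example, with a placeholder device path:
#   run_until_clean 5 e2fsck -f -y /dev/mdt_device
```

A return of 0 means the last pass was clean; anything else means the check either gave up or kept finding things to fix, which is the "pathological case" worth a closer look.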

Two, since you're still seeing quota issues, your best bet may be to disable and re-enable quotas with tunefs.  This will cause them to be regenerated from scratch, which can take a while but should get you back to a clean state.  (Note that you should only need to do this on the MDT.)
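
For reference, a sketch of what the disable/re-enable step might look like. This assumes that "tunefs" here means tune2fs from the Lustre-patched e2fsprogs on an ldiskfs MDT, and that the quota feature is toggled with -O; please check that against your version before relying on it. The device path and the helper name quota_regen_cmds are placeholders. The helper only prints the commands so they can be reviewed first; the MDT has to be unmounted when they are actually run.

```shell
# Print (do not run) the two commands that toggle the quota feature.
quota_regen_cmds() {
    # $1 = MDT block device (placeholder; substitute your own)
    echo "tune2fs -O ^quota $1"   # should drop the quota feature and its accounting files
    echo "tune2fs -O quota $1"    # should re-enable it and rebuild accounting from a scan
}

quota_regen_cmds /dev/mdt_device
```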

Here's a guess at your sequence of events, which you probably already suspect:
1. LBUG, essentially random bad luck (not related to quotas)
2. Quota corruption because quota stuff was in flight when the system panicked
3. E2fsck tried but was not able to fix everything

- Patrick
________________________________________
From: lustre-discuss [lustre-discuss-bounces at lists.lustre.org] on behalf of Steve Barnet [barnet at icecube.wisc.edu]
Sent: Monday, September 14, 2015 2:06 PM
To: lustre-discuss at lists.lustre.org
Subject: [WARNING: ATTACHMENT UNSCANNED][lustre-discuss] MDS kernel panics - 2.5.3 - waking for gap in  transno

Hi all,

   Earlier today, one of our MDSes kernel panicked. I was not able
to grab the stack trace, but it appeared to be in the locking code
path.

The last entries prior to the crash on our remote syslog server
looked like this:

Sep 14 10:17:19 lfs-us-mds kernel: LustreError:
6864:0:(mdt_handler.c:1444:mdt_getattr_name_lock()) ASSERTION( lock !=
NULL ) failed: Invalid lock handle 0x55755bac33ca291d
Sep 14 10:17:19 lfs-us-mds kernel: LustreError:
6864:0:(mdt_handler.c:1444:mdt_getattr_name_lock()) LBUG

After a reboot and e2fsck of the MDT, we saw many errors that look
like this:

Sep 14 12:40:35 lfs-us-mds kernel: LustreError:
3307:0:(ldlm_lib.c:1751:check_for_next_transno()) lfs3-MDT0000: waking
for gap in transno, VBR is OFF (skip: 66055727404, ql: 15, comp: 172,
conn: 187, next: 66055727409, last_committed: 66055720625)
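
To put a number on that gap, the fields can be pulled out of the message with sed. This is only a sketch: the helper name transno_field is made up for illustration, and reading "skip" as the transno recovery was waiting for and "next" as the transno of the next queued replay request is an interpretation of the message, not something confirmed in this thread.

```shell
# Print the numeric value of one "name: value" field from a log message.
transno_field() {
    # $1 = field name (e.g. skip), $2 = log message on a single line
    printf '%s\n' "$2" | sed -n "s/.*[( ]$1: \([0-9][0-9]*\).*/\1/p"
}

# The message from above, rejoined onto one line.
msg='lfs3-MDT0000: waking for gap in transno, VBR is OFF (skip: 66055727404, ql: 15, comp: 172, conn: 187, next: 66055727409, last_committed: 66055720625)'

skip=$(transno_field skip "$msg")
next=$(transno_field next "$msg")
echo "transnos skipped: $((next - skip))"
```

For the line above this prints "transnos skipped: 5".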

The output from the e2fsck is attached, but the quick summary is
that it was entirely QUOTA WARNINGs. Shortly after recovery
completed, the machine panicked again (trace attached).

After a reboot, it appears that the system has stabilized, but
as you might expect, it does not leave me with a warm, fuzzy
feeling.

Some of my searching indicates that an lfsck may be needed, but
before I start too far down that path, I'd like to have some idea
that this is indeed reasonable.

We are using 2.5.3 on all servers and clients:

lfs-us-mds ~ # cat /proc/fs/lustre/version
lustre: 2.5.3
kernel: patchless_client
build:  2.5.3-RC1--PRISTINE-2.6.32-431.23.3.el6_lustre.x86_64

OS is Scientific Linux 6.4:

Linux lfs-us-mds 2.6.32-431.23.3.el6_lustre.x86_64 #1 SMP Thu Aug 28
20:20:13 PDT 2014 x86_64 x86_64 x86_64 GNU/Linux

Any advice?

Thanks much, in advance!

Best,

---Steve

--
Steve Barnet
UW-Madison - IceCube
barnet at icecube.wisc.edu