[lustre-discuss] MDS kernel panics - 2.5.3 - waking for gap in transno

Steve Barnet barnet at icecube.wisc.edu
Tue Sep 15 06:01:11 PDT 2015


Hi Patrick,

   First of all, thanks much for taking the time to look into
this. It is much appreciated!


On 9/14/15 3:10 PM, Patrick Farrell wrote:
> Hmm.  Lfsck is not a bad idea, but it's unlikely to fix this as it does not actually address quota issues.
>
> The VBR message is not entirely unexpected, I believe it's related to replay after a crash, though I don't know how serious it is.
>
> Two thoughts:
> one, did you run e2fsck more than once?  Sometimes one pass will fix things that will reveal further issues.  A good rule of thumb is to run repeatedly until you get a 'clean' run (or some sort of unfixable pathological case).


Yes, but I probably should have run it at least one more time.
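
For my own notes (and for anyone who finds this thread later):
next time I'll force the check and keep re-running it until a
pass comes back clean. Something along these lines, where
/dev/mapper/mdt is just a stand-in for our actual MDT device:

   # -f forces a full check even if the superblock claims clean,
   # -y answers yes to every fix (device path is a placeholder)
   e2fsck -fy /dev/mapper/mdt
   echo $?   # 0 = clean, 1 or 2 = fixes made (run again), 4+ = unfixed errors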

>
> Two, since you're still seeing quota issues, your best bet may be to disable and re-enable quotas with tunefs.
> This will cause them to be regenerated from scratch, which can take a while but should get you back to a clean state.
> (Note that you should only need to do this on the MDT.)
>


Sounds like a good idea. At least this particular file system
is only about 300 TB, so it shouldn't take too long to regenerate.
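
If I follow correctly, the on-disk side of that would be roughly
the following on the unmounted MDT (device path and mount point
below are placeholders, and tune2fs is my guess at what "tunefs"
refers to here):

   umount /mnt/mdt                      # MDT must be offline
   tune2fs -O ^quota /dev/mapper/mdt    # drop the quota feature
   tune2fs -O quota /dev/mapper/mdt     # re-enable it; usage gets recomputed
   mount -t lustre /dev/mapper/mdt /mnt/mdt

Please correct me if that's not the intended procedure.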


> Here's a guess at your sequence of events, which you probably already suspect:
> LBUG, essentially random bad luck (not related to quotas)


Yep. The only indicator we had pointed roughly in the
direction of locking.


> Quota corruption because quota stuff was in flight when the system panicked


Looks likely. I haven't looked into Lustre quotas enough
to know how likely/possible it is for quotas to de-sync
over time. It seems like keeping track of that sort of
info in a distributed filesystem could be a little tricky. :-)
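
About the only sanity check I know of is to spot-check a user or
two with lfs quota from a client and see whether the numbers look
plausible, e.g. (username and mount point are made up here):

   lfs quota -u someuser /mnt/lfs3

If the accounting drifts badly over time, that should at least
make it visible.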


> E2fsck tried but was not able to fix everything


I think it took care of most of it, and at least there were no
indicators of more serious corruption. Pretty impressive since
it had to deal with a couple of panics in various states of
recovery.

The one out-of-the-ordinary thing I noticed was that the
recovery took quite a lot longer than "normal." Happily, we
have not had too many occasions to see this, but a "normal"
recovery for us takes between 5 and 10 minutes, usually closer
to 5.

In this case, it took about 25 minutes. During that window,
the VBR messages were extensive. Shortly after recovery
completed, the second panic happened. The recovery from that
panic looked normal and things have been fine since.

That leads me to speculate that the locking state was not
entirely good, and it took some time to sort that out. Once
it was sorted, normal operation could resume. I'm not sure that
makes any sense, or is helpful in any way, but I figure I can
pass it along in case it's useful to the folks who have more of
a clue than I do.
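
In case it helps anyone else stuck in a long recovery window,
the progress can be watched from the MDS with something like:

   lctl get_param mdt.lfs3-MDT0000.recovery_status

which reports the connected/completed client counts and the time
remaining.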

Thanks again!

Best,

---Steve


>
> - Patrick
> ________________________________________
> From: lustre-discuss [lustre-discuss-bounces at lists.lustre.org] on behalf of Steve Barnet [barnet at icecube.wisc.edu]
> Sent: Monday, September 14, 2015 2:06 PM
> To: lustre-discuss at lists.lustre.org
> Subject: [lustre-discuss] MDS kernel panics - 2.5.3 - waking for gap in transno
>
> Hi all,
>
>     Earlier today, one of our MDSes kernel panicked. I was not able
> to grab the stack trace, but it appeared to be in the locking code
> path.
>
> The last entries prior to the crash on our remote syslog server
> looked like this:
>
> Sep 14 10:17:19 lfs-us-mds kernel: LustreError:
> 6864:0:(mdt_handler.c:1444:mdt_getattr_name_lock()) ASSERTION( lock !=
> NULL ) failed: Invalid lock handle 0x55755bac33ca291d
> Sep 14 10:17:19 lfs-us-mds kernel: LustreError:
> 6864:0:(mdt_handler.c:1444:mdt_getattr_name_lock()) LBUG
>
>
> After a reboot and e2fsck of the MDT, we saw many errors that look
> like this:
>
>
> Sep 14 12:40:35 lfs-us-mds kernel: LustreError:
> 3307:0:(ldlm_lib.c:1751:check_for_next_transno()) lfs3-MDT0000: waking
> for gap in transno, VBR is OFF (skip: 66055727404, ql: 15, comp: 172,
> conn: 187, next: 66055727409, last_committed: 66055720625)
>
>
> The output from the e2fsck is attached, but the quick summary is
> that it was entirely QUOTA WARNINGs. Shortly after recovery
> completed, the machine panicked again (trace attached).
>
> After a reboot, it appears that the system has stabilized, but
> as you might expect, it does not leave me with a warm, fuzzy
> feeling.
>
> Some of my searching indicates that an lfsck may be needed, but
> before I start too far down that path, I'd like to have some idea
> that this is indeed reasonable.
>
> We are using 2.5.3 on all servers and clients:
>
> lfs-us-mds ~ # cat /proc/fs/lustre/version
> lustre: 2.5.3
> kernel: patchless_client
> build:  2.5.3-RC1--PRISTINE-2.6.32-431.23.3.el6_lustre.x86_64
>
>
> OS is Scientific Linux 6.4:
>
> Linux lfs-us-mds 2.6.32-431.23.3.el6_lustre.x86_64 #1 SMP Thu Aug 28
> 20:20:13 PDT 2014 x86_64 x86_64 x86_64 GNU/Linux
>
>
> Any advice?
>
> Thanks much, in advance!
>
> Best,
>
> ---Steve
>
> --
> Steve Barnet
> UW-Madison - IceCube
> barnet at icecube.wisc.edu
>
>
>


