[lustre-discuss] Corrupted? MDT not mounting

Andrew Elwell andrew.elwell at gmail.com
Tue May 10 15:44:05 PDT 2022


On Wed, 11 May 2022 at 04:37, Laura Hild <lsh at jlab.org> wrote:
> The non-dummy SRP module is in the kmod-srp package, which isn't included in the Lustre repository...

Thanks Laura,
Yeah, I realised that earlier in the week, and have rebuilt the srp
module from source via mlnxofedinstall, and sure enough installing
srp-4.9-OFED.4.9.4.1.6.1.kver.3.10.0_1160.49.1.el7_lustre.x86_64.x86_64.rpm
(gotta love those short names) gives me working srp again.
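
If you're checking the same thing, plain old modinfo/lsmod is enough
to confirm you've got the OFED srp module rather than the in-kernel
dummy (exact paths and versions will obviously differ on your build):

    # which ib_srp is on disk, and is it actually loaded?
    modinfo ib_srp | egrep '^(filename|version)'
    lsmod | grep ib_srp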

Hat tip to a DDN contact here (we owe him even more beers now) for
some extra tuning parameters:
    options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048 allow_ext_sg=1 ch_count=1 use_imm_data=0
but I'm pleased to say that it _seems_ to be working much better. I'd
done one half of the HA pairs earlier in the week, lfsck completed, a
full robinhood scan done (dropped the DB and rescanned from fresh),
and I'm just bringing the other half of the pairs up to the same
software stack now.
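
For anyone wanting to try the same, those options just live in a
modprobe.d snippet (the filename below is only an example) and after
reloading ib_srp you can confirm what the kernel actually picked up
via sysfs:

    # e.g. /etc/modprobe.d/ib_srp.conf, then unload/reload ib_srp (or reboot)
    options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048 allow_ext_sg=1 ch_count=1 use_imm_data=0

    # afterwards, check the values the module is actually using
    grep . /sys/module/ib_srp/parameters/*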

A couple of pointers for anyone caught in the same boat - things we
apparently did correctly:
* upgrade your e2fsprogs to the latest - if you're fsck'ing disks,
make sure you're not introducing more problems with a buggy old e2fsck
* tunefs.lustre --writeconf isn't too destructive (see the warnings -
you'll lose pool info, but in our case that wasn't critical; there's a
rough sketch of the procedure below)
* monitoring is good, but tbh the rate of change, and the fact that it
happened out of hours, means we likely couldn't have intervened
* so quotas are better.
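
On the writeconf point, the rough shape of the procedure - device
paths and the combined MGS/MDT layout below are just placeholders for
your own setup, and the config-log regeneration section of the Lustre
manual is the thing to read before copying any of it:

    # everything unmounted first: clients and all targets
    umount /mnt/lustre                 # on every client
    umount /mnt/mdt0                   # on the MDS (MGS/MDT combined here)
    umount /mnt/ost0                   # on every OSS, per OST

    # regenerate the config logs - MDT (with the MGS) first, then each OST
    tunefs.lustre --writeconf /dev/mapper/mdt0
    tunefs.lustre --writeconf /dev/mapper/ost0

    # bring it back in order: MGS/MDT, then OSTs, then clients
    mount -t lustre /dev/mapper/mdt0 /mnt/mdt0
    mount -t lustre /dev/mapper/ost0 /mnt/ost0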

Thanks to those who replied on and off-list - I'm just grateful we
only had the pair of MDTs, not the 40 (!!!) that Origin's getting
(yeah, I was watching the LUG talk last night) - service isn't quite
back to users but we're getting there!

Andrew

