[lustre-discuss] proper procedure after MDT kernel panic

Thu Aug 11 02:42:56 PDT 2016

Hi all,
Our MDT suffered a kernel panic (which I will post separately). The OSSs
stayed alive, but the MDT was out for some time while nodes still tried to
interact with Lustre.

So I have several questions:
a. What happens to processes that are reading or writing during such an
event (for instance, does it make a difference if they already have handles
on the OSS)? I noticed several of our compute nodes ended up filling their
swap/RAM, so I assume some level of caching is happening until the MDT
returns.

b. What is the best/proper procedure now to ensure filesystem integrity?
Should I take the filesystem offline and run an lfsck, first on the MDT and
then on the OSSs?

Most documents I can find with Google on the subject are spread over
various old wikis, so it is not clear to me how relevant they still are.
Thanks,
Eli

Specs:
Server OS: CentOS 6.4 + Lustre 2.5.3 from RPMs (1 MGS/MDS + 3 OSS)
Clients: Debian testing/unstable, kernel 4.2.8 + Lustre 2.8.0 built from
source.
Network: InfiniBand FDR (o2ib)