[Lustre-discuss] stuck OSS node

Fri Aug 5 02:01:47 PDT 2011

Hi Craig,

> Has anyone seen anything like this?

Yes, we have hit a similar problem a couple of times.

First, try to umount all OSTs on the affected OSS.

Some OSTs will most likely fail to umount (umount hangs because of the ll_ost_io_?? thread).
Note which OSTs are 'broken' and, once the 'good' OSTs have finished umounting, kill the OSS (echo b > /proc/sysrq-trigger). Be aware that this reboots the node immediately, without syncing or unmounting anything.

Afterwards, run a simple 'e2fsck -f -p' on the bad OSTs. It should complain about corrupted directories and other nice things. If it doesn't, upgrade to the latest e2fsck from Whamcloud.
(A few months ago we had a corruption that the 1.8.4-sun e2fsprogs could neither detect nor fix.)
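
A rough sketch of the steps above, in case it helps. It only prints the commands and does not run them. I am assuming that the OSTs show up with filesystem type 'lustre' in /proc/mounts; the function names and the device names in the usage comments (/dev/sdb, /dev/sdc) are made up for illustration, so adjust them to your setup.

```shell
#!/bin/sh
# Sketch only: these helpers print the commands, they never execute them.

# Print one umount command per mounted Lustre target.
# Input: /proc/mounts-style lines on stdin.
# Assumption: OSTs are listed with filesystem type "lustre".
print_umount_plan() {
    awk '$3 == "lustre" { print "umount " $2 }'
}

# Print the e2fsck command for each device given as an argument,
# i.e. the devices whose umount got stuck before the reboot.
print_fsck_plan() {
    for dev in "$@"; do
        echo "e2fsck -f -p $dev"
    done
}

# Usage (device names are hypothetical):
#   print_umount_plan < /proc/mounts
#   ... run the printed umounts, note which ones hang, reboot the OSS ...
#   print_fsck_plan /dev/sdb /dev/sdc
```

Running the umounts one at a time by hand makes it easier to see which OST hangs; a hung umount usually cannot be killed, so use a separate shell for each.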

> This is a recent phenomena - we are not 
> sure, but we think it may be related to a particular workload.  Our o2ib 
> clients don't seem to have any trouble.

I don't think this issue is related to the network; it's probably just 'bad luck' that only the tcp clients hit the corrupted directories.

Regards,
 Adrian