[Lustre-discuss] stuck OSS node
Adrian Ulrich
adrian at blinkenlights.ch
Fri Aug 5 02:01:47 PDT 2011
Hi Craig,
> Has anyone seen anything like this?
Yes, we have had a similar problem a couple of times.
First, try to umount all OSTs on the affected OSS.
Some OSTs will most likely fail to umount (umount gets stuck because of the ll_ost_io_?? thread).
Note the 'broken' OSTs and kill the OSS (echo b > /proc/sysrq-trigger) after the 'good' OSTs have finished umounting.
Afterwards, run a simple 'e2fsck -f -p' on the bad OSTs. It should complain about corrupted directories and other nice things. If it doesn't, upgrade to the latest e2fsck from Whamcloud.
(We had a corruption a few months ago that the 1.8.4-sun e2fsprogs either did not detect or could not fix.)
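For reference, the steps above could be sketched roughly as the shell helpers below. This is only an illustration, not a tested recovery tool: the function names (lustre_targets, umount_plan, fsck_plan) are made up for this example, and it assumes that the OSTs show up with filesystem type "lustre" in /proc/mounts. The helpers only list targets and print commands; they do not unmount, reboot, or repair anything themselves.

```shell
#!/bin/sh
# Sketch only: these helpers list targets and print commands to run by hand.

# Print "device mountpoint" for each Lustre target in a mounts table.
# Assumption: OSTs appear with filesystem type "lustre" in /proc/mounts.
# An alternative table can be passed as $1 (hypothetical convenience).
lustre_targets() {
    awk '$3 == "lustre" { print $1, $2 }' "${1:-/proc/mounts}"
}

# Print one umount command per Lustre target found in the mounts table.
umount_plan() {
    lustre_targets "$1" | while read -r dev mnt; do
        echo "umount $mnt"
    done
}

# Print the e2fsck command for each device given as an argument,
# i.e. the devices whose umount got stuck (noted before the reboot).
fsck_plan() {
    for dev in "$@"; do
        echo "e2fsck -f -p $dev"
    done
}
```

Run umount_plan first and issue the printed commands. Once the 'good' OSTs are gone, whatever lustre_targets still lists are the 'broken' ones: write down their devices, reset the node with the sysrq trigger, and pass those devices to fsck_plan after it comes back up.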
> This is a recent phenomena - we are not
> sure, but we think it may be related to a particular workload. Our o2ib
> clients don't seem to have any trouble.
I don't think this issue is related to the network. It's probably just 'bad luck' that only the tcp clients hit the corrupted directories.
Regards,
Adrian