[Lustre-discuss] Hung OSS nodes
Kit Westneat
kwestneat at datadirectnet.com
Tue Apr 22 19:33:24 PDT 2008
Hello lustre-discuss,
Though I'm beginning to believe it's not particularly a Lustre problem,
I thought I'd ask here anyway. We have a set of 4 Dell 2950 OSSes,
running Lustre 1.6.4.3, that seem to hang at random. The NMI watchdog
panics in seemingly random processes, ranging from keyboard to Lustre
to heartbeat processes. When I turn off the NMI watchdog, the node just
sits there, answering pings but doing nothing else. The clients are all
running 1.6.4.3, we have upgraded the firmware to the latest version,
and Dell diagnostics come back clean.
It /tends/ to happen at the beginning and end of an IO run, but we once
let a node just sit idle with some client mounts and it still hung. Our
plan of action right now is to run memtest86 and see if perhaps there is
some flaky memory, but the fact that the hangs occur across all the
nodes leaves me pessimistic that bad memory is the cause.
I was wondering if anyone else has run into a similar problem, or has
any advice on how to proceed.
Thanks,
Kit
--
---
Kit Westneat
kwestneat at datadirectnet.com
812-484-8485