[Lustre-discuss] Hung OSS nodes

Kit Westneat kwestneat at datadirectnet.com
Tue Apr 22 19:33:24 PDT 2008


Hello lustre-discuss,

Though I'm beginning to believe it's not specifically a Lustre problem,
I thought I'd ask here anyway. We have a set of 4 Dell 2950 OSSes
running Lustre 1.6.4.3 that hang at seemingly random times. The NMI
watchdog panics while seemingly unrelated processes are running, ranging
from keyboard to Lustre to heartbeat processes. When I turn off the NMI
watchdog, the node just sits there, answering pings but nothing else. The
clients are all running 1.6.4.3 as well, we have upgraded the firmware to
the latest version, and Dell diagnostics come back clean.

It /tends/ to happen at the beginning and end of an IO run, but we once
let a node sit idle with some clients mounted and it still hung. Our plan
of action right now is to run memtest86 and see if perhaps there is some
flaky memory, but the fact that the hangs occur across all the nodes
leaves me pessimistic.

I was wondering if anyone else has run into a similar problem, or has
any advice on how to proceed.

Thanks,
Kit

-- 
---
Kit Westneat
kwestneat at datadirectnet.com
812-484-8485



