[Lustre-discuss] stuck OSS node

Thu Aug 4 12:16:44 PDT 2011

Hello,

We've got a problem here we hope someone can help us with.  We have a 
few 1.8.5 OSS nodes which seem to get locked up Lustre-wise on our tcp 
clients from time to time.  This is a recent phenomenon - we are not 
sure, but we think it may be related to a particular workload.  Our o2ib 
clients don't seem to have any trouble.

'lfs df' shows "Resource temporarily unavailable" for all OSTs on the 
affected OSS on all tcp clients when this happens.  When we look on the 
OSS itself we see socknal_sd and ll_ost_io processes consuming cycles:

>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 10954 root      16   0     0    0    0 R 66.6  0.0 515:48.58 socknal_sd02
> 11370 root      16   0     0    0    0 R 64.9  0.0 241:21.83 ll_ost_io_91
> 10959 root      19   0     0    0    0 R 49.7  0.0 111:53.27 socknal_sd07

There are plenty of cycles free on each core of the OSS, though.  We 
also see that plenty of Lustre logs were dumped after service threads 
were inactive for 20 minutes.  I haven't been able to learn much from 
'lctl debug_file' yet.

Further, we can see from 'netstat -t' that the Recv-Q count is 
increasing on the client connections - never decreasing.  Send-Q count 
is zero for all but two clients, where it seems to be a constant 
non-zero value (a few to several hundred K).
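
For anyone who wants to watch for the same Recv-Q pattern, here is a 
rough Python sketch that compares two 'netstat -t' samples and lists 
the connections whose Recv-Q grew.  It is only an illustration: the 
function names and the sample addresses are made up (they are not from 
our OSS), and it assumes the usual Linux net-tools column order of 
Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State.

```python
import subprocess
import time

# Hypothetical sample output, for illustration only (not from the real OSS).
SAMPLE_BEFORE = """\
Proto Recv-Q Send-Q Local Address      Foreign Address    State
tcp     1000      0 192.0.2.1:988      192.0.2.50:1021    ESTABLISHED
tcp        0 300000 192.0.2.1:988      192.0.2.51:1022    ESTABLISHED
"""
SAMPLE_AFTER = """\
Proto Recv-Q Send-Q Local Address      Foreign Address    State
tcp     5000      0 192.0.2.1:988      192.0.2.50:1021    ESTABLISHED
tcp        0 300000 192.0.2.1:988      192.0.2.51:1022    ESTABLISHED
"""


def parse_netstat(text):
    """Map (local, foreign) -> (recv_q, send_q) for each tcp line."""
    queues = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a tcp/tcp6 row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        try:
            recv_q, send_q = int(fields[1]), int(fields[2])
        except ValueError:
            continue
        queues[(fields[3], fields[4])] = (recv_q, send_q)
    return queues


def growing_recv_q(before, after):
    """Return (connection, old, new) for connections whose Recv-Q grew."""
    grown = []
    for conn, (recv_after, _send) in after.items():
        if conn in before and recv_after > before[conn][0]:
            grown.append((conn, before[conn][0], recv_after))
    return sorted(grown)


def sample_live(interval=10):
    """Take two live samples 'interval' seconds apart (assumes netstat is installed)."""
    def run():
        return subprocess.run(["netstat", "-tn"], capture_output=True,
                              text=True, check=True).stdout
    first = parse_netstat(run())
    time.sleep(interval)
    return growing_recv_q(first, parse_netstat(run()))
```

Calling growing_recv_q(parse_netstat(SAMPLE_BEFORE), 
parse_netstat(SAMPLE_AFTER)) on the made-up samples reports only the 
first connection.  On a live node, sample_live() would need to be run 
several times to tell a steadily growing queue from a momentary burst.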

Anyway, it seems like the socknal and/or ll_ost_io_91 processes above 
are just stuck doing nothing productive.  Syslog messages aren't telling 
me why.  Has anyone seen anything like this?

We know that after rebooting the OSS our tcp clients will start working 
again.

Thanks,
Craig Prescott
UF HPC Center