[Lustre-discuss] More: OSS crashes

Thu Jul 31 15:32:29 PDT 2008

On Jul 31, 2008  20:45 +0200, Thomas Roth wrote:
> I'm still successful in bringing my OSSs to a standstill if not crashing 
> them.
> Having reduced the number of stress jobs writing to Lustre (stress -d 2 
> --hdd-noclean --hdd-bytes 5M) to four, and having reduced the number of 
> OSS threads (options ost oss_num_threads=256 in /etc/modprobe.d/lustre), 
> the OSSs no longer freeze entirely. Instead, after ~15 hours,
> - all stress jobs have terminated with Input/output error
> - the MDT has marked the affected OSTs as Inactive
> - the already open connections to the OSS remain active
> - interactive collectl, "watch df", top sessions are still working
> - the number of ll_ost threads is 256 (number of ll_ost_io is 257?)
> - log file writing has obviously stopped after only 10 hours
> - already open shells  allow commands like "ps", I can kill some processes
> - new ssh login doesn't work
> - access to disk, as in "ls", brings the system to total freeze
> 
> The process table shows six ll_ost_io threads, each using 38.9% CPU 
> and each running for 419:21m. All the rest are sleeping.
> The cause can't be system overloading or simple faulty hardware.

You need to dump the process table (sysrq-t) and get the stacks of
the running and blocked Lustre processes.  The memory information
(sysrq-m) would also be useful, to see whether the node is out of
free memory and, if so, where it has gone.
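
If the console keyboard is not reachable, the same dumps can usually be
requested by writing the command character to /proc/sysrq-trigger.  A
minimal Python sketch, assuming a Linux node and a shell that is already
running as root; the helper name trigger_sysrq is only illustrative:

```python
# Sketch only: request SysRq dumps through the proc interface.
# Assumes Linux and root privileges; trigger_sysrq is a made-up helper name.

SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def trigger_sysrq(key, path=SYSRQ_TRIGGER):
    """Write one SysRq command character (e.g. 't' or 'm') to the trigger file."""
    if len(key) != 1:
        raise ValueError("SysRq key must be a single character")
    with open(path, "w") as trigger:
        trigger.write(key)


# Usage on the affected OSS (as root):
#   trigger_sysrq("t")   # task list with stack traces
#   trigger_sysrq("m")   # memory usage summary
```

The output goes to the kernel log, not to the calling process.  Since
log file writing has stopped on this node, it is probably safer to read
it from dmesg in an open shell or from a serial console than to rely on
syslog.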

If you can still run some commands, then "cat /proc/slabinfo" may
also be useful.
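
To make the slabinfo output easier to read, the caches can be ranked by
approximate size.  A rough Python sketch, assuming the version 2.x
column layout (name, active_objs, num_objs, objsize, ...); the function
names are illustrative, and num_objs * objsize only approximates the
memory actually held by each cache:

```python
# Sketch only: rank slab caches by approximate memory use.
# Assumes the slabinfo 2.x layout: name active_objs num_objs objsize ...


def parse_slabinfo(text):
    """Return a list of dicts, one per slab cache found in the given text."""
    caches = []
    for line in text.splitlines():
        # Skip blank lines, the version banner and the column header.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            active_objs, num_objs, objsize = (int(f) for f in fields[1:4])
        except ValueError:
            continue  # not a data line in the expected layout
        caches.append({
            "name": fields[0],
            "active_objs": active_objs,
            "num_objs": num_objs,
            "objsize": objsize,
            # Approximation: ignores per-slab overhead and padding.
            "approx_bytes": num_objs * objsize,
        })
    return caches


def top_slab_caches(text, n=10):
    """Return the n caches with the largest approximate size."""
    ranked = sorted(parse_slabinfo(text),
                    key=lambda c: c["approx_bytes"], reverse=True)
    return ranked[:n]


# Usage on the affected OSS:
#   with open("/proc/slabinfo") as f:
#       for c in top_slab_caches(f.read()):
#           print(c["name"], c["approx_bytes"] // 1024, "KiB")
```

A cache that dominates this list while sysrq-m shows little free memory
would be a good place to start looking.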

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.