[Lustre-discuss] More: OSS crashes
Andreas Dilger
adilger at sun.com
Thu Jul 31 15:32:29 PDT 2008
On Jul 31, 2008 20:45 +0200, Thomas Roth wrote:
> I'm still successful in bringing my OSSs to a standstill if not crashing
> them.
> Having reduced the number of stress jobs writing to Lustre (stress -d 2
> --hdd-noclean --hdd-bytes 5M) to four, and having reduced the number of
> OSS threads (options ost oss_num_threads=256 in /etc/modprobe.d/lustre),
> the OSSs do not freeze entirely any more. Instead after ~ 15 hours,
> - all stress jobs have terminated with Input/output error
> - the MDT has marked the affected OSTs as Inactive
> - the already open connections to the OSS remain active
> - interactive collectl, "watch df", top sessions are still working
> - the number of ll_ost threads is 256 (number of ll_ost_io is 257?)
> - log file writing has obviously stopped after only 10 hours
> - already open shells allow commands like "ps", I can kill some processes
> - new ssh login doesn't work
> - access to disk, as in "ls", brings the system to total freeze
>
> The process table shows six ll_ost_io threads, all using 38.9% CPU,
> all running for 419:21m. All the rest are sleeping.
> The cause can't be system overloading or simple faulty hardware.
You need to look at the process table (sysrq-t) and get the stacks of
the running and blocked Lustre processes. Also useful would be the
memory information (sysrq-m) to see if the node is out of free memory,
and if so where it has gone.
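As a rough sketch of the sysrq step: on Linux the same dumps can be
requested without a console keyboard by writing the command letter to
/proc/sysrq-trigger, and the output goes to the kernel log. The helper
name trigger_sysrq below is hypothetical, and it assumes root access on
the OSS.

```python
from pathlib import Path

# Standard Linux location of the SysRq trigger file.
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")


def trigger_sysrq(key, trigger=SYSRQ_TRIGGER):
    """Write one SysRq command letter to the trigger file.

    'trigger' is a parameter only so the helper can be tried against an
    ordinary file; on a real node the default path needs root.
    """
    if len(key) != 1:
        raise ValueError("SysRq command must be a single character")
    Path(trigger).write_text(key)


# Intended use on the OSS, as root (not executed here):
#   trigger_sysrq("t")   # dump task states and stacks
#   trigger_sysrq("m")   # dump memory information
```

Since log file writing has already stopped on the node, reading the
result with dmesg or from a serial console is probably more reliable
than waiting for it to appear in syslog.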
If you can still run some commands, then "cat /proc/slabinfo" may
also be useful.
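To see which slab caches hold the most memory, the slabinfo output can
be ranked by size. This is a minimal sketch assuming the slabinfo 2.x
column layout (name, active_objs, num_objs, objsize, ...).
top_slab_caches is a hypothetical helper, and num_objs * objsize only
approximates the memory in use, since it ignores per-slab overhead.

```python
def top_slab_caches(slabinfo_text, n=10):
    """Return the n largest slab caches as (name, approx_bytes) tuples.

    approx_bytes is num_objs * objsize, an estimate that leaves out
    per-slab overhead.
    """
    caches = []
    for line in slabinfo_text.splitlines():
        # Skip blank lines, the version line, and the column header.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        name = fields[0]
        num_objs = int(fields[2])   # total objects allocated in the cache
        objsize = int(fields[3])    # size of one object in bytes
        caches.append((name, num_objs * objsize))
    # Largest consumers first.
    caches.sort(key=lambda item: item[1], reverse=True)
    return caches[:n]


# Intended use on the OSS (reading the file may require root):
#   with open("/proc/slabinfo") as f:
#       for name, size in top_slab_caches(f.read()):
#           print(f"{name:30s} {size / 2**20:10.1f} MiB")
```

A cache that dominates this list while sysrq-m reports little free
memory would be a reasonable first place to look.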
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.