[Lustre-discuss] More: OSS crashes

Thomas Roth t.roth at gsi.de
Thu Jul 31 11:45:27 PDT 2008


Hi all,

I'm still successful in bringing my OSSs to a standstill, if not crashing
them.
Having reduced the number of stress jobs writing to Lustre (stress -d 2 
--hdd-noclean --hdd-bytes 5M) to four, and having reduced the number of 
OSS threads (options ost oss_num_threads=256 in /etc/modprobe.d/lustre), 
the OSSs no longer freeze entirely. Instead, after ~15 hours (setup and 
checks sketched after the list below):
- all stress jobs have terminated with Input/output error
- the MDT has marked the affected OSTs as Inactive
- the already open connections to the OSS remain active
- interactive collectl, "watch df", and top sessions are still working
- the number of ll_ost threads is 256 (the number of ll_ost_io threads is 257?)
- log file writing has evidently stopped after only 10 hours
- already open shells still allow commands like "ps"; I can kill some processes
- new ssh logins don't work
- any access to disk, as in "ls", brings the system to a total freeze
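
For completeness, the setup and the checks behind the numbers above look 
roughly like this on my nodes -- the grep patterns and reading the OST 
state via "lctl dl" on the MDS are just my own shorthand, so take it as a 
sketch rather than the canonical way:

  # /etc/modprobe.d/lustre -- cap the number of OST service threads
  options ost oss_num_threads=256

  # four of these in parallel, writing into the Lustre mount:
  stress -d 2 --hdd-noclean --hdd-bytes 5M

  # count the OST service threads on the OSS:
  ps -e | grep -c "ll_ost_io"                   # I/O service threads
  ps -e | grep "ll_ost" | grep -vc "ll_ost_io"  # plain ll_ost threads

  # on the MDS, list the devices and their state to see which OSTs
  # it still considers active:
  lctl dl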

The process table shows six ll_ost_io threads, all using 38.9% CPU and 
all running for 419:21m; all the rest are sleeping.
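
I get that picture with plain ps/top; something like the following, where 
the output columns are simply the ones I find useful:

  # show the ll_ost_io threads with their state, %CPU and accumulated runtime
  ps -eo pid,stat,pcpu,time,comm | grep ll_ost_io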
The cause can't be system overload or simply faulty hardware. To give an 
impression of what is going on, I'm quoting the last collectl record:

##########################################################################################
### RECORD  139  (1217475195.342) (Thu Jul 31 05:33:15 2008) ###

# CPU SUMMARY (INTR, CTXSW & PROC /sec)
# USER  NICE   SYS  WAIT   IRQ  SOFT STEAL  IDLE  INTR  CTXSW  PROC  RUNQ   RUN  AVG1  AVG5 AVG15
     0     0    14    20     0     5     0    58  4255    53K     1   736     6 22.06 31.28 31.13

# DISK SUMMARY (/sec)
#KBRead RMerged  Reads SizeKB  KBWrite WMerged Writes SizeKB
      0       0      0      0    83740     314    861     97

# LUSTRE FILESYSTEM SINGLE OST STATISTICS
#Ost              KBRead   Reads    KBWrite  Writes
OST0004            0           0          40674      63
OST0005            0           0          40858      66
##########################################################################################


That's not too much for the machine, I'd reckon. And as mentioned in an 
earlier post, I have run the very same 'stress' test, also with CPU load 
or I/O load only, locally on machines that had crashed earlier 
(invocations sketched below). The test runs that wrote to disk finished 
only when the disks were 100% full (for those runs the disks were 
formatted plain ext3); the tests with I/O load = 500 and CPU load = 1k 
have been running for three days now. Of course I don't know how 
reliable these tests are.
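
For reference, the local runs were started roughly like this; mapping 
"I/O load = 500" and "CPU load = 1k" onto the stress switches below is my 
reading of my own notes, so treat it as approximate:

  # disk writers, same invocation as against Lustre, but on a local ext3 partition
  stress -d 2 --hdd-noclean --hdd-bytes 5M

  # I/O load only: 500 workers calling sync() in a loop
  stress --io 500

  # CPU load only: 1000 workers spinning on sqrt()
  stress --cpu 1000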

It looks to me as if a few Lustre threads for some reason can't process 
their I/O any more, building up pressure until finally all (disk) I/O is 
blocked.
Knowing the reason and how to avoid it would not only relieve these 
servers of some pressure... ;-)
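
Next time it gets into this state I'll try to capture stack traces of the 
stuck threads before disk access kills the box completely -- roughly like 
this, assuming sysrq is enabled and the console still reacts (the file 
name is just an example):

  # dump stack traces of all tasks in uninterruptible sleep (D state)
  # into the kernel log / console:
  echo w > /proc/sysrq-trigger

  # save the Lustre kernel debug buffer for later inspection:
  lctl dk /tmp/lustre-debug.log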

Hm, hardware: the cluster is running Debian Etch, kernel 2.6.22, Lustre 
1.6.5. The OSSs are Supermicro X7DB8 file servers (Xeon E5320, 8 GB RAM) 
with 16 internal disks on two 3ware 9650 RAID controllers, forming two 
OSTs each.

Many thanks for any further hints,
Thomas



