[Lustre-discuss] OSS crashes

Thomas Roth t.roth at gsi.de
Thu Jul 24 09:24:11 PDT 2008


Well, guess what - I did that. These OSS are all running collectl 
already ;-)
At first I had, in addition to the collectl daemon, an xterm running 
"collectl -sL -od", which at least gave me the point in time when the 
machine stopped. That didn't make me any wiser; at least there was no 
Lustre write activity any more at the time of the crash.
On the last try I added 'c' and 'n' and found that in the last minute 
the CPU load had risen to 78, with no Lustre or network activity. 
That's a bit much, but it would be a pity if it were sufficient to 
crash a server:

### RECORD 27098 >>> lxfs89 <<< (1216869685.007) (Thu Jul 24 05:21:25 2008) ###

#                CPU SUMMARY (INTR, CTXSW & PROC /sec)
#                USER  NICE   SYS  WAIT   IRQ  SOFT STEAL  IDLE  INTR  CTXSW  PROC  RUNQ   RUN   AVG1  AVG5 AVG15
07/24 05:21:25      0     0    37    10     0     0     0    51   462    281     0   538    16  79.11 78.64 76.93

# LUSTRE FILESYSTEM SINGLE OST STATISTICS
#              Ost            KBRead   Reads  KBWrite  Writes
07/24 05:21:25 OST0015             0       0        0       0
07/24 05:21:25 OST0016             0       0        0       0

# NETWORK STATISTICS (/sec)
#               Num    Name   KBIn  PktIn SizeIn  MultI   CmpI  ErrIn  KBOut PktOut  SizeO   CmpO ErrOut
07/24 05:21:25    0     lo:      0      0      0      0      0      0      0      0      0      0      0
07/24 05:21:25    1   eth0:     28    385     76      0      0      0      6     29    230      0      0
07/24 05:21:25    2   eth1:      0      0      0      0      0      0      0      0      0      0      0
07/24 05:21:25    3   eth2:      0      0      0      0      0      0      0      0      0      0      0


Then I actually started reading man pages and was able to extract some 
info from the collectl logfile. Playing it back with -scnL told me that 
writing to the disk (i.e. to the log) stopped 5 hours before the last 
activity I'd seen on the xterm. The CPU load was 37 at that moment, and 
there were still packets coming in and being written to the two OSTs:

### RECORD  126 >>> lxfs89 <<< (1216851732.479) (Thu Jul 24 00:22:12 2008) ###

# CPU SUMMARY (INTR, CTXSW & PROC /sec)
# USER  NICE   SYS  WAIT   IRQ  SOFT STEAL  IDLE  INTR  CTXSW  PROC  RUNQ   RUN   AVG1  AVG5 AVG15
     0     0     4    19     0     0     0    75   729   3302     0   471     9  36.87 35.96 34.36

# LUSTRE FILESYSTEM SINGLE OST STATISTICS
#Ost          KBRead   Reads  KBWrite  Writes
OST0015            0       0     2751       2
OST0016            0       0     2955       2

# NETWORK STATISTICS (/sec)
#Num    Name   KBIn  PktIn SizeIn  MultI   CmpI  ErrIn  KBOut PktOut  SizeO   CmpO ErrOut
    0     lo:      0      0      0      0      0      0      0      0      0      0      0
    1   eth0:   5139   3559   1478      0      0      0    116   1475     81      0      0
    2   eth1:      0      0      0      0      0      0      0      0      0      0      0
    3   eth2:      0      0      0      0      0      0      0      0      0      0      0


However, I have not yet gotten much further in learning collectl's 
capabilities or in interpreting its output.

In another xterm window I had htop running, though. It stopped with 
three processes at 100% CPU on top, ll_ost_io_42, ll_ost_io_59 and 
ll_ost_io_01, each of which had been running for 4h 58m. That fits the 
5-hour gap mentioned earlier.
Still, I don't have a clue as to what actually causes this behavior or 
how to avoid it.

On the next crash I'll try to get a stack trace, and logging the console 
to something more durable than the xterm buffer is surely something we 
ought to do as well.
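Two standard ways to do both, sketched here (assuming a Linux kernel 
with magic SysRq support; the netconsole source interface and target 
address below are placeholders, not values from this setup):

```shell
# Enable magic SysRq, then dump stack traces of all tasks to the
# kernel log while the machine is wedged (Alt-SysRq-t also works
# on the console):
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
dmesg > /tmp/task-traces.txt

# Send console output somewhere that survives the crash, e.g. via
# netconsole to a remote syslog host (addresses are placeholders):
modprobe netconsole netconsole=@/eth0,6666@192.168.1.1/
```

A serial console logged on another machine achieves the same end if the 
hardware has one.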

Thanks for your advice,
Thomas


Mark Seger wrote:
>>> Where else could I look for overloaded hardware capacities?
>>>     
>> Not sure.  That's quite hardware specific.
>>   
> you could run collectl and then after you reset the system log back in 
> and look at what was happening right before you did the reset.  this 
> will let you look at cpu, interrupts, memory, network and a variety of 
> other things including lustre level stats such as I/O rates and even rpc 
> stats.  you'll also be able to see what processes were running in a 
> similar format to ps or you can just play back the data with the --top 
> switch.  if you feel 10 second samples aren't frequent enough you can 
> always set your interval down to 1 second or even lower...
> -mark
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

-- 
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

Gesellschaft für Schwerionenforschung mbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschäftsführer: Professor Dr. Horst Stöcker

Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt


