[Lustre-discuss] OSS crashes
Thomas Roth
t.roth at gsi.de
Thu Jul 24 09:24:11 PDT 2008
Well, guess what - I did that. These OSS are all running collectl
already ;-)
At first, in addition to the collectl daemon, I had an xterm running
"collectl -sL -od", which at least gave me the point in time when the
machine stopped. It didn't make me any wiser, though. At least there was
no Lustre write activity any more at the time of the crash.
On the last try I added 'c' and 'n' and found that in the last minute
the load average had risen to about 78. There was no Lustre or network
activity, though. That's a bit much, but it would be a pity if that were
sufficient to crash a server:
### RECORD 27098 >>> lxfs89 <<< (1216869685.007) (Thu Jul 24 05:21:25 2008) ###
# CPU SUMMARY (INTR, CTXSW & PROC /sec)
#              USER NICE SYS WAIT IRQ SOFT STEAL IDLE INTR CTXSW PROC RUNQ RUN  AVG1  AVG5 AVG15
07/24 05:21:25    0    0  37   10   0    0     0   51  462   281    0  538  16 79.11 78.64 76.93
# LUSTRE FILESYSTEM SINGLE OST STATISTICS
#              Ost     KBRead Reads KBWrite Writes
07/24 05:21:25 OST0015      0     0       0      0
07/24 05:21:25 OST0016      0     0       0      0
# NETWORK STATISTICS (/sec)
#              Num Name  KBIn PktIn SizeIn MultI CmpI ErrIn KBOut PktOut SizeO CmpO ErrOut
07/24 05:21:25   0 lo:      0     0      0     0    0     0     0      0     0    0      0
07/24 05:21:25   1 eth0:   28   385     76     0    0     0     6     29   230    0      0
07/24 05:21:25   2 eth1:    0     0      0     0    0     0     0      0     0    0      0
07/24 05:21:25   3 eth2:    0     0      0     0    0     0     0      0     0    0      0
Then I actually started reading man pages and was able to extract some
info from the collectl logfile. -scnL told me that writing to the disk
(to the log) stopped 5 hours before the last activity I'd seen on the
xterm. The load average was about 37 at that moment, and there were
still packets coming in and data being written to the two OSTs:
### RECORD 126 >>> lxfs89 <<< (1216851732.479) (Thu Jul 24 00:22:12 2008) ###
# CPU SUMMARY (INTR, CTXSW & PROC /sec)
# USER NICE SYS WAIT IRQ SOFT STEAL IDLE INTR CTXSW PROC RUNQ RUN  AVG1  AVG5 AVG15
     0    0   4   19   0    0     0   75  729  3302    0  471   9 36.87 35.96 34.36
# LUSTRE FILESYSTEM SINGLE OST STATISTICS
# Ost     KBRead Reads KBWrite Writes
  OST0015      0     0    2751      2
  OST0016      0     0    2955      2
# NETWORK STATISTICS (/sec)
# Num Name  KBIn PktIn SizeIn MultI CmpI ErrIn KBOut PktOut SizeO CmpO ErrOut
    0 lo:      0     0      0     0    0     0     0      0     0    0      0
    1 eth0: 5139  3559   1478     0    0     0   116   1475    81    0      0
    2 eth1:    0     0      0     0    0     0     0      0     0    0      0
    3 eth2:    0     0      0     0    0     0     0      0     0    0      0
However, I have not yet gotten any further in learning collectl's
capabilities or how to interpret its output.
In another xterm window I had htop running, though. It stopped with
three processes at 100% CPU at the top, ll_ost_io_42, ll_ost_io_59 and
ll_ost_io_01, each of which had been running for 4h 58m. That fits the
5 hour gap mentioned earlier.
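As a rough cross-check, the gap can be computed from the epoch
timestamps in the two record headers quoted above. This is only a
minimal Python sketch and not part of collectl; the regular expression
and the helper names parse_header and gap_between are assumptions,
written just to match the header layout shown in this mail.

```python
import re
from datetime import timedelta

# Record headers copied from the two collectl excerpts in this mail
# (wrapped lines joined).
HEADERS = [
    "### RECORD 126 >>> lxfs89 <<< (1216851732.479) (Thu Jul 24 00:22:12 2008) ###",
    "### RECORD 27098 >>> lxfs89 <<< (1216869685.007) (Thu Jul 24 05:21:25 2008) ###",
]

# Assumed pattern: record number, host name, epoch seconds in parentheses.
HEADER_RE = re.compile(r"### RECORD\s+(\d+) >>> (\S+) <<< \((\d+\.\d+)\)")


def parse_header(line):
    """Return (record_number, host, epoch_seconds) from a record header."""
    m = HEADER_RE.match(line)
    if m is None:
        raise ValueError(f"not a record header: {line!r}")
    return int(m.group(1)), m.group(2), float(m.group(3))


def gap_between(first, last):
    """Time between two record headers, truncated to whole seconds."""
    t0 = parse_header(first)[2]
    t1 = parse_header(last)[2]
    return timedelta(seconds=int(t1 - t0))


if __name__ == "__main__":
    print(gap_between(*HEADERS))  # 4:59:12
```

That gives 4:59:12 between the last record in the logfile and the last
record seen on the xterm, which is consistent with the 4h 58m runtime
htop showed for the three ll_ost_io processes.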
Still, I don't have a clue as to what actually causes this behavior or
how to avoid it.
On the next crash I'll try to get a stack trace, and logging the console
to more than the xterm buffer is surely something we ought to do as well.
Thanks for your advice,
Thomas
Mark Seger wrote:
>>> Where else could I look for overloaded hardware capacities?
>>>
>> Not sure. That's quite hardware specific.
>>
> you could run collectl and then after you reset the system log back in
> and look at what was happening right before you did the reset. this
> will let you look at cpu, interrupts, memory, network and a variety of
> other things including lustre level stats such as I/O rates and even rpc
> stats. you'll also be able to see what processes were running in a
> similar format to ps or you can just play back the data with the --top
> switch. if you feel 10 second samples aren't frequent enough you can
> always set your interval down to 1 second or even lower...
> -mark
>
--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453 Fax: +49-6159-71 2986
Gesellschaft für Schwerionenforschung mbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de
Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528
Geschäftsführer: Professor Dr. Horst Stöcker
Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt