[Lustre-discuss] Frequent OSS Crashes with heavy load
wanglu
wanglu at ihep.ac.cn
Sun Nov 9 22:50:15 PST 2008
Dear list,
Our Lustre system has been crashing frequently lately, with a high load average on the OSS.
1) # top
top - 14:32:57 up 18:15, 1 user, load average: 25.05, 24.27, 24.47
Mem: 8307364k total, 859724k used, 7447640k free, 234288k buffers
Swap: 16386292k total, 0k used, 16386292k free, 37932k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26695 root 15 0 0 0 0 S 7.6 0.0 51:57.40 socknal_sd04
26694 root 15 0 0 0 0 S 6.6 0.0 53:44.42 socknal_sd03
26691 root 15 0 0 0 0 S 5.6 0.0 51:11.76 socknal_sd00
26697 root 15 0 0 0 0 S 5.3 0.0 42:12.23 socknal_sd06
26696 root 15 0 0 0 0 S 3.3 0.0 52:47.42 socknal_sd05
26692 root 15 0 0 0 0 S 2.3 0.0 26:19.46 socknal_sd01
26693 root 15 0 0 0 0 S 2.3 0.0 32:38.21 socknal_sd02
26952 root 15 0 0 0 0 S 1.0 0.0 2:06.69 ll_ost_io_09
....
2) iostat -x 5
Linux 2.6.9-67.0.7.EL_lustre.1.6.5smp (boss01.ihep.ac.cn) 11/10/2008
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.00 11.33 4.56 84.10
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
cciss/c0d0 1.05 0.43 0.27 0.41 9.78 6.65 4.89 3.32 24.31 0.01 17.15 5.78 0.39
sda 3.46 0.64 1297.05 0.60 2588.12 118.12 1294.06 59.06 2.09 22.81 15.69 0.77 99.57
sdb 3.09 0.28 1274.46 0.18 1541.21 23.54 770.60 11.77 1.23 16.75 12.16 0.78 99.56
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.00 11.53 0.10 88.38
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
cciss/c0d0 0.00 1.80 0.00 0.00 0.00 16.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00
sda 3.20 0.00 1436.60 0.00 130524.80 0.00 65262.40 0.00 90.86 16.29 10.73 0.70 100.00
sdb 3.40 0.00 1142.20 0.00 124113.60 0.00 62056.80 0.00 108.66 10.44 8.24 0.87 99.80
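As a rough check on the iostat figures above, the average request size can be worked out from the sda rows. This is only a sketch: the helper name avg_request_kb is made up for illustration, and it assumes iostat counts rsec/s and wsec/s in 512-byte sectors, which agrees with the rkB/s column being half of rsec/s. Note that the first iostat report is an average since boot, while the second covers the 5-second interval.

```python
# Hypothetical helper: average request size in KB from iostat -x columns.
SECTOR_BYTES = 512  # assumption: iostat reports sectors of 512 bytes


def avg_request_kb(r_s, w_s, rsec_s, wsec_s):
    """Average size of one request in KB, or 0.0 if the device is idle."""
    requests = r_s + w_s
    if requests == 0:
        return 0.0
    sectors_per_request = (rsec_s + wsec_s) / requests
    return sectors_per_request * SECTOR_BYTES / 1024


# sda values copied from the two samples above
first = avg_request_kb(1297.05, 0.60, 2588.12, 118.12)
second = avg_request_kb(1436.60, 0.00, 130524.80, 0.00)
print(f"sda first sample:  {first:.2f} KB per request")
print(f"sda second sample: {second:.2f} KB per request")
```

The sectors-per-request value computed here is the same quantity iostat prints as avgrq-sz, so the result can be compared against that column directly.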
Before each crash, LustreError messages like the following appear in the log:
Nov 9 17:25:41 boss01 kernel: LustreError: 27327:0:(ost_handler.c:868:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s req at e3df8e00 x133017/t0 o3->73c15254-a884-578e-9634-859b44619a4f at NET_0x20000c0a83446_UUID:0/0 lens 400/336 e 0 to 0 dl 1226222741 ref 1 fl Interpret:/0/0 rc 0/0
Nov 9 17:25:41 boss01 kernel: Lustre: 27327:0:(ost_handler.c:925:ost_brw_read()) besfs-OST0005: ignoring bulk IO comm error with 73c15254-a884-578e-9634-859b44619a4f at NET_0x20000c0a83446_UUID id 12345-192.168.52.70 at tcp - client will retry
Nov 9 17:27:47 boss01 kernel: Lustre: besfs-OST0006: haven't heard from client 73c15254-a884-578e-9634-859b44619a4f (at 192.168.52.70 at tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov 9 17:27:48 boss01 kernel: Lustre: besfs-OST0007: haven't heard from client 73c15254-a884-578e-9634-859b44619a4f (at 192.168.52.70 at tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov 9 09:28:05 boss01 sshd[29314]: Connection closed by 192.168.51.130
Nov 9 17:29:17 boss01 ntpd[27872]: kernel time sync enabled 0001
Nov 9 17:56:48 boss01 kernel: Lustre: besfs-OST0005: haven't heard from client c06ff22f-03a6-3897-ec32-1f26f6958e8b (at 202.122.33.83 at tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov 9 17:56:48 boss01 kernel: Lustre: Skipped 2 previous similar messages
Nov 9 17:59:15 boss01 kernel: Lustre: besfs-OST0002: haven't heard from client c06ff22f-03a6-3897-ec32-1f26f6958e8b (at 202.122.33.83 at tcp) in 374 seconds. I think it's dead, and I am evicting it.
Nov 9 17:59:18 boss01 kernel: LustreError: 27250:0:(ost_handler.c:868:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s req at e2ccee00 x36870/t0 o3->7df31bbf-54a5-ada8-abd7-f0920f648d0a at NET_0x20000c0a83446_UUID:0/0 lens 400/336 e 0 to 0 dl 1226224758 ref 1 fl Interpret:/0/0 rc 0/0
Nov 9 17:59:18 boss01 kernel: LustreError: 27250:0:(ost_handler.c:868:ost_brw_read()) Skipped 2 previous similar messages
Nov 9 17:59:18 boss01 kernel: Lustre: 27250:0:(ost_handler.c:925:ost_brw_read()) besfs-OST0007: ignoring bulk IO comm error with 7df31bbf-54a5-ada8-abd7-f0920f648d0a at NET_0x20000c0a83446_UUID id 12345-192.168.52.70 at tcp - client will retry
Nov 9 17:59:18 boss01 kernel: Lustre: 27250:0:(ost_handler.c:925:ost_brw_read()) Skipped 2 previous similar messages
Nov 9 17:59:18 boss01 kernel: LustreError: 29507:0:(ost_handler.c:868:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s req at e01bce00 x36866/t0 o3->7df31bbf-54a5-ada8-abd7-f0920f648d0a at NET_0x20000c0a83446_UUID:0/0 lens 432/336 e 0 to 0 dl 1226224758 ref 1 fl Interpret:/0/0 rc 0/0
Nov 9 17:59:18 boss01 kernel: LustreError: 29507:0:(ost_handler.c:868:ost_brw_read()) Skipped 4 previous similar messages
Nov 9 17:59:18 boss01 kernel: Lustre: 29507:0:(ost_handler.c:925:ost_brw_read()) besfs-OST0005: ignoring bulk IO comm error with 7df31bbf-54a5-ada8-abd7-f0920f648d0a at NET_0x20000c0a83446_UUID id 12345-192.168.52.70 at tcp - client will retry
Nov 9 17:59:18 boss01 kernel: Lustre: 29507:0:(ost_handler.c:925:ost_brw_read()) Skipped 4 previous similar messages
Nov 9 18:01:33 boss01 kernel: Lustre: besfs-OST0007: haven't heard from client c06ff22f-03a6-3897-ec32-1f26f6958e8b (at 202.122.33.83 at tcp) in 512 seconds. I think it's dead, and I am evicting it.
Nov 9 18:04:14 boss01 kernel: Lustre: besfs-OST0007: haven't heard from client 7df31bbf-54a5-ada8-abd7-f0920f648d0a (at 192.168.52.70 at tcp) in 396 seconds. I think it's dead, and I am evicting it.
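To see which clients are evicted most often, the eviction messages above could be tallied per client address. The following is a minimal sketch, not a Lustre tool: the function name count_evictions and the regular expression are my own, and the pattern accepts both "@tcp" (as in the real syslog) and " at tcp" (as the list archive rewrites it). The sample lines are shortened copies of the log lines above.

```python
import re
from collections import Counter

# Matches the eviction message. Both "@tcp" and " at tcp" are accepted
# because mail archives rewrite "@" as " at ".
EVICT_RE = re.compile(
    r"(?P<ost>\S+): haven't heard from client (?P<uuid>\S+) "
    r"\(at (?P<ip>[\d.]+)(?:@| at )tcp\) in (?P<secs>\d+) seconds"
)


def count_evictions(lines):
    """Return a Counter mapping client IP to number of eviction messages."""
    counts = Counter()
    for line in lines:
        match = EVICT_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


sample = [
    "Nov 9 17:27:47 boss01 kernel: Lustre: besfs-OST0006: haven't heard from "
    "client 73c15254-a884-578e-9634-859b44619a4f (at 192.168.52.70 at tcp) "
    "in 227 seconds. I think it's dead, and I am evicting it.",
    "Nov 9 17:56:48 boss01 kernel: Lustre: besfs-OST0005: haven't heard from "
    "client c06ff22f-03a6-3897-ec32-1f26f6958e8b (at 202.122.33.83 at tcp) "
    "in 227 seconds. I think it's dead, and I am evicting it.",
    "Nov 9 18:04:14 boss01 kernel: Lustre: besfs-OST0007: haven't heard from "
    "client 7df31bbf-54a5-ada8-abd7-f0920f648d0a (at 192.168.52.70 at tcp) "
    "in 396 seconds. I think it's dead, and I am evicting it.",
    "Nov 9 17:56:48 boss01 kernel: Lustre: Skipped 2 previous similar messages",
]

evictions = count_evictions(sample)
for ip, n in evictions.most_common():
    print(ip, n)
```

Because the kernel rate-limits repeated messages ("Skipped 2 previous similar messages"), such a count is a lower bound on the real number of evictions.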
The configuration of our system:
OS:Linux 2.6.9-67.0.7.EL_lustre.1.6.5smp
MDS:1
OSS: 2, each with a 10 Gbit/s NIC and 2 directly attached disk arrays.
Clients: 50 nodes (8-core servers), each with a 1 Gbit/s NIC
and the LNET state on boss02:
[root at boss02 ~]# sysctl -q lnet
lnet.nis = nid refs peer max tx min
lnet.nis = 0 at lo 2 0 0 0 0
lnet.nis = 192.168.50.34 at tcp 136 8 256 250 88
lnet.buffers = pages count credits min
lnet.buffers = 0 0 0 0
lnet.buffers = 1 0 0 0
lnet.buffers = 256 0 0 0
lnet.peers = nid refs state max rtr min tx min queue
lnet.peers = 192.168.50.14 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.52.11 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.52.13 at tcp 1 ~rtr 8 8 8 8 -71 0
lnet.peers = 192.168.52.14 at tcp 1 ~rtr 8 8 8 8 -8 0
lnet.peers = 192.168.52.15 at tcp 1 ~rtr 8 8 8 8 -14 0
lnet.peers = 192.168.52.16 at tcp 1 ~rtr 8 8 8 8 -30 0
lnet.peers = 192.168.52.17 at tcp 1 ~rtr 8 8 8 8 -38 0
lnet.peers = 192.168.52.18 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.52.19 at tcp 1 ~rtr 8 8 8 8 -1 0
lnet.peers = 192.168.52.20 at tcp 1 ~rtr 8 8 8 8 -19 0
lnet.peers = 192.168.52.21 at tcp 1 ~rtr 8 8 8 8 3 0
lnet.peers = 192.168.52.22 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.50.32 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.52.23 at tcp 1 ~rtr 8 8 8 8 -6 0
lnet.peers = 192.168.52.24 at tcp 1 ~rtr 8 8 8 8 -50 0
lnet.peers = 192.168.52.25 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.52.26 at tcp 1 ~rtr 8 8 8 8 -2 0
lnet.peers = 192.168.52.27 at tcp 1 ~rtr 8 8 8 8 -31 0
lnet.peers = 192.168.52.28 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.52.29 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.52.30 at tcp 1 ~rtr 8 8 8 8 -31 0
lnet.peers = 192.168.52.31 at tcp 7 ~rtr 8 8 8 2 -10 3318192
lnet.peers = 192.168.52.32 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.52.33 at tcp 1 ~rtr 8 8 8 8 -6 0
lnet.peers = 192.168.52.34 at tcp 1 ~rtr 8 8 8 8 -4 0
lnet.peers = 192.168.52.35 at tcp 1 ~rtr 8 8 8 8 -2 0
lnet.peers = 192.168.52.36 at tcp 1 ~rtr 8 8 8 8 -1 0
lnet.peers = 192.168.52.37 at tcp 1 ~rtr 8 8 8 8 -55 0
lnet.peers = 192.168.52.38 at tcp 1 ~rtr 8 8 8 8 -62 0
lnet.peers = 192.168.52.39 at tcp 1 ~rtr 8 8 8 8 -8 0
lnet.peers = 192.168.52.40 at tcp 1 ~rtr 8 8 8 8 -5 0
lnet.peers = 192.168.52.41 at tcp 1 ~rtr 8 8 8 8 2 0
lnet.peers = 192.168.52.42 at tcp 1 ~rtr 8 8 8 8 -4 0
lnet.peers = 192.168.52.43 at tcp 1 ~rtr 8 8 8 8 -31 0
lnet.peers = 192.168.52.44 at tcp 1 ~rtr 8 8 8 8 -14 0
lnet.peers = 192.168.52.45 at tcp 1 ~rtr 8 8 8 8 -1 0
lnet.peers = 192.168.52.46 at tcp 1 ~rtr 8 8 8 8 -3 0
lnet.peers = 192.168.52.47 at tcp 1 ~rtr 8 8 8 8 -10 0
lnet.peers = 192.168.52.48 at tcp 1 ~rtr 8 8 8 8 -23 0
lnet.peers = 192.168.52.49 at tcp 1 ~rtr 8 8 8 8 -1 0
lnet.peers = 192.168.52.50 at tcp 1 ~rtr 8 8 8 8 -3 0
lnet.peers = 192.168.52.51 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.52.52 at tcp 1 ~rtr 8 8 8 8 -23 0
lnet.peers = 192.168.52.53 at tcp 1 ~rtr 8 8 8 8 -5 0
lnet.peers = 192.168.52.54 at tcp 1 ~rtr 8 8 8 8 -20 0
lnet.peers = 192.168.52.55 at tcp 1 ~rtr 8 8 8 8 -5 0
lnet.peers = 192.168.52.56 at tcp 1 ~rtr 8 8 8 8 1 0
lnet.peers = 192.168.52.57 at tcp 1 ~rtr 8 8 8 8 1 0
lnet.peers = 192.168.52.58 at tcp 1 ~rtr 8 8 8 8 -11 0
lnet.peers = 192.168.52.59 at tcp 1 ~rtr 8 8 8 8 -4 0
lnet.peers = 192.168.52.60 at tcp 1 ~rtr 8 8 8 8 -1 0
lnet.peers = 192.168.52.61 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.52.62 at tcp 1 ~rtr 8 8 8 8 -19 0
lnet.peers = 192.168.52.63 at tcp 1 ~rtr 8 8 8 8 2 0
lnet.peers = 192.168.52.64 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.52.65 at tcp 1 ~rtr 8 8 8 8 2 0
lnet.peers = 192.168.52.66 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.52.67 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.52.68 at tcp 1 ~rtr 8 8 8 8 3 0
lnet.peers = 192.168.52.69 at tcp 1 ~rtr 8 8 8 8 1 0
lnet.peers = 192.168.52.70 at tcp 1 ~rtr 8 8 8 8 -8 0
lnet.peers = 192.168.52.71 at tcp 1 ~rtr 8 8 8 8 -2 0
lnet.peers = 192.168.52.72 at tcp 1 ~rtr 8 8 8 8 2 0
lnet.peers = 192.168.52.73 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.52.74 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.52.75 at tcp 1 ~rtr 8 8 8 8 2 0
lnet.peers = 202.122.33.56 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.52.76 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.52.77 at tcp 1 ~rtr 8 8 8 8 2 0
lnet.peers = 192.168.52.78 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.52.79 at tcp 1 ~rtr 8 8 8 8 -3 0
lnet.peers = 192.168.52.80 at tcp 1 ~rtr 8 8 8 8 2 0
lnet.peers = 192.168.52.81 at tcp 1 ~rtr 8 8 8 8 -1 0
lnet.peers = 192.168.52.82 at tcp 1 ~rtr 8 8 8 8 3 0
lnet.peers = 192.168.52.83 at tcp 1 ~rtr 8 8 8 8 2 0
lnet.peers = 192.168.52.84 at tcp 1 ~rtr 8 8 8 8 1 0
lnet.peers = 192.168.52.86 at tcp 1 ~rtr 8 8 8 8 -12 0
lnet.peers = 192.168.52.87 at tcp 1 ~rtr 8 8 8 8 3 0
lnet.peers = 192.168.52.88 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.52.89 at tcp 1 ~rtr 8 8 8 8 1 0
lnet.peers = 192.168.52.90 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.52.91 at tcp 1 ~rtr 8 8 8 8 3 0
lnet.peers = 192.168.52.92 at tcp 1 ~rtr 8 8 8 8 2 0
lnet.peers = 192.168.52.93 at tcp 1 ~rtr 8 8 8 8 -14 0
lnet.peers = 192.168.52.94 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.52.95 at tcp 1 ~rtr 8 8 8 8 -19 0
lnet.peers = 192.168.52.96 at tcp 1 ~rtr 8 8 8 8 -1 0
lnet.peers = 192.168.52.97 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.52.98 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.52.99 at tcp 1 ~rtr 8 8 8 8 -3 0
lnet.peers = 192.168.52.100 at tcp 1 ~rtr 8 8 8 8 -4 0
lnet.peers = 192.168.52.101 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 202.122.33.82 at tcp 1 ~rtr 8 8 8 8 -6383 0
lnet.peers = 192.168.52.102 at tcp 1 ~rtr 8 8 8 8 -1 0
lnet.peers = 202.122.33.83 at tcp 1 ~rtr 8 8 8 8 -6 0
lnet.peers = 192.168.52.103 at tcp 1 ~rtr 8 8 8 8 2 0
lnet.peers = 202.122.33.84 at tcp 1 ~rtr 8 8 8 8 -649 0
lnet.peers = 192.168.52.104 at tcp 1 ~rtr 8 8 8 8 -6 0
lnet.peers = 192.168.52.105 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.52.106 at tcp 1 ~rtr 8 8 8 8 -15 0
lnet.peers = 192.168.52.107 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.52.108 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.52.109 at tcp 1 ~rtr 8 8 8 8 -79 0
lnet.peers = 192.168.52.110 at tcp 1 ~rtr 8 8 8 8 -24 0
lnet.peers = 192.168.52.111 at tcp 1 ~rtr 8 8 8 8 -102 0
lnet.peers = 192.168.52.112 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 202.122.33.92 at tcp 1 ~rtr 8 8 8 8 -1148 0
lnet.peers = 202.122.33.93 at tcp 1 ~rtr 8 8 8 8 -5 0
lnet.peers = 192.168.52.113 at tcp 1 ~rtr 8 8 8 8 -55 0
lnet.peers = 192.168.52.114 at tcp 1 ~rtr 8 8 8 8 -73 0
lnet.peers = 192.168.52.115 at tcp 1 ~rtr 8 8 8 8 -6 0
lnet.peers = 202.122.33.95 at tcp 1 ~rtr 8 8 8 8 -1914 0
lnet.peers = 192.168.52.116 at tcp 1 ~rtr 8 8 8 8 -4 0
lnet.peers = 192.168.52.117 at tcp 1 ~rtr 8 8 8 8 -1 0
lnet.peers = 192.168.52.118 at tcp 1 ~rtr 8 8 8 8 -55 0
lnet.peers = 192.168.52.119 at tcp 1 ~rtr 8 8 8 8 -1 0
lnet.peers = 192.168.52.120 at tcp 1 ~rtr 8 8 8 8 1 0
lnet.peers = 192.168.52.121 at tcp 1 ~rtr 8 8 8 8 -54 0
lnet.peers = 192.168.52.122 at tcp 1 ~rtr 8 8 8 8 -65 0
lnet.peers = 192.168.52.123 at tcp 1 ~rtr 8 8 8 8 -16 0
lnet.peers = 192.168.52.124 at tcp 1 ~rtr 8 8 8 8 -32 0
lnet.peers = 192.168.52.125 at tcp 1 ~rtr 8 8 8 8 -158 0
lnet.peers = 192.168.52.126 at tcp 1 ~rtr 8 8 8 8 0 0
lnet.peers = 192.168.52.127 at tcp 1 ~rtr 8 8 8 8 -2 0
lnet.peers = 192.168.52.128 at tcp 1 ~rtr 8 8 8 8 -36 0
lnet.peers = 192.168.52.129 at tcp 1 ~rtr 8 8 8 8 -120 0
lnet.peers = 192.168.52.130 at tcp 1 ~rtr 8 8 8 8 2 0
lnet.peers = 192.168.52.131 at tcp 1 ~rtr 8 8 8 8 -82 0
lnet.peers = 192.168.55.11 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.55.12 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.55.13 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.55.14 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.55.15 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.55.16 at tcp 1 ~rtr 8 8 8 8 4 0
lnet.peers = 192.168.51.134 at tcp 1 ~rtr 8 8 8 8 -631 0
lnet.routers = ref rtr_ref alive_cnt state last_ping router
lnet.routes = Routing disabled
lnet.routes = net hops state router
lnet.stats = 7 6513 0 349123954 349123978 0 25 9871897726514 80688968391 0 7600
lnet.debug_mb = 41
lnet.panic_on_lbug = 0
lnet.catastrophe = 0
lnet.memused = 4166984
lnet.upcall = /usr/lib/lustre/lnet_upcall
lnet.debug_path = /tmp/lustre-log
lnet.console_backoff = 2
lnet.console_min_delay_centisecs = 50
lnet.console_max_delay_centisecs = 60000
lnet.console_ratelimit = 1
lnet.printk = warning error emerg console
lnet.subsystem_debug = undefined mdc mds osc ost class log llite rpc lnet lnd pinger filter echo ldlm lov lmv sec gss mgc mgs fid fld
lnet.debug = ioctl neterror warning error emerg ha config console
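The lnet.peers table above is long, so a small script can pick out the peers whose second "min" column (the one after "tx") went most negative. As I understand it, that column is the low-water mark of transmit credits for the peer, and a negative value means messages had to queue waiting for credits; please treat that reading as an assumption. The function names parse_peer and worst_peers are hypothetical, and the sample rows are copied from the output above.

```python
def parse_peer(line):
    """Parse one 'lnet.peers = ...' row; return (nid, tx_min) or None."""
    prefix = "lnet.peers ="
    if not line.startswith(prefix):
        return None
    fields = line[len(prefix):].split()
    # The last 8 columns are: refs state max rtr min tx min queue.
    # Everything before them is the NID ("a.b.c.d at tcp" in the archive).
    if len(fields) < 9:
        return None
    try:
        tx_min = int(fields[-2])
    except ValueError:
        return None  # header row, not a peer
    nid = " ".join(fields[:-8])
    return nid, tx_min


def worst_peers(lines, limit=5):
    """Return up to `limit` peers with a negative tx min, lowest first."""
    peers = [p for p in map(parse_peer, lines) if p and p[1] < 0]
    return sorted(peers, key=lambda p: p[1])[:limit]


sample = [
    "lnet.peers = nid refs state max rtr min tx min queue",
    "lnet.peers = 192.168.50.14 at tcp 1 ~rtr 8 8 8 8 4 0",
    "lnet.peers = 192.168.52.13 at tcp 1 ~rtr 8 8 8 8 -71 0",
    "lnet.peers = 192.168.52.31 at tcp 7 ~rtr 8 8 8 2 -10 3318192",
    "lnet.peers = 202.122.33.82 at tcp 1 ~rtr 8 8 8 8 -6383 0",
    "lnet.routes = Routing disabled",
]

for nid, tx_min in worst_peers(sample, limit=3):
    print(f"{nid:28s} {tx_min}")
```

On a live server the same functions could be fed the saved output of the sysctl command shown above; the NIDs would then read "a.b.c.d@tcp" rather than "a.b.c.d at tcp".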
My questions are:
1. What are the signs that Lustre is overloaded?
2. Can Lustre reject excess connections before it gets to the point of crashing?