[Lustre-discuss] Frequent OSS Crashes with heavy load

wanglu wanglu at ihep.ac.cn
Sun Nov 9 22:50:15 PST 2008


Dear list, 

     Our Lustre OSS nodes have been crashing frequently lately while under a high load average.

1) # top
 top - 14:32:57 up 18:15,  1 user,  load average: 25.05, 24.27, 24.47
Mem:   8307364k total,   859724k used,  7447640k free,   234288k buffers
Swap: 16386292k total,        0k used, 16386292k free,    37932k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                              
26695 root      15   0     0    0    0 S  7.6  0.0  51:57.40 socknal_sd04                                                                                                         
26694 root      15   0     0    0    0 S  6.6  0.0  53:44.42 socknal_sd03                                                                                                         
26691 root      15   0     0    0    0 S  5.6  0.0  51:11.76 socknal_sd00                                                                                                         
26697 root      15   0     0    0    0 S  5.3  0.0  42:12.23 socknal_sd06                                                                                                         
26696 root      15   0     0    0    0 S  3.3  0.0  52:47.42 socknal_sd05                                                                                                         
26692 root      15   0     0    0    0 S  2.3  0.0  26:19.46 socknal_sd01                                                                                                         
26693 root      15   0     0    0    0 S  2.3  0.0  32:38.21 socknal_sd02                                                                                                         
26952 root      15   0     0    0    0 S  1.0  0.0   2:06.69 ll_ost_io_09                                                                                                         
....


2) iostat -x 5 
Linux 2.6.9-67.0.7.EL_lustre.1.6.5smp (boss01.ihep.ac.cn)       11/10/2008

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.00    0.00   11.33    4.56   84.10

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
cciss/c0d0   1.05   0.43  0.27  0.41    9.78    6.65     4.89     3.32    24.31     0.01   17.15   5.78   0.39
sda          3.46   0.64 1297.05  0.60 2588.12  118.12  1294.06    59.06     2.09    22.81   15.69   0.77  99.57
sdb          3.09   0.28 1274.46  0.18 1541.21   23.54   770.60    11.77     1.23    16.75   12.16   0.78  99.56

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.00    0.00   11.53    0.10   88.38

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
cciss/c0d0   0.00   1.80  0.00  0.00    0.00   16.00     0.00     8.00     0.00     0.00    0.00   0.00   0.00
sda          3.20   0.00 1436.60  0.00 130524.80    0.00 65262.40     0.00    90.86    16.29   10.73   0.70 100.00
sdb          3.40   0.00 1142.20  0.00 124113.60    0.00 62056.80     0.00   108.66    10.44    8.24   0.87  99.80
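
To make the saturated devices easier to pick out of the iostat output above, here is a minimal Python sketch. It assumes the column layout of the sysstat version shown here (a `Device:` header followed by one row per device); the function names `parse_iostat` and `saturated` and the 99% threshold are my own choices for illustration, not part of any tool. Note that the first report iostat prints covers the time since boot, so the sample below uses the second, 5-second interval report.

```python
# Sketch only: parse one extended iostat report and list devices whose
# %util is at or above a threshold. Column layout is assumed to match
# the output pasted above.

SAMPLE = """\
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
cciss/c0d0   0.00   1.80  0.00  0.00    0.00   16.00     0.00     8.00     0.00     0.00    0.00   0.00   0.00
sda          3.20   0.00 1436.60  0.00 130524.80    0.00 65262.40     0.00    90.86    16.29   10.73   0.70 100.00
sdb          3.40   0.00 1142.20  0.00 124113.60    0.00 62056.80     0.00   108.66    10.44    8.24   0.87  99.80
"""


def parse_iostat(text):
    """Return {device: {column_name: float}} for one iostat -x report."""
    header = None
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Device:":
            header = fields[1:]
            continue
        # Skip avg-cpu lines and anything that is not a full device row.
        if header is None or len(fields) != len(header) + 1:
            continue
        stats[fields[0]] = dict(zip(header, (float(f) for f in fields[1:])))
    return stats


def saturated(stats, util_threshold=99.0):
    """Names of devices whose %util is at or above the threshold, sorted."""
    return sorted(d for d, s in stats.items() if s["%util"] >= util_threshold)


stats = parse_iostat(SAMPLE)
for dev in saturated(stats):
    s = stats[dev]
    print("%s: %.2f%% util, %.1f MB/s read, await %.2f ms"
          % (dev, s["%util"], s["rkB/s"] / 1024.0, s["await"]))
```

On the sample rows this reports sda and sdb, both of which are almost entirely busy with reads, while the system disk cciss/c0d0 is idle.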


Before each crash, there are LustreError messages like these:

Nov  9 17:25:41 boss01 kernel: LustreError: 27327:0:(ost_handler.c:868:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s  req at e3df8e00 x133017/t0 o3->73c15254-a884-578e-9634-859b44619a4f at NET_0x20000c0a83446_UUID:0/0 lens 400/336 e 0 to 0 dl 1226222741 ref 1 fl Interpret:/0/0 rc 0/0
Nov  9 17:25:41 boss01 kernel: Lustre: 27327:0:(ost_handler.c:925:ost_brw_read()) besfs-OST0005: ignoring bulk IO comm error with 73c15254-a884-578e-9634-859b44619a4f at NET_0x20000c0a83446_UUID id 12345-192.168.52.70 at tcp - client will retry
Nov  9 17:27:47 boss01 kernel: Lustre: besfs-OST0006: haven't heard from client 73c15254-a884-578e-9634-859b44619a4f (at 192.168.52.70 at tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov  9 17:27:48 boss01 kernel: Lustre: besfs-OST0007: haven't heard from client 73c15254-a884-578e-9634-859b44619a4f (at 192.168.52.70 at tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov  9 09:28:05 boss01 sshd[29314]: Connection closed by 192.168.51.130
Nov  9 17:29:17 boss01 ntpd[27872]: kernel time sync enabled 0001
Nov  9 17:56:48 boss01 kernel: Lustre: besfs-OST0005: haven't heard from client c06ff22f-03a6-3897-ec32-1f26f6958e8b (at 202.122.33.83 at tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov  9 17:56:48 boss01 kernel: Lustre: Skipped 2 previous similar messages
Nov  9 17:59:15 boss01 kernel: Lustre: besfs-OST0002: haven't heard from client c06ff22f-03a6-3897-ec32-1f26f6958e8b (at 202.122.33.83 at tcp) in 374 seconds. I think it's dead, and I am evicting it.
Nov  9 17:59:18 boss01 kernel: LustreError: 27250:0:(ost_handler.c:868:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s  req at e2ccee00 x36870/t0 o3->7df31bbf-54a5-ada8-abd7-f0920f648d0a at NET_0x20000c0a83446_UUID:0/0 lens 400/336 e 0 to 0 dl 1226224758 ref 1 fl Interpret:/0/0 rc 0/0
Nov  9 17:59:18 boss01 kernel: LustreError: 27250:0:(ost_handler.c:868:ost_brw_read()) Skipped 2 previous similar messages
Nov  9 17:59:18 boss01 kernel: Lustre: 27250:0:(ost_handler.c:925:ost_brw_read()) besfs-OST0007: ignoring bulk IO comm error with 7df31bbf-54a5-ada8-abd7-f0920f648d0a at NET_0x20000c0a83446_UUID id 12345-192.168.52.70 at tcp - client will retry
Nov  9 17:59:18 boss01 kernel: Lustre: 27250:0:(ost_handler.c:925:ost_brw_read()) Skipped 2 previous similar messages
Nov  9 17:59:18 boss01 kernel: LustreError: 29507:0:(ost_handler.c:868:ost_brw_read()) @@@ timeout on bulk PUT after 100+0s  req at e01bce00 x36866/t0 o3->7df31bbf-54a5-ada8-abd7-f0920f648d0a at NET_0x20000c0a83446_UUID:0/0 lens 432/336 e 0 to 0 dl 1226224758 ref 1 fl Interpret:/0/0 rc 0/0
Nov  9 17:59:18 boss01 kernel: LustreError: 29507:0:(ost_handler.c:868:ost_brw_read()) Skipped 4 previous similar messages
Nov  9 17:59:18 boss01 kernel: Lustre: 29507:0:(ost_handler.c:925:ost_brw_read()) besfs-OST0005: ignoring bulk IO comm error with 7df31bbf-54a5-ada8-abd7-f0920f648d0a at NET_0x20000c0a83446_UUID id 12345-192.168.52.70 at tcp - client will retry
Nov  9 17:59:18 boss01 kernel: Lustre: 29507:0:(ost_handler.c:925:ost_brw_read()) Skipped 4 previous similar messages
Nov  9 18:01:33 boss01 kernel: Lustre: besfs-OST0007: haven't heard from client c06ff22f-03a6-3897-ec32-1f26f6958e8b (at 202.122.33.83 at tcp) in 512 seconds. I think it's dead, and I am evicting it.
Nov  9 18:04:14 boss01 kernel: Lustre: besfs-OST0007: haven't heard from client 7df31bbf-54a5-ada8-abd7-f0920f648d0a (at 192.168.52.70 at tcp) in 396 seconds. I think it's dead, and I am evicting it.
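
To see which clients are involved most often, the messages above can be tallied per client address. The following Python sketch is only an illustration: the regular expressions are written against the exact message wording shown in this log (they accept both `@tcp` and the archive's ` at tcp` spelling), and the names `EVICT_RE`, `BULK_RE` and `summarize` are hypothetical. The sample lines are copied from the log above.

```python
import re
from collections import Counter

# Patterns assumed from the message text in this log only.
EVICT_RE = re.compile(
    r"haven't heard from client (?P<uuid>\S+) "
    r"\(at (?P<nid>[\d.]+)(?:@| at )tcp\) in (?P<secs>\d+) seconds"
)
BULK_RE = re.compile(
    r"ignoring bulk IO comm error with .*? id \d+-(?P<nid>[\d.]+)(?:@| at )tcp"
)

SAMPLE = """\
Nov  9 17:25:41 boss01 kernel: Lustre: 27327:0:(ost_handler.c:925:ost_brw_read()) besfs-OST0005: ignoring bulk IO comm error with 73c15254-a884-578e-9634-859b44619a4f at NET_0x20000c0a83446_UUID id 12345-192.168.52.70 at tcp - client will retry
Nov  9 17:27:47 boss01 kernel: Lustre: besfs-OST0006: haven't heard from client 73c15254-a884-578e-9634-859b44619a4f (at 192.168.52.70 at tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov  9 17:27:48 boss01 kernel: Lustre: besfs-OST0007: haven't heard from client 73c15254-a884-578e-9634-859b44619a4f (at 192.168.52.70 at tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov  9 17:56:48 boss01 kernel: Lustre: besfs-OST0005: haven't heard from client c06ff22f-03a6-3897-ec32-1f26f6958e8b (at 202.122.33.83 at tcp) in 227 seconds. I think it's dead, and I am evicting it.
"""


def summarize(lines):
    """Count eviction and bulk I/O error messages per client address."""
    evictions = Counter()
    bulk_errors = Counter()
    for line in lines:
        m = EVICT_RE.search(line)
        if m:
            evictions[m.group("nid")] += 1
            continue
        m = BULK_RE.search(line)
        if m:
            bulk_errors[m.group("nid")] += 1
    return evictions, bulk_errors


evictions, bulk_errors = summarize(SAMPLE.splitlines())
for nid, count in evictions.most_common():
    print("%s: %d eviction(s), %d bulk error(s)"
          % (nid, count, bulk_errors[nid]))
```

Run against the full /var/log/messages, the same tally would show whether the evictions are concentrated on a few clients such as 192.168.52.70 or spread across all of them.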

The configuration of our system:
OS: Linux 2.6.9-67.0.7.EL_lustre.1.6.5smp
MDS: 1
OSS: 2, each with a 10 Gbit/s NIC and 2 directly attached disk arrays
Clients: 50 nodes (8-core servers), each with a 1 Gbit/s NIC

and the LNET state:

[root at boss02 ~]# sysctl -q lnet
lnet.nis = nid                      refs peer   max    tx   min
lnet.nis = 0 at lo                        2    0     0     0     0
lnet.nis = 192.168.50.34 at tcp         136    8   256   250    88
lnet.buffers = pages count credits     min
lnet.buffers =     0     0       0       0
lnet.buffers =     1     0       0       0
lnet.buffers =   256     0       0       0
lnet.peers = nid                      refs state   max   rtr   min    tx   min queue
lnet.peers = 192.168.50.14 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.52.11 at tcp           1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.52.13 at tcp           1  ~rtr     8     8     8     8   -71 0
lnet.peers = 192.168.52.14 at tcp           1  ~rtr     8     8     8     8    -8 0
lnet.peers = 192.168.52.15 at tcp           1  ~rtr     8     8     8     8   -14 0
lnet.peers = 192.168.52.16 at tcp           1  ~rtr     8     8     8     8   -30 0
lnet.peers = 192.168.52.17 at tcp           1  ~rtr     8     8     8     8   -38 0
lnet.peers = 192.168.52.18 at tcp           1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.52.19 at tcp           1  ~rtr     8     8     8     8    -1 0
lnet.peers = 192.168.52.20 at tcp           1  ~rtr     8     8     8     8   -19 0
lnet.peers = 192.168.52.21 at tcp           1  ~rtr     8     8     8     8     3 0
lnet.peers = 192.168.52.22 at tcp           1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.50.32 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.52.23 at tcp           1  ~rtr     8     8     8     8    -6 0
lnet.peers = 192.168.52.24 at tcp           1  ~rtr     8     8     8     8   -50 0
lnet.peers = 192.168.52.25 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.52.26 at tcp           1  ~rtr     8     8     8     8    -2 0
lnet.peers = 192.168.52.27 at tcp           1  ~rtr     8     8     8     8   -31 0
lnet.peers = 192.168.52.28 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.52.29 at tcp           1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.52.30 at tcp           1  ~rtr     8     8     8     8   -31 0
lnet.peers = 192.168.52.31 at tcp           7  ~rtr     8     8     8     2   -10 3318192
lnet.peers = 192.168.52.32 at tcp           1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.52.33 at tcp           1  ~rtr     8     8     8     8    -6 0
lnet.peers = 192.168.52.34 at tcp           1  ~rtr     8     8     8     8    -4 0
lnet.peers = 192.168.52.35 at tcp           1  ~rtr     8     8     8     8    -2 0
lnet.peers = 192.168.52.36 at tcp           1  ~rtr     8     8     8     8    -1 0
lnet.peers = 192.168.52.37 at tcp           1  ~rtr     8     8     8     8   -55 0
lnet.peers = 192.168.52.38 at tcp           1  ~rtr     8     8     8     8   -62 0
lnet.peers = 192.168.52.39 at tcp           1  ~rtr     8     8     8     8    -8 0
lnet.peers = 192.168.52.40 at tcp           1  ~rtr     8     8     8     8    -5 0
lnet.peers = 192.168.52.41 at tcp           1  ~rtr     8     8     8     8     2 0
lnet.peers = 192.168.52.42 at tcp           1  ~rtr     8     8     8     8    -4 0
lnet.peers = 192.168.52.43 at tcp           1  ~rtr     8     8     8     8   -31 0
lnet.peers = 192.168.52.44 at tcp           1  ~rtr     8     8     8     8   -14 0
lnet.peers = 192.168.52.45 at tcp           1  ~rtr     8     8     8     8    -1 0
lnet.peers = 192.168.52.46 at tcp           1  ~rtr     8     8     8     8    -3 0
lnet.peers = 192.168.52.47 at tcp           1  ~rtr     8     8     8     8   -10 0
lnet.peers = 192.168.52.48 at tcp           1  ~rtr     8     8     8     8   -23 0
lnet.peers = 192.168.52.49 at tcp           1  ~rtr     8     8     8     8    -1 0
lnet.peers = 192.168.52.50 at tcp           1  ~rtr     8     8     8     8    -3 0
lnet.peers = 192.168.52.51 at tcp           1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.52.52 at tcp           1  ~rtr     8     8     8     8   -23 0
lnet.peers = 192.168.52.53 at tcp           1  ~rtr     8     8     8     8    -5 0
lnet.peers = 192.168.52.54 at tcp           1  ~rtr     8     8     8     8   -20 0
lnet.peers = 192.168.52.55 at tcp           1  ~rtr     8     8     8     8    -5 0
lnet.peers = 192.168.52.56 at tcp           1  ~rtr     8     8     8     8     1 0
lnet.peers = 192.168.52.57 at tcp           1  ~rtr     8     8     8     8     1 0
lnet.peers = 192.168.52.58 at tcp           1  ~rtr     8     8     8     8   -11 0
lnet.peers = 192.168.52.59 at tcp           1  ~rtr     8     8     8     8    -4 0
lnet.peers = 192.168.52.60 at tcp           1  ~rtr     8     8     8     8    -1 0
lnet.peers = 192.168.52.61 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.52.62 at tcp           1  ~rtr     8     8     8     8   -19 0
lnet.peers = 192.168.52.63 at tcp           1  ~rtr     8     8     8     8     2 0
lnet.peers = 192.168.52.64 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.52.65 at tcp           1  ~rtr     8     8     8     8     2 0
lnet.peers = 192.168.52.66 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.52.67 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.52.68 at tcp           1  ~rtr     8     8     8     8     3 0
lnet.peers = 192.168.52.69 at tcp           1  ~rtr     8     8     8     8     1 0
lnet.peers = 192.168.52.70 at tcp           1  ~rtr     8     8     8     8    -8 0
lnet.peers = 192.168.52.71 at tcp           1  ~rtr     8     8     8     8    -2 0
lnet.peers = 192.168.52.72 at tcp           1  ~rtr     8     8     8     8     2 0
lnet.peers = 192.168.52.73 at tcp           1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.52.74 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.52.75 at tcp           1  ~rtr     8     8     8     8     2 0
lnet.peers = 202.122.33.56 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.52.76 at tcp           1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.52.77 at tcp           1  ~rtr     8     8     8     8     2 0
lnet.peers = 192.168.52.78 at tcp           1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.52.79 at tcp           1  ~rtr     8     8     8     8    -3 0
lnet.peers = 192.168.52.80 at tcp           1  ~rtr     8     8     8     8     2 0
lnet.peers = 192.168.52.81 at tcp           1  ~rtr     8     8     8     8    -1 0
lnet.peers = 192.168.52.82 at tcp           1  ~rtr     8     8     8     8     3 0
lnet.peers = 192.168.52.83 at tcp           1  ~rtr     8     8     8     8     2 0
lnet.peers = 192.168.52.84 at tcp           1  ~rtr     8     8     8     8     1 0
lnet.peers = 192.168.52.86 at tcp           1  ~rtr     8     8     8     8   -12 0
lnet.peers = 192.168.52.87 at tcp           1  ~rtr     8     8     8     8     3 0
lnet.peers = 192.168.52.88 at tcp           1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.52.89 at tcp           1  ~rtr     8     8     8     8     1 0
lnet.peers = 192.168.52.90 at tcp           1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.52.91 at tcp           1  ~rtr     8     8     8     8     3 0
lnet.peers = 192.168.52.92 at tcp           1  ~rtr     8     8     8     8     2 0
lnet.peers = 192.168.52.93 at tcp           1  ~rtr     8     8     8     8   -14 0
lnet.peers = 192.168.52.94 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.52.95 at tcp           1  ~rtr     8     8     8     8   -19 0
lnet.peers = 192.168.52.96 at tcp           1  ~rtr     8     8     8     8    -1 0
lnet.peers = 192.168.52.97 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.52.98 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.52.99 at tcp           1  ~rtr     8     8     8     8    -3 0
lnet.peers = 192.168.52.100 at tcp          1  ~rtr     8     8     8     8    -4 0
lnet.peers = 192.168.52.101 at tcp          1  ~rtr     8     8     8     8     4 0
lnet.peers = 202.122.33.82 at tcp           1  ~rtr     8     8     8     8 -6383 0
lnet.peers = 192.168.52.102 at tcp          1  ~rtr     8     8     8     8    -1 0
lnet.peers = 202.122.33.83 at tcp           1  ~rtr     8     8     8     8    -6 0
lnet.peers = 192.168.52.103 at tcp          1  ~rtr     8     8     8     8     2 0
lnet.peers = 202.122.33.84 at tcp           1  ~rtr     8     8     8     8  -649 0
lnet.peers = 192.168.52.104 at tcp          1  ~rtr     8     8     8     8    -6 0
lnet.peers = 192.168.52.105 at tcp          1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.52.106 at tcp          1  ~rtr     8     8     8     8   -15 0
lnet.peers = 192.168.52.107 at tcp          1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.52.108 at tcp          1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.52.109 at tcp          1  ~rtr     8     8     8     8   -79 0
lnet.peers = 192.168.52.110 at tcp          1  ~rtr     8     8     8     8   -24 0
lnet.peers = 192.168.52.111 at tcp          1  ~rtr     8     8     8     8  -102 0
lnet.peers = 192.168.52.112 at tcp          1  ~rtr     8     8     8     8     0 0
lnet.peers = 202.122.33.92 at tcp           1  ~rtr     8     8     8     8 -1148 0
lnet.peers = 202.122.33.93 at tcp           1  ~rtr     8     8     8     8    -5 0
lnet.peers = 192.168.52.113 at tcp          1  ~rtr     8     8     8     8   -55 0
lnet.peers = 192.168.52.114 at tcp          1  ~rtr     8     8     8     8   -73 0
lnet.peers = 192.168.52.115 at tcp          1  ~rtr     8     8     8     8    -6 0
lnet.peers = 202.122.33.95 at tcp           1  ~rtr     8     8     8     8 -1914 0
lnet.peers = 192.168.52.116 at tcp          1  ~rtr     8     8     8     8    -4 0
lnet.peers = 192.168.52.117 at tcp          1  ~rtr     8     8     8     8    -1 0
lnet.peers = 192.168.52.118 at tcp          1  ~rtr     8     8     8     8   -55 0
lnet.peers = 192.168.52.119 at tcp          1  ~rtr     8     8     8     8    -1 0
lnet.peers = 192.168.52.120 at tcp          1  ~rtr     8     8     8     8     1 0
lnet.peers = 192.168.52.121 at tcp          1  ~rtr     8     8     8     8   -54 0
lnet.peers = 192.168.52.122 at tcp          1  ~rtr     8     8     8     8   -65 0
lnet.peers = 192.168.52.123 at tcp          1  ~rtr     8     8     8     8   -16 0
lnet.peers = 192.168.52.124 at tcp          1  ~rtr     8     8     8     8   -32 0
lnet.peers = 192.168.52.125 at tcp          1  ~rtr     8     8     8     8  -158 0
lnet.peers = 192.168.52.126 at tcp          1  ~rtr     8     8     8     8     0 0
lnet.peers = 192.168.52.127 at tcp          1  ~rtr     8     8     8     8    -2 0
lnet.peers = 192.168.52.128 at tcp          1  ~rtr     8     8     8     8   -36 0
lnet.peers = 192.168.52.129 at tcp          1  ~rtr     8     8     8     8  -120 0
lnet.peers = 192.168.52.130 at tcp          1  ~rtr     8     8     8     8     2 0
lnet.peers = 192.168.52.131 at tcp          1  ~rtr     8     8     8     8   -82 0
lnet.peers = 192.168.55.11 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.55.12 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.55.13 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.55.14 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.55.15 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.55.16 at tcp           1  ~rtr     8     8     8     8     4 0
lnet.peers = 192.168.51.134 at tcp          1  ~rtr     8     8     8     8  -631 0
lnet.routers = ref  rtr_ref alive_cnt  state    last_ping router
lnet.routes = Routing disabled
lnet.routes = net      hops   state router
lnet.stats = 7 6513 0 349123954 349123978 0 25 9871897726514 80688968391 0 7600
lnet.debug_mb = 41
lnet.panic_on_lbug = 0
lnet.catastrophe = 0
lnet.memused = 4166984
lnet.upcall = /usr/lib/lustre/lnet_upcall
lnet.debug_path = /tmp/lustre-log
lnet.console_backoff = 2
lnet.console_min_delay_centisecs = 50
lnet.console_max_delay_centisecs = 60000
lnet.console_ratelimit = 1
lnet.printk = warning error emerg console
lnet.subsystem_debug = undefined mdc mds osc ost class log llite rpc lnet lnd pinger filter echo ldlm lov lmv sec gss mgc mgs fid fld
lnet.debug = ioctl neterror warning error emerg ha config console
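
The lnet.peers table above is long, so here is a minimal Python sketch for pulling out the peers with the lowest values in the second "min" column. My understanding, which should be treated as an assumption, is that this column is the low-water mark of the peer's tx credits and that a negative value means sends to that peer had to queue while waiting for credits. The field names in `Peer` and the function names are my own labels for the columns, and the sample rows are copied from the output above.

```python
from collections import namedtuple

# Column labels are my own; the header in the sysctl output only says
# "nid refs state max rtr min tx min queue".
Peer = namedtuple(
    "Peer", "nid refs state max_credits rtr rtr_min tx tx_min queue")

PREFIX = "lnet.peers = "

SAMPLE = """\
lnet.peers = nid                      refs state   max   rtr   min    tx   min queue
lnet.peers = 192.168.52.13 at tcp           1  ~rtr     8     8     8     8   -71 0
lnet.peers = 192.168.52.31 at tcp           7  ~rtr     8     8     8     2   -10 3318192
lnet.peers = 202.122.33.82 at tcp           1  ~rtr     8     8     8     8 -6383 0
lnet.peers = 192.168.55.11 at tcp           1  ~rtr     8     8     8     8     4 0
"""


def parse_peers(text):
    """Parse 'lnet.peers = ...' rows from sysctl output into Peer tuples."""
    peers = []
    for line in text.splitlines():
        if not line.startswith(PREFIX):
            continue
        # The list archive rewrote "@" as " at "; undo that before splitting.
        fields = line[len(PREFIX):].replace(" at ", "@").split()
        if len(fields) != 9 or fields[0] == "nid":
            continue  # header row or unexpected layout
        numbers = [int(f) for f in fields[3:]]
        peers.append(Peer(fields[0], int(fields[1]), fields[2], *numbers))
    return peers


def most_starved(peers, limit=3):
    """Peers whose tx low-water mark went negative, lowest first."""
    starved = [p for p in peers if p.tx_min < 0]
    return sorted(starved, key=lambda p: p.tx_min)[:limit]


for p in most_starved(parse_peers(SAMPLE)):
    print("%-20s tx_min=%6d queue=%d" % (p.nid, p.tx_min, p.queue))
```

In the full table the lowest values belong to the 202.122.33.x peers (for example -6383 for 202.122.33.82), and 192.168.52.31 is the only peer with a non-zero queue.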


My questions are:
1. What are the signs that Lustre is overloaded?
2. Can Lustre reject excess connections before it crashes?