[Lustre-discuss] A Failed client soft lockup one OSS

Lu Wang wanglu at ihep.ac.cn
Fri Mar 26 00:36:58 PDT 2010


Dear list,

We have found a bug in Lustre 1.8.1.1. Sometimes the failure of a single client causes a soft lockup on an OSS. The affected OSS shows high system CPU usage (System%) and then becomes intermittently unreachable via "lctl ping". The system log shows the following:
       
Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:02:11 boss34 last message repeated 36 times
Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:02:11 boss34 last message repeated 36 times
Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:02:11 boss34 last message repeated 36 times
Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:02:11 boss34 last message repeated 36 times
Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:02:11 boss34 last message repeated 36 times
Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:02:11 boss34 last message repeated 36 times
Mar 23 01:02:11 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:05:45 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:05:45 boss34 kernel: CPU 11:
Mar 23 01:05:45 boss34 kernel: Modules linked in: autofs4(U) hidp(U) obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) parport_pc(U) lp(U) parport(U) sr_mod(U) cdrom(U) sg(U) shpchp(U) ixgbe(U) pcspkr(U) serio_raw(U) hpilo(U) bnx2(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) qla2xxx(U) scsi_transport_fc(U) ata_piix(U) libata(U) cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Mar 23 01:05:45 boss34 kernel: Pid: 5781, comm: ll_ost_405 Tainted: G      2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Mar 23 01:05:45 boss34 kernel: RIP: 0010:[<ffffffff8006dc7e>]  [<ffffffff8006dc7e>] do_gettimeofday+0x2c/0x8f
Mar 23 01:05:45 boss34 kernel: RSP: 0018:ffff8103065f5200  EFLAGS: 00000246
Mar 23 01:05:45 boss34 kernel: RAX: 0000000000000001 RBX: ffff8103065f5230 RCX: 000000003021bf6a
Mar 23 01:05:45 boss34 kernel: RDX: 0008b6f19d2513c6 RSI: 000000000008b6f1 RDI: ffff8103065f5230
Mar 23 01:05:45 boss34 kernel: RBP: ffff8105f74f1600 R08: 0000000000000000 R09: 0000000000000567
Mar 23 01:05:45 boss34 kernel: R10: ffffffff88708c7b R11: ffffffff88708c4e R12: 000000003021b1eb
Mar 23 01:05:45 boss34 kernel: R13: 0000000000000018 R14: ffff8103065f5170 R15: 0000000000000206
Mar 23 01:05:45 boss34 kernel: FS:  00002ac00ee25220(0000) GS:ffff81032a9d7540(0000) knlGS:0000000000000000
Mar 23 01:05:45 boss34 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Mar 23 01:05:45 boss34 kernel: CR2: 00000000f7f7b000 CR3: 0000000000201000 CR4: 00000000000006e0
Mar 23 01:05:45 boss34 kernel:
Mar 23 01:05:45 boss34 kernel: Call Trace:
Mar 23 01:05:45 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:05:46 boss34 last message repeated 37 times
Mar 23 01:05:46 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:05:46 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:05:46 boss34 last message repeated 36 times
Mar 23 01:05:46 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:Lustre: 5781:0Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
Mar 23 01:05:46 boss34 kernel: Lustre: 5781:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.57.170 at tcp
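
To confirm that all of these messages point at the same peer, a small Python sketch like the one below could tally the addresses that appear after "No usable routes to". The function name unreachable_peers is only illustrative, and the pattern assumes the message format shown above; it accepts both the " at " form that appears in this mail and the usual "@" form of a NID. It counts only explicit occurrences and ignores the "last message repeated N times" lines.

```python
import re
from collections import Counter

# Matches e.g. "No usable routes to 12345-192.168.57.170 at tcp"
# (or "...170@tcp"). The network name is restricted to lowercase
# letters and digits so that run-together log lines still parse.
ROUTE_RE = re.compile(
    r"No usable routes to (?P<prefix>\d+)-(?P<addr>[0-9.]+)(?:@| at )(?P<net>[a-z0-9]+)"
)

def unreachable_peers(lines):
    """Count 'No usable routes' messages per peer, keyed as 'addr@net'."""
    peers = Counter()
    for line in lines:
        # finditer: a garbled line may contain the message more than once
        for m in ROUTE_RE.finditer(line):
            peers[f"{m.group('addr')}@{m.group('net')}"] += 1
    return peers
```

Feeding it the lines of /var/log/messages (for example `unreachable_peers(open("/var/log/messages", errors="replace"))`) should show whether a single client accounts for the flood.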

The same ll_ost thread (ll_ost_405, PID 5781) gets scheduled onto different CPUs, and each of them is reported as stuck in turn:
[root@boss34 ~]# grep stuck /var/log/messages
Mar 23 01:05:45 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:14:10 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:18:19 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:22:28 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:30:36 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:30:46 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:30:56 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:34:45 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:34:55 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:35:15 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:43:03 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:47:12 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:47:43 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:51:41 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:55:30 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 01:59:39 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:03:58 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:07:57 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:08:07 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:08:27 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:12:06 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:16:15 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:16:25 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:16:35 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:16:45 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:20:24 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:20:34 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:24:46 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:29:02 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:32:50 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:37:20 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:37:30 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:41:09 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:45:18 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:49:33 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:53:46 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]
Mar 23 02:57:45 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
Mar 23 03:01:56 boss34 kernel: BUG: soft lockup - CPU#15 stuck for 10s! [ll_ost_405:5781]
Mar 23 03:06:38 boss34 kernel: BUG: soft lockup - CPU#15 stuck for 10s! [ll_ost_405:5781]
Mar 23 03:10:15 boss34 kernel: BUG: soft lockup - CPU#15 stuck for 10s! [ll_ost_405:5781]
Mar 23 03:10:35 boss34 kernel: BUG: soft lockup - CPU#15 stuck for 10s! [ll_ost_405:5781]
Mar 23 03:14:21 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
....
Mar 23 05:27:09 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
Mar 23 05:27:29 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
Mar 23 05:31:18 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
Mar 23 05:35:37 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
....
Mar 23 06:04:30 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
Mar 23 06:04:40 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
Mar 23 06:05:00 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
Mar 23 06:08:39 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
Mar 23 06:09:09 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
....
Mar 23 06:42:22 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
Mar 23 06:46:00 boss34 kernel: BUG: soft lockup - CPU#7 stuck for 10s! [ll_ost_405:5781]
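
For a long log, the grep output above can be summarized per thread and CPU. The following Python sketch is one way to do it; count_lockups is a hypothetical helper, and the pattern assumes the "BUG: soft lockup" line format shown above (the closing bracket is optional so that truncated lines still match).

```python
import re
from collections import Counter

# Matches lines such as:
#   ... kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]?"
)

def count_lockups(lines):
    """Return a Counter keyed by (thread name, pid, cpu number)."""
    counts = Counter()
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            key = (m.group("comm"), int(m.group("pid")), int(m.group("cpu")))
            counts[key] += 1
    return counts

# Small demonstration on three lines taken from the output above.
sample = [
    "Mar 23 01:05:45 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]",
    "Mar 23 01:14:10 boss34 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [ll_ost_405:5781]",
    "Mar 23 01:18:19 boss34 kernel: BUG: soft lockup - CPU#11 stuck for 10s! [ll_ost_405:5781]",
]
for (comm, pid, cpu), n in sorted(count_lockups(sample).items()):
    print(f"{comm}:{pid} CPU#{cpu}: {n}")
```

If every entry carries the same thread name and PID, as in the output above, that supports the reading that one service thread is spinning and merely migrating between CPUs.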




The situation persisted until we restarted the affected Lustre client node. Is it possible to avoid this problem?



Best Regards
Lu Wang
--------------------------------------------------------------	  
Computing Center
IHEP						Office: Computing Center,123 
19B Yuquan Road				Tel: (+86) 10 88236012-607
P.O. Box 918-7				Fax: (+86) 10 8823 6839
Beijing 100049,China		Email: Lu.Wang at ihep.ac.cn							
--------------------------------------------------------------   				
                          




