[Lustre-discuss] How to evict a dead client?

huangql huangql at ihep.ac.cn
Wed Jul 7 00:13:32 PDT 2010


Dear all,

We are stuck with a problem: the OSS keeps trying to connect to a client that is dead, or whose IP address has changed, until we reboot that client. The OSS log shows the following messages:

Jul  7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.cLustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79 at tcp
Jul  7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79 at tcp
Jul  7 14:45:07 com01 last message repeated 35 times
Jul  7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_lauLustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79 at tcp
Jul  7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79 at tcp
Jul  7 14:45:07 com01 last message repeated 2 times
Jul  7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:91Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79 at tcp
Jul  7 14:45:07 com01 kernel: Lustre: 12180:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-202.122.37.79 at tcp
Jul  7 14:45:11 com01 last message repeated 188807 times
Jul  7 14:45:11 com01 kernel: BUG: soft lockup - CPU#15 stuck for 10s! [ll_ost_118:12180]
Jul  7 14:45:11 com01 kernel: CPU 15:
Jul  7 14:45:11 com01 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ldiskfs(U) crc16(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) i2c_core(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) parport_pc(U) lp(U) parport(U) ixgbe(U) pcspkr(U) shpchp(U) serio_raw(U) hpilo(U) sg(U) bnx2(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) usb_storage(U) lpfc(U) scsi_transport_fc(U) cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Jul  7 14:45:11 com01 kernel: Pid: 12180, comm: ll_ost_118 Tainted: G   M  2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
Jul  7 14:45:11 com01 kernel: RIP: 0010:[<ffffffff8006dce9>]  [<ffffffff8006dce9>] do_gettimeoffset_tsc+0x8/0x39
Jul  7 14:45:11 com01 kernel: RSP: 0018:ffff8102797b92c0  EFLAGS: 00000202
Jul  7 14:45:11 com01 kernel: RAX: 00000000000106a5 RBX: ffff8102797b9300 RCX: 00000000009ce3bd
Jul  7 14:45:11 com01 kernel: RDX: 00000000bfebfbff RSI: 0000000000000100 RDI: ffff8102797b9300
Jul  7 14:45:11 com01 kernel: RBP: 0000000000000733 R08: 0000000000000000 R09: 0000000000000800
Jul  7 14:45:11 com01 kernel: R10: ffffffff8867dc7b R11: ffffffff8867dc4e R12: ffffffff88677379
Jul  7 14:45:11 com01 kernel: R13: ffff8104f46af953 R14: 00000000000006ad R15: 00000000ffffffff
Jul  7 14:45:11 com01 kernel: FS:  00002ab88e27f220(0000) GS:ffff81061fcf78c0(0000) knlGS:0000000000000000
......
We also see a CPU getting stuck repeatedly:
[root at com01 ~]# grep CPU#5 /var/log/messages

Jul  7 04:28:59 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul  7 04:29:43 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul  7 04:30:03 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul  7 04:30:23 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul  7 04:30:45 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul  7 04:31:25 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul  7 04:32:52 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul  7 04:33:12 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
Jul  7 04:33:55 com01 kernel: BUG: soft lockup - CPU#5 stuck for 10s! [ll_ost_118:12180]
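The grep above matches only CPU#5, while the trace from 14:45 reports CPU#15 for the same thread, so it may be more useful to tally the reports per CPU and per thread. Below is a rough sketch in Python. The function name count_soft_lockups and the regular expression are assumptions modeled on the log lines quoted above, not part of Lustre or any standard tool.

```python
import re
from collections import Counter

# Pattern modeled on the "BUG: soft lockup" lines quoted above (an assumption;
# other kernel versions may format this message differently).
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def count_soft_lockups(lines):
    """Count soft lockup reports, keyed by (cpu, thread name, pid).

    `lines` is any iterable of syslog lines. finditer() is used so that a
    report is still counted when two messages are fused onto one line.
    """
    counts = Counter()
    for line in lines:
        for match in LOCKUP_RE.finditer(line):
            key = (
                int(match.group("cpu")),
                match.group("comm"),
                int(match.group("pid")),
            )
            counts[key] += 1
    return counts
```

It could be run over the whole log, for example with count_soft_lockups(open("/var/log/messages", errors="replace")), and the resulting counter printed to see whether the same thread appears on more than one CPU.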

I think the thread "ll_ost_118:12180" is in charge of the connection between the client 12345-202.122.37.79 at tcp and the OSS. We gave that client a different IP address, but the OSS did not recognize the change and still tries to connect to the original address. As a result, our monitoring raises alarms on and off: the monitor intermittently fails to reach the OSS (hostname com01) with the command "lctl ping com01", but after several seconds or minutes it works again. We also see a sawtooth pattern in the cpu_idle and cpu_wio graphs in Ganglia; please see the attachments.
Rebooting the dead client always makes the OSS work normally again, but we do not think this is reasonable behavior for the Lustre file system, especially in a case like this where the OSS keeps trying to connect to a non-existent IP address. Is there a command or some other method to evict a dead client or an unknown IP address manually?
We would really appreciate any help!

                                                                                                   
Best Regards
QiuLan Huang
2010-07-07
====================================================================
Computing Center, the Institute of High Energy Physics, China
Huang, Qiulan                        Tel: (+86) 10 8823 6012-604
P.O. Box 918-7                       Fax: (+86) 10 8823 6839
Beijing 100049  P.R. China           Email: huangql at ihep.ac.cn
=================================================================== 
                          
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cpu_idle.bmp
Type: application/octet-stream
Size: 2414778 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20100707/7a492170/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cpu_wio.bmp
Type: application/octet-stream
Size: 235078 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20100707/7a492170/attachment-0001.obj>

