[Lustre-discuss] Kernel bug in combination with bonding

Tom Woezel twoezel at it.dcs.ch
Tue Jun 16 03:57:27 PDT 2009


Dear all,

Currently we are running a Lustre environment with 2 servers for the MGS
and MDTs and 3 OSDs, all Sun x4140 with RedHat EL5 and Lustre 1.6.7.
Recently we decided to go for bonding on the 3 OSDs. We bonded all 4
interfaces together, and until now the configuration had been working.
Today I noticed that one of the OSDs is showing weird behavior and some
of the clients are having problems connecting to the filesystem. From
what I have learned so far, this is a known kernel bug with this kernel
version (http://bugs.centos.org/view.php?id=3095), and I couldn't find
a solution for it.

I was wondering if any of you has encountered a similar problem and if  
so, how did you fix it?

Current Kernel is:

[root@sososd1 ~]# uname -a
Linux sososd1 2.6.18-92.1.17.el5_lustre.1.6.7smp #1 SMP Mon Feb 9 19:56:55 MST 2009 x86_64 x86_64 x86_64 GNU/Linux

The bonding configuration:

[root@sososd1 ~]# cat /etc/modprobe.conf
alias eth0 forcedeth
alias eth1 forcedeth
alias eth2 forcedeth
alias eth3 forcedeth
alias bond0 bonding
options bond0 miimon=100 mode=4
alias scsi_hostadapter aacraid
alias scsi_hostadapter1 sata_nv
alias scsi_hostadapter2 qla2xxx
alias scsi_hostadapter3 usb-storage
options lnet networks="tcp(bond0)"

[root@sososd1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=xxx.xxx.xxx.xxx
NETMASK=xxx.xxx.xxx.xxx
NETWORK=xxx.xxx.xxx.xxx
BROADCAST=xxx.xxx.xxx.xxx
GATEWAY=xxx.xxx.xxx.xxx
ONBOOT=yes
BOOTPROTO=none
USERCTL=no

And each of the interfaces is configured like this:

[root@sososd1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# nVidia Corporation MCP55 Ethernet
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MASTER=bond0
SLAVE=yes
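
For comparing this against another setup, the state of the bond as the
kernel sees it can be read from /proc/net/bonding/bond0. The following
is only a minimal sketch: bond_summary is a made-up helper name, and it
assumes the field labels the bonding driver normally prints ("Bonding
Mode", "Slave Interface", "MII Status"), which may differ between
driver versions.

```shell
# Hypothetical helper: print the bonding mode and the MII status lines
# of the bond and of each slave.
# Argument: path to the bonding status file (defaults to bond0's).
bond_summary() {
    status_file="${1:-/proc/net/bonding/bond0}"
    # Keep only the lines that name the mode, a slave, or a link state.
    grep -E '^(Bonding Mode|Slave Interface|MII Status):' "$status_file"
}
```

Calling bond_summary without an argument on one of the OSDs should list
the mode followed by one "Slave Interface" / "MII Status" pair per
bonded interface.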

And this is an extract from the log file:

Jun 16 04:33:38 sososd1 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [bond0:3914]
Jun 16 04:33:38 sososd1 kernel: CPU 2:
Jun 16 04:33:38 sososd1 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) bonding(U) dm_rdac(U) dm_round_robin(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) i2c_nforce2(U) sr_mod(U) cdrom(U) pata_acpi(U) i2c_core(U) forcedeth(U) sg(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) qla2xxx(U) scsi_transport_fc(U) sata_nv(U) libata(U) shpchp(U) aacraid(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Jun 16 04:33:38 sososd1 kernel: Pid: 3914, comm: bond0 Tainted: G      2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Jun 16 04:33:38 sososd1 kernel: RIP: 0010:[<ffffffff80064b4c>]  [<ffffffff80064b4c>] .text.lock.spinlock+0x2/0x30
Jun 16 04:33:38 sososd1 kernel: RSP: 0018:ffff81012b993d98  EFLAGS: 00000286
Jun 16 04:33:38 sososd1 kernel: RAX: 0000000000000001 RBX: ffff81012b97a080 RCX: 0000000000000004
Jun 16 04:33:38 sososd1 kernel: RDX: ffff81012b97a000 RSI: ffff81012b97a080 RDI: ffff81012b97a168
Jun 16 04:33:38 sososd1 kernel: RBP: ffff81012b993d10 R08: 0000000000000000 R09: ffff810226ad5d28
Jun 16 04:33:38 sososd1 kernel: R10: 000000fe0000009a R11: ffff810227efcae0 R12: ffffffff8005dc8e
Jun 16 04:33:38 sososd1 kernel: R13: ffff81010e39d81e R14: ffffffff80076fd7 R15: ffff81012b993d10
Jun 16 04:33:38 sososd1 kernel: FS:  00002abdd36dc220(0000) GS:ffff810104159240(0000) knlGS:00000000f7f928d0
Jun 16 04:33:38 sososd1 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jun 16 04:33:38 sososd1 kernel: CR2: 00002aaaac009000 CR3: 0000000000201000 CR4: 00000000000006e0
Jun 16 04:33:38 sososd1 kernel:
Jun 16 04:33:38 sososd1 kernel: Call Trace:
Jun 16 04:33:38 sososd1 kernel:  <IRQ>  [<ffffffff883f0477>] :bonding:ad_rx_machine+0x20/0x502
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff883f0aa2>] :bonding:bond_3ad_lacpdu_recv+0xc1/0x1fc
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff80046717>] try_to_wake_up+0x407/0x418
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff80020139>] netif_receive_skb+0x330/0x3ae
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8020c75b>] pci_mmcfg_read+0x4a/0xbb
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff800302f5>] process_backlog+0x84/0xe1
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff883f0e76>] :bonding:bond_3ad_state_machine_handler+0x0/0x84a
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8000c52c>] net_rx_action+0xa4/0x1a4
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff80011ec2>] __do_softirq+0x5e/0xd6
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff80154d15>] end_msi_irq_w_maskbit+0xf/0x1c
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8006c67e>] do_softirq+0x2c/0x85
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8006c506>] do_IRQ+0xec/0xf5
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8005d615>] ret_from_intr+0x0/0xa
Jun 16 04:33:38 sososd1 kernel:  <EOI>  [<ffffffff800649d8>] _spin_lock+0x3/0xa
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff883f0477>] :bonding:ad_rx_machine+0x20/0x502
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff883f0f4a>] :bonding:bond_3ad_state_machine_handler+0xd4/0x84a
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8004cd5b>] run_workqueue+0x94/0xe4
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff80049666>] worker_thread+0x0/0x122
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8009dba2>] keventd_create_kthread+0x0/0xc4
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff80049756>] worker_thread+0xf0/0x122
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8008abb9>] default_wake_function+0x0/0xe
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8009dba2>] keventd_create_kthread+0x0/0xc4
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8009dba2>] keventd_create_kthread+0x0/0xc4
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff80032409>] kthread+0xfe/0x132
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8009dba2>] keventd_create_kthread+0x0/0xc4
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8003230b>] kthread+0x0/0x132
Jun 16 04:33:38 sososd1 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
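
To see how often this happens and which task is reported as stuck, the
"BUG: soft lockup" lines can be counted per task name. This is a rough
sketch only: soft_lockup_counts is a made-up helper name, and it
assumes each report sits on a single line in the log file, ending in
"[task:pid]" as in the extract above.

```shell
# Hypothetical helper: count soft-lockup reports per task name.
# Argument: path to a syslog file (e.g. /var/log/messages).
soft_lockup_counts() {
    # Select the report lines, reduce each to the task name inside
    # the trailing [task:pid], then count identical names.
    grep 'BUG: soft lockup' "$1" |
        sed -e 's/.*\[\([^]:]*\):[0-9]*\].*/\1/' |
        sort | uniq -c
}
```

Running soft_lockup_counts /var/log/messages on the affected OSD prints
one line per task name, prefixed by the number of reports for it.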


A restart of the network didn't help, and the machine did not respond
on the console afterwards. After a reboot of the machine the error was
gone, but from what I have found on the web it will appear again.

Thanks in advance for any help.

Kind regards
-----------------------------------------------------------------
Tom Woezel                | DCS Contractor in DMO/OTS/SOS Group
Office 2001 ESO/IPP       |   System Administrator
Tel.:+49-89-32006-184     |
Fax.:+49-89-32006-677     | Address:
                           |   European Southern Observatory
mailto:twoezel at it.dcs.ch  |   Karl-Schwarzschild-Strasse 2
                           | D-85748 Garching bei Munchen, Germany
web:  http://www.dcs.ch   |   http://www.eso.org
-----------------------------------------------------------------


