[Lustre-discuss] Kernel bug in combination with bonding
Tom Woezel
twoezel at it.dcs.ch
Tue Jun 16 03:57:27 PDT 2009
Dear all,
We currently run a Lustre environment with 2 servers for the MGS
and MDTs and 3 OSDs, all Sun x4140 with RedHat EL5 and Lustre 1.6.7.
Recently we decided to use bonding on the 3 OSDs. We bonded all 4
interfaces together, and so far the configuration has been working.
Today I noticed that one of the OSDs is showing weird behavior and
some of the clients are having problems connecting to the filesystem.
From what I have learned so far, this is a known kernel bug in this
kernel version (http://bugs.centos.org/view.php?id=3095), and I
couldn't find a solution for it.
I was wondering whether any of you has encountered a similar problem
and, if so, how you fixed it.
The current kernel is:
[root@sososd1 ~]# uname -a
Linux sososd1 2.6.18-92.1.17.el5_lustre.1.6.7smp #1 SMP Mon Feb 9 19:56:55 MST 2009 x86_64 x86_64 x86_64 GNU/Linux
The bonding configuration:
[root@sososd1 ~]# cat /etc/modprobe.conf
alias eth0 forcedeth
alias eth1 forcedeth
alias eth2 forcedeth
alias eth3 forcedeth
alias bond0 bonding
options bond0 miimon=100 mode=4
alias scsi_hostadapter aacraid
alias scsi_hostadapter1 sata_nv
alias scsi_hostadapter2 qla2xxx
alias scsi_hostadapter3 usb-storage
options lnet networks="tcp(bond0)"
[root@sososd1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=xxx.xxx.xxx.xxx
NETMASK=xxx.xxx.xxx.xxx
NETWORK=xxx.xxx.xxx.xxx
BROADCAST=xxx.xxx.xxx.xxx
GATEWAY=xxx.xxx.xxx.xxx
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
And each of the interfaces is configured like this:
[root@sososd1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
# nVidia Corporation MCP55 Ethernet
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MASTER=bond0
SLAVE=yes
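
As a side note on the "options bond0 miimon=100 mode=4" line above: in the
Linux bonding driver, mode 4 is 802.3ad (LACP) link aggregation, and miimon
is the link-monitoring interval in milliseconds. The following is a minimal
Python sketch, not part of the original setup, that reads the bonding
options out of modprobe.conf-style text and names the mode. The names
parse_bond_options, describe_mode and SAMPLE are hypothetical and only serve
the illustration; SAMPLE copies three lines from the configuration shown
above.

```python
import shlex

# Numeric bonding modes and their names in the Linux bonding driver.
BONDING_MODES = {
    0: "balance-rr",
    1: "active-backup",
    2: "balance-xor",
    3: "broadcast",
    4: "802.3ad",
    5: "balance-tlb",
    6: "balance-alb",
}


def parse_bond_options(conf_text, device="bond0"):
    """Return the key=value options set for `device` in modprobe.conf text."""
    options = {}
    for line in conf_text.splitlines():
        # shlex handles quoted values and drops '#' comments.
        fields = shlex.split(line, comments=True)
        if len(fields) >= 2 and fields[0] == "options" and fields[1] == device:
            for field in fields[2:]:
                key, sep, value = field.partition("=")
                if sep:
                    options[key] = value
    return options


def describe_mode(options):
    """Translate the 'mode' option into the bonding mode name."""
    mode = options.get("mode")
    if mode is None:
        return BONDING_MODES[0]  # the driver defaults to balance-rr
    if mode.isdigit():
        return BONDING_MODES.get(int(mode), "unknown")
    return mode  # already given by name


# Hypothetical sample: three lines taken from the modprobe.conf above.
SAMPLE = """alias bond0 bonding
options bond0 miimon=100 mode=4
options lnet networks="tcp(bond0)"
"""

if __name__ == "__main__":
    opts = parse_bond_options(SAMPLE)
    print(opts, describe_mode(opts))
```

On a live system the state the driver actually negotiated can be read from
/proc/net/bonding/bond0, which is a better source than the configuration
file when diagnosing a problem.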
And this is an extract from the log file:
Jun 16 04:33:38 sososd1 kernel: BUG: soft lockup - CPU#2 stuck for 10s! [bond0:3914]
Jun 16 04:33:38 sososd1 kernel: CPU 2:
Jun 16 04:33:38 sososd1 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) autofs4(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) bonding(U) dm_rdac(U) dm_round_robin(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) joydev(U) i2c_nforce2(U) sr_mod(U) cdrom(U) pata_acpi(U) i2c_core(U) forcedeth(U) sg(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) usb_storage(U) qla2xxx(U) scsi_transport_fc(U) sata_nv(U) libata(U) shpchp(U) aacraid(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Jun 16 04:33:38 sososd1 kernel: Pid: 3914, comm: bond0 Tainted: G 2.6.18-92.1.17.el5_lustre.1.6.7smp #1
Jun 16 04:33:38 sososd1 kernel: RIP: 0010:[<ffffffff80064b4c>] [<ffffffff80064b4c>] .text.lock.spinlock+0x2/0x30
Jun 16 04:33:38 sososd1 kernel: RSP: 0018:ffff81012b993d98 EFLAGS: 00000286
Jun 16 04:33:38 sososd1 kernel: RAX: 0000000000000001 RBX: ffff81012b97a080 RCX: 0000000000000004
Jun 16 04:33:38 sososd1 kernel: RDX: ffff81012b97a000 RSI: ffff81012b97a080 RDI: ffff81012b97a168
Jun 16 04:33:38 sososd1 kernel: RBP: ffff81012b993d10 R08: 0000000000000000 R09: ffff810226ad5d28
Jun 16 04:33:38 sososd1 kernel: R10: 000000fe0000009a R11: ffff810227efcae0 R12: ffffffff8005dc8e
Jun 16 04:33:38 sososd1 kernel: R13: ffff81010e39d81e R14: ffffffff80076fd7 R15: ffff81012b993d10
Jun 16 04:33:38 sososd1 kernel: FS: 00002abdd36dc220(0000) GS:ffff810104159240(0000) knlGS:00000000f7f928d0
Jun 16 04:33:38 sososd1 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Jun 16 04:33:38 sososd1 kernel: CR2: 00002aaaac009000 CR3: 0000000000201000 CR4: 00000000000006e0
Jun 16 04:33:38 sososd1 kernel:
Jun 16 04:33:38 sososd1 kernel: Call Trace:
Jun 16 04:33:38 sososd1 kernel: <IRQ> [<ffffffff883f0477>] :bonding:ad_rx_machine+0x20/0x502
Jun 16 04:33:38 sososd1 kernel: [<ffffffff883f0aa2>] :bonding:bond_3ad_lacpdu_recv+0xc1/0x1fc
Jun 16 04:33:38 sososd1 kernel: [<ffffffff80046717>] try_to_wake_up+0x407/0x418
Jun 16 04:33:38 sososd1 kernel: [<ffffffff80020139>] netif_receive_skb+0x330/0x3ae
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8020c75b>] pci_mmcfg_read+0x4a/0xbb
Jun 16 04:33:38 sososd1 kernel: [<ffffffff800302f5>] process_backlog+0x84/0xe1
Jun 16 04:33:38 sososd1 kernel: [<ffffffff883f0e76>] :bonding:bond_3ad_state_machine_handler+0x0/0x84a
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8000c52c>] net_rx_action+0xa4/0x1a4
Jun 16 04:33:38 sososd1 kernel: [<ffffffff80011ec2>] __do_softirq+0x5e/0xd6
Jun 16 04:33:38 sososd1 kernel: [<ffffffff80154d15>] end_msi_irq_w_maskbit+0xf/0x1c
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8006c67e>] do_softirq+0x2c/0x85
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8006c506>] do_IRQ+0xec/0xf5
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8005d615>] ret_from_intr+0x0/0xa
Jun 16 04:33:38 sososd1 kernel: <EOI> [<ffffffff800649d8>] _spin_lock+0x3/0xa
Jun 16 04:33:38 sososd1 kernel: [<ffffffff883f0477>] :bonding:ad_rx_machine+0x20/0x502
Jun 16 04:33:38 sososd1 kernel: [<ffffffff883f0f4a>] :bonding:bond_3ad_state_machine_handler+0xd4/0x84a
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8004cd5b>] run_workqueue+0x94/0xe4
Jun 16 04:33:38 sososd1 kernel: [<ffffffff80049666>] worker_thread+0x0/0x122
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8009dba2>] keventd_create_kthread+0x0/0xc4
Jun 16 04:33:38 sososd1 kernel: [<ffffffff80049756>] worker_thread+0xf0/0x122
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8008abb9>] default_wake_function+0x0/0xe
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8009dba2>] keventd_create_kthread+0x0/0xc4
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8009dba2>] keventd_create_kthread+0x0/0xc4
Jun 16 04:33:38 sososd1 kernel: [<ffffffff80032409>] kthread+0xfe/0x132
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8009dba2>] keventd_create_kthread+0x0/0xc4
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8003230b>] kthread+0x0/0x132
Jun 16 04:33:38 sososd1 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
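
Since the lockup is expected to come back, it may help to watch the logs for
it. The following is a minimal Python sketch, under the assumption that the
kernel message keeps the form "BUG: soft lockup - CPU#N stuck for Ns!
[task:pid]" seen in the extract above. The names SOFT_LOCKUP,
find_soft_lockups and SAMPLE are hypothetical; SAMPLE reproduces the first
log entry from the extract, including the line wrap introduced by the mail
client.

```python
import re

# Matches the soft lockup report format seen in the log extract above.
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<seconds>\d+)s! "
    r"\[(?P<task>[^\]:]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft lockup report found in `log_text`."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [
        {
            "cpu": int(m.group("cpu")),
            "seconds": int(m.group("seconds")),
            "task": m.group("task"),
            "pid": int(m.group("pid")),
        }
        for m in SOFT_LOCKUP.finditer(flat)
    ]


# Hypothetical sample: the first entry of the extract, wrapped as mailed.
SAMPLE = (
    "Jun 16 04:33:38 sososd1 kernel: BUG: soft lockup - CPU#2 stuck for\n"
    "10s! [bond0:3914]\n"
    "Jun 16 04:33:38 sososd1 kernel: CPU 2:\n"
)

if __name__ == "__main__":
    for hit in find_soft_lockups(SAMPLE):
        print(hit)
```

Filtering the result on task == "bond0" would separate lockups in the
bonding worker thread from unrelated ones.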
Restarting the network did not help, and the machine stopped responding
on the console afterwards. After a reboot the error was gone, but from
what I have found on the web it will appear again.
Thanks in advance for any help.
Kind regards
-----------------------------------------------------------------
Tom Woezel | DCS Contractor in DMO/OTS/SOS Group
Office 2001 ESO/IPP | System Administrator
Tel.:+49-89-32006-184 |
Fax.:+49-89-32006-677 | Address:
| European Southern Observatory
mailto:twoezel at it.dcs.ch | Karl-Schwarzschild-Strasse 2
| D-85748 Garching bei Munchen, Germany
web: http://www.dcs.ch | http://www.eso.org
-----------------------------------------------------------------