[Lustre-discuss] LBUG in lustre 1.8.1 when client mounts something with bind option
Daniel Basabe
dbasabe@soporte.cti.csic.es
Thu Aug 20 03:43:36 PDT 2009
Hi
I recently upgraded Lustre from 1.6.6 to 1.8.1; before the upgrade I never had
any problems with Lustre.
My clients mount the Lustre filesystem under /clusterha, and at that point
everything works fine. But when I then try, for example, this:
# mount -o bind,rw /clusterha/home /home
it triggers an LBUG on the MGS:
LustreError: 5164:0:(pack_generic.c:655:lustre_shrink_reply_v2())
ASSERTION(msg->lm_bufcount > segment) failed
LustreError: 5164:0:(pack_generic.c:655:lustre_shrink_reply_v2()) LBUG
Lustre: 5164:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for
process 5164
ll_mdt_18 R running task 0 5164 1 5165 5163 (L-TLB)
0000000000000000 ffffffff887f6d5a ffff8104173e0280 ffffffff887f646e
ffffffff887f6462 0000000000000086 0000000000000002 ffffffff801616e5
0000000000000001 0000000000000000 ffffffff802f6aa0 0000000000000000
Call Trace:
[<ffffffff8009daf8>] autoremove_wake_function+0x9/0x2e
[<ffffffff80088819>] __wake_up_common+0x3e/0x68
[<ffffffff80088819>] __wake_up_common+0x3e/0x68
[<ffffffff8002e6ba>] __wake_up+0x38/0x4f
[<ffffffff800a540a>] kallsyms_lookup+0xc2/0x17b
[<ffffffff800a540a>] kallsyms_lookup+0xc2/0x17b
[<ffffffff800a540a>] kallsyms_lookup+0xc2/0x17b
[<ffffffff800a540a>] kallsyms_lookup+0xc2/0x17b
[<ffffffff8006bb5d>] printk_address+0x9f/0xab
[<ffffffff8008f800>] printk+0x8/0xbd
[<ffffffff8008f84a>] printk+0x52/0xbd
[<ffffffff800a2e08>] module_text_address+0x33/0x3c
[<ffffffff8009c088>] kernel_text_address+0x1a/0x26
[<ffffffff8006b843>] dump_trace+0x211/0x23a
[<ffffffff8006b8a0>] show_trace+0x34/0x47
[<ffffffff8006b9a5>] _show_stack+0xdb/0xea
[<ffffffff887ebada>] :libcfs:lbug_with_loc+0x7a/0xd0
[<ffffffff887f3c70>] :libcfs:tracefile_init+0x0/0x110
[<ffffffff8894c218>] :ptlrpc:lustre_shrink_reply_v2+0xa8/0x240
[<ffffffff88c53529>] :mds:mds_getattr_lock+0xc59/0xce0
[<ffffffff8894aea4>] :ptlrpc:lustre_msg_add_version+0x34/0x110
[<ffffffff8883c923>] :lnet:lnet_ni_send+0x93/0xd0
[<ffffffff8883ed23>] :lnet:lnet_send+0x973/0x9a0
[<ffffffff88c4dfca>] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
[<ffffffff88c59a76>] :mds:mds_intent_policy+0x636/0xc10
[<ffffffff8890d6f6>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
[<ffffffff8890ad46>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb30
[<ffffffff88926acf>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
[<ffffffff88889e48>] :obdclass:lustre_hash_add+0x218/0x2e0
[<ffffffff8892f530>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x83d
[<ffffffff8892d669>] :ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
[<ffffffff88c57630>] :mds:mds_handle+0x4080/0x4cb0
[<ffffffff80148d4f>] __next_cpu+0x19/0x28
[<ffffffff80148d4f>] __next_cpu+0x19/0x28
[<ffffffff80088f32>] find_busiest_group+0x20d/0x621
[<ffffffff8894fa15>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
[<ffffffff80089d89>] enqueue_task+0x41/0x56
[<ffffffff8895472d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
[<ffffffff88956e67>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
[<ffffffff8003dc3f>] lock_timer_base+0x1b/0x3c
[<ffffffff80088819>] __wake_up_common+0x3e/0x68
[<ffffffff8895a908>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
[<ffffffff8008a3ef>] default_wake_function+0x0/0xe
[<ffffffff800b48dd>] audit_syscall_exit+0x327/0x342
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff889596f0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
[<ffffffff8005dfa7>] child_rip+0x0/0x11
LustreError: dumping log to /tmp/lustre-log.1250760001.5164
Lustre: 0:0:(watchdog.c:181:lcw_cb()) Watchdog triggered for pid 5164: it was
inactive for 200.00s
Lustre: 0:0:(linux-debug.c:264:libcfs_debug_dumpstack()) showing stack for
process 5164
ll_mdt_18 D ffff81000102df80 0 5164 1 5165 5163 (L-TLB)
ffff810411625810 0000000000000046 0000000000000000 0000000000000000
ffff8104116257d0 0000000000000009 ffff810413360080 ffff81042fe9d100
00008b3a197b862f 0000000000000ed5 ffff810413360268 000000050000028f
Call Trace:
[<ffffffff8008a3ef>] default_wake_function+0x0/0xe
[<ffffffff887ebb26>] :libcfs:lbug_with_loc+0xc6/0xd0
[<ffffffff887f3c70>] :libcfs:tracefile_init+0x0/0x110
[<ffffffff8894c218>] :ptlrpc:lustre_shrink_reply_v2+0xa8/0x240
[<ffffffff88c53529>] :mds:mds_getattr_lock+0xc59/0xce0
[<ffffffff8894aea4>] :ptlrpc:lustre_msg_add_version+0x34/0x110
[<ffffffff8883c923>] :lnet:lnet_ni_send+0x93/0xd0
[<ffffffff8883ed23>] :lnet:lnet_send+0x973/0x9a0
[<ffffffff88c4dfca>] :mds:fixup_handle_for_resent_req+0x5a/0x2c0
[<ffffffff88c59a76>] :mds:mds_intent_policy+0x636/0xc10
[<ffffffff8890d6f6>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3a0
[<ffffffff8890ad46>] :ptlrpc:ldlm_lock_enqueue+0x186/0xb30
[<ffffffff88926acf>] :ptlrpc:ldlm_export_lock_get+0x6f/0xe0
[<ffffffff88889e48>] :obdclass:lustre_hash_add+0x218/0x2e0
[<ffffffff8892f530>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x83d
[<ffffffff8892d669>] :ptlrpc:ldlm_handle_enqueue+0xc19/0x1210
[<ffffffff88c57630>] :mds:mds_handle+0x4080/0x4cb0
[<ffffffff80148d4f>] __next_cpu+0x19/0x28
[<ffffffff80148d4f>] __next_cpu+0x19/0x28
[<ffffffff80088f32>] find_busiest_group+0x20d/0x621
[<ffffffff8894fa15>] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
[<ffffffff80089d89>] enqueue_task+0x41/0x56
[<ffffffff8895472d>] :ptlrpc:ptlrpc_check_req+0x1d/0x110
[<ffffffff88956e67>] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1160
[<ffffffff8003dc3f>] lock_timer_base+0x1b/0x3c
[<ffffffff80088819>] __wake_up_common+0x3e/0x68
[<ffffffff8895a908>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
[<ffffffff8008a3ef>] default_wake_function+0x0/0xe
[<ffffffff800b48dd>] audit_syscall_exit+0x327/0x342
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff889596f0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
[<ffffffff8005dfa7>] child_rip+0x0/0x11
LustreError: dumping log to /tmp/lustre-log.1250760201.5164
Lustre: 5162:0:(service.c:786:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
  req@ffff81040f602400 x1311420275108778/t0 o101->3f31386b-70e3-8c4f-6ecf-83adfc123156@NET_0x20000c0a80a03_UUID:0/0 lens 544/600 e 24 to 0 dl 1250760601 ref 2 fl Interpret:/0/0 rc 0/0
After that the client hangs. With 1.6.6 the same bind mount worked fine.
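Schematically, the sequence on the client is just the normal Lustre mount followed by
the bind mount; roughly like these /etc/fstab entries (the MGS NIDs are taken from the
target parameters shown further down, and the entries are only meant to illustrate the
sequence, not copied from my real fstab):

# Lustre client mount (MGS failover pair), then the bind mount that triggers the LBUG
10.0.0.200@o2ib0:10.0.0.201@o2ib0:/shared   /clusterha   lustre   defaults,_netdev   0 0
/clusterha/home                             /home        none     bind,rw            0 0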
Another difference from the previous setup is that in the new one I have created a
link aggregation (bonding) on the tcp0 device, on both the MGS side and the client
side:
# cat /etc/modprobe.conf
alias eth0 bnx2
alias eth1 bnx2
alias scsi_hostadapter cciss
alias scsi_hostadapter1 ata_piix
alias scsi_hostadapter2 qla2xxx
alias bond0 bonding
options bond0 mode=4
alias ib0 ib_ipoib
alias ib1 ib_ipoib
options lnet accept=all networks=o2ib0(ib0),tcp0(bond0)
alias net-pf-27 ib_sdp
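The bond itself is configured in the usual RHEL/CentOS way; schematically something
like the following (the addresses and file contents here are just an example of the
layout, not my exact files):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.10.200
NETMASK=255.255.255.0

# /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 is identical apart from DEVICE)
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes

Once the modules are loaded, "lctl list_nids" can be run on each node to check that
both the o2ib0 and the tcp0 NID are visible to LNET.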
My current configuration has three OSTs connected to the MGS over InfiniBand (o2ib)
and TCP Ethernet (bond0):
MGS:
Reading CONFIGS/mountdata
Read previous values:
Target: shared-MDT0000
Index: 0
Lustre FS: shared
Mount type: ldiskfs
Flags: 0x5
(MDT MGS )
Persistent mount opts: errors=remount-ro,iopen_nopriv,user_xattr
Parameters: failover.node=10.0.0.200@o2ib0,192.168.10.200@tcp0
mdt.group_upcall=/usr/sbin/l_getgroups
OST 0:
Target: shared-OST0000
Index: 0
Lustre FS: shared
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: failover.node=10.0.0.7@o2ib0,192.168.10.7@tcp0
mgsnode=10.0.0.201@o2ib,192.168.10.201@tcp
mgsnode=10.0.0.200@o2ib,192.168.10.200@tcp
OST 1:
Target: shared-OST0001
Index: 1
Lustre FS: shared
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: failover.node=10.0.0.6@o2ib0,192.168.10.6@tcp0
mgsnode=10.0.0.201@o2ib,192.168.10.201@tcp
mgsnode=10.0.0.200@o2ib,192.168.10.200@tcp
OST 2:
Target: shared-OST0002
Index: 2
Lustre FS: shared
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: failover.node=10.0.0.201@o2ib0,192.168.10.201@tcp0
mgsnode=10.0.0.201@o2ib,192.168.10.201@tcp
mgsnode=10.0.0.200@o2ib,192.168.10.200@tcp
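For reference, the per-target information above is the stored mount data of each
device, printed with a read-only command of roughly this form (the device path is a
placeholder):

# print the on-disk Lustre configuration of a target without modifying it
tunefs.lustre --print /dev/<target-device>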
I am attaching the dump log.
Does anyone know what is happening?
Thanks.
Regards.
--
Daniel Basabe del Pino
------------------------
HPC Systems Administrator
BULL / Secretaría General Adjunta de Informática CSIC
Tel: 915642963
Ext: 272
(Attachment: lustre-log.1250760001.5164, 1052892 bytes:
<http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20090820/229b3364/attachment.obj>)