[Lustre-discuss] [RESOLVED] Strange MDS Problem + Resolution

Aaron Knister aaron.knister at gmail.com
Sun Sep 27 15:46:14 PDT 2009


I wanted to post this here so that if anybody else stumbles across
this problem they don't spend hours banging their head against a
brick wall. I was helping with a Lustre setup that kept crashing. The
Lustre filesystem would hang, and a single MDS service thread
(ll_mdt_[0-9]*) would be pegged at 100% CPU. It turned out there were
on-disk inconsistencies on the MDT, left behind when the MDS crashed
after running out of memory. After many hours of attempted debugging,
a simple fsck of the MDT fixed the issue. We didn't think the problem
could be fixed by a simple fsck... but it makes sense.
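
For the record, the repair itself was just a standard offline check of
the MDT's backing device. A rough sketch of the procedure (the device
path and mount point below are examples, not our actual ones; use the
Lustre-patched e2fsprogs so e2fsck understands the ldiskfs features):

    # stop the MDT so its backing device is quiescent
    umount /mnt/mdt                  # example mount point
    # force a full check/repair of the MDT's ldiskfs device
    e2fsck -f /dev/MDTDEV            # example device; add -p or -y to auto-fix
    # bring the MDT back online
    mount -t lustre /dev/MDTDEV /mnt/mdt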

Here's the call trace:

BUG: soft lockup - CPU#0 stuck for 10s! [ll_mdt_26:12829]
CPU 0:
Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U)
lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U)
ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) ib_sdp(U)
rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U)
ib_sa(U) ib_uverbs(U) ib_umad(U) iw_cxgb3(U) cxgb3(U) ib_ipath(U)
mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U) ib_core(U) crc16(U)
ipmi_devintf(U) mptctl(U) mptbase(U) ipmi_si(U) ipmi_msghandler(U)
dell_rbu(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U)
sunrpc(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) dm_multipath(U) video(U)
sbs(U) backlight(U) i2c_ec(U) i2c_core(U) button(U) battery(U)
asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U)
joydev(U) pata_acpi(U) ata_piix(U) libata(U) sr_mod(U) sg(U) shpchp(U)
ide_cd(U) i5000_edac(U) bnx2(U) serio_raw(U) edac_mc(U) cdrom(U)
pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U)
usb_storage(U) megaraid_sas(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U)
ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Pid: 12829, comm: ll_mdt_26 Tainted: G      2.6.18-92.1.10.el5_lustre.1.6.6smp #1
RIP: 0010:[<ffffffff887ff8bc>]  [<ffffffff887ff8bc>] :ldiskfs:do_split+0x3ec/0x560
RSP: 0018:ffff8103f4fab470  EFLAGS: 00000206
RAX: 0000000000000000 RBX: 0000000000000024 RCX: 0000000000000000
RDX: 0000000000000024 RSI: ffff8103aa719bb0 RDI: ffff8103aa719800
RBP: ffff8103fdd50d30 R08: 383030322e786e39 R09: 0000000031323730
R10: 000000006a3ef844 R11: ffff8103aa719cf8 R12: ffff81018bb81f70
R13: ffff81017ad46f70 R14: ffff810093d3cc10 R15: 0000000000000000
FS:  00002b55d25bc220(0000) GS:ffffffff803eb000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000f1bff000 CR3: 00000004159d4000 CR4: 00000000000006e0

Call Trace:
  [<ffffffff88800395>] :ldiskfs:ldiskfs_add_entry+0x4f5/0x980
  [<ffffffff8006d8f0>] do_gettimeofday+0x50/0x92
  [<ffffffff88800e36>] :ldiskfs:ldiskfs_add_nondir+0x26/0x90
  [<ffffffff88801756>] :ldiskfs:ldiskfs_create+0xf6/0x140
  [<ffffffff888802ff>] :fsfilt_ldiskfs:fsfilt_ldiskfs_start+0x55f/0x630
  [<ffffffff8003a049>] vfs_create+0xe6/0x158
  [<ffffffff88b10453>] :mds:mds_open+0x15a3/0x332e
  [<ffffffff884c30e8>] :lvfs:entry_set_group_info+0xd8/0x2c0
  [<ffffffff884c33fb>] :lvfs:alloc_entry+0x12b/0x140
  [<ffffffff88666434>] :ko2iblnd:kiblnd_check_sends+0x644/0x7f0
  [<ffffffff88546031>] :obdclass:class_handle2object+0xd1/0x160
  [<ffffffff885a619e>] :ptlrpc:lock_res_and_lock+0xbe/0xe0
  [<ffffffff88aed889>] :mds:mds_reint_rec+0x1d9/0x2b0
  [<ffffffff88b14143>] :mds:mds_open_unpack+0x2f3/0x410
  [<ffffffff88ae08da>] :mds:mds_reint+0x35a/0x420
  [<ffffffff88adef62>] :mds:fixup_handle_for_resent_req+0x52/0x200
  [<ffffffff88ae492c>] :mds:mds_intent_policy+0x48c/0xc40
  [<ffffffff885db765>] :ptlrpc:ptlrpc_prep_set+0x1f5/0x2a0
  [<ffffffff885ab926>] :ptlrpc:ldlm_lock_enqueue+0x186/0x990
  [<ffffffff885a7a24>] :ptlrpc:ldlm_lock_remove_from_lru+0x74/0xe0
  [<ffffffff885cd5c0>] :ptlrpc:ldlm_server_completion_ast+0x0/0x5c0
  [<ffffffff885cae85>] :ptlrpc:ldlm_handle_enqueue+0xca5/0x12a0
  [<ffffffff885cdb80>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x6b2
  [<ffffffff88ae9115>] :mds:mds_handle+0x4035/0x4cf0
  [<ffffffff80143a09>] __next_cpu+0x19/0x28
  [<ffffffff80089ab6>] find_busiest_group+0x20d/0x621



