[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found

Alex Zhuravlev Alex.Zhuravlev at sun.com
Thu Jun 5 06:04:28 PDT 2008


https://bugzilla.lustre.org/show_bug.cgi?id=15932

thanks, Alex

Aaron Knister wrote:
> For my own curiosity's sake, is there a bugzilla id associated with it? 
> I'm curious to know what the problem is/was.
> 
> On Jun 5, 2008, at 8:50 AM, Alex Zhuravlev wrote:
> 
>> we've been investigating an issue which could be the source of your problem.
>> fortunately, we can reproduce it.
>>
>> thanks, Alex
>>
>> Aaron Knister wrote:
>>> Unfortunately the monkeys maintaining this setup (I left the company
>>> last week) have managed to destroy one of the 13 terabyte arrays...so
>>> they've got bigger issues right now. Thanks for your help though.
>>>
>>> On Jun 5, 2008, at 3:18 AM, Alex Zhuravlev wrote:
>>>
>>>> hmm. could you provide me with ldiskfs.ko then?
>>>>
>>>> thanks, Alex
>>>>
>>>> Aaron Knister wrote:
>>>>> I'm not sure how to obtain that. I'm running whatever was distributed
>>>>> with 1.6.4.3 for rhel5-x86_64.
>>>>>
>>>>> On Jun 2, 2008, at 8:08 AM, Alex Zhuravlev wrote:
>>>>>
>>>>>> could you also send me your mballoc.c please?
>>>>>>
>>>>>> thanks, Alex
>>>>>>
>>>>>> Aaron Knister wrote:
>>>>>>> Thanks so much for looking into this. Here's what I got from dmesg.
>>>>>>> Interestingly enough, every time it panics the CPU listed is #4... do
>>>>>>> you think that points to a hardware problem? -
>>>>>>> ----------- [cut here ] --------- [please bite here ] ---------
>>>>>>> Kernel BUG at ...build/BUILD/lustre-ldiskfs-3.0.4/ldiskfs/mballoc.c:1334
>>>>>>> invalid opcode: 0000 [1] SMP
>>>>>>> last sysfs file: /devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:04:00.1/irq
>>>>>>> CPU 4
>>>>>>> Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U)
>>>>>>> ldiskfs(U) crc16(U) lustre(U) lov(U) lquota(U) mdc(U) ko2iblnd(U)
>>>>>>> ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) netconsole(U)
>>>>>>> autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U)
>>>>>>> ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U)
>>>>>>> x_tables(U) ipv6(U) ib_iser(U) libiscsi(U) scsi_transport_iscsi(U)
>>>>>>> rdma_ucm(U) ib_ucm(U) ib_srp(U) ib_sdp(U) rdma_cm(U) ib_cm(U)
>>>>>>> iw_cm(U) ib_addr(U) ib_local_sa(U) ib_ipoib(U) ib_sa(U)
>>>>>>> ib_uverbs(U) ib_umad(U) dm_multipath(U) video(U) sbs(U)
>>>>>>> backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U)
>>>>>>> acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U)
>>>>>>> ib_mthca(U) ib_mad(U) i2c_i801(U) e1000(U) ide_cd(U) ib_core(U)
>>>>>>> shpchp(U) i2c_core(U) cdrom(U) sg(U) pcspkr(U) dm_snapshot(U)
>>>>>>> dm_zero(U) dm_mirror(U) dm_mod(U) ata_piix(U) ahci(U)
>>>>>>> megaraid_sas(U) sata_sil(U) libata(U) sd_mod(U) scsi_mod(U)
>>>>>>> ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
>>>>>>> Pid: 5717, comm: ll_ost_io_11 Tainted: GF     2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1
>>>>>>> RIP: 0010:[<ffffffff8885a01f>]  [<ffffffff8885a01f>] :ldiskfs:ldiskfs_mb_use_best_found+0xef/0x520
>>>>>>> RSP: 0018:ffff81040e4dd320  EFLAGS: 00010246
>>>>>>> RAX: 0000000000000000 RBX: ffff81040e4dd3d0 RCX: 000000000000007f
>>>>>>> RDX: ffff810406833000 RSI: ffff8103a8250000 RDI: 0000000000001000
>>>>>>> RBP: 0000000000000800 R08: 0000000000000020 R09: 0000000000000010
>>>>>>> R10: ffff81037d136ff8 R11: 0000000000008000 R12: ffff81040e4dd470
>>>>>>> R13: 0000000000010000 R14: 0000000000000020 R15: 0000000000000800
>>>>>>> FS:  00002aaaaaac0220(0000) GS:ffff81042fc20b40(0000) knlGS:0000000000000000
>>>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>>> CR2: 00002aaaae2ad0a0 CR3: 000000041c6d1000 CR4: 00000000000006e0
>>>>>>> Process ll_ost_io_11 (pid: 5717, threadinfo ffff81040e4dc000, task ffff81041228a040)
>>>>>>> Stack:  ffff81040e4dd4c0 ffff81040e4dd470 000000000000000b 0000000000010000
>>>>>>> 0000000000010000 0000000000000020 ffff81040e4dd4c0 ffffffff8885cd4e
>>>>>>> ffff81038c2b0a38 0000000000000010 ffff81040e4dd4a0 0000000000000000
>>>>>>> Call Trace:
>>>>>>> [<ffffffff8885cd4e>] :ldiskfs:ldiskfs_mb_regular_allocator+0x66e/0xd50
>>>>>>> [<ffffffff888443c2>] :ldiskfs:ldiskfs_mark_inode_dirty+0x132/0x160
>>>>>>> [<ffffffff88859664>] :ldiskfs:ldiskfs_mb_initialize_context+0x144/0x160
>>>>>>> [<ffffffff8885e1e6>] :ldiskfs:ldiskfs_mb_new_blocks+0x166/0x280
>>>>>>> [<ffffffff888bf091>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x301/0x640
>>>>>>> [<ffffffff8000a63f>] get_page_from_freelist+0x223/0x3cf
>>>>>>> [<ffffffff882390b0>] :ib_mthca:mthca_arbel_post_send+0x5a2/0x5b4
>>>>>>> [<ffffffff88856d27>] :ldiskfs:ldiskfs_ext_walk_space+0x1b7/0x250
>>>>>>> [<ffffffff888bed90>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x0/0x640
>>>>>>> [<ffffffff80054ccc>] mwait_idle+0x0/0x4a
>>>>>>> [<ffffffff888ba9ae>] :fsfilt_ldiskfs:fsfilt_map_nblocks+0xfe/0x150
>>>>>>> [<ffffffff888eee08>] :obdfilter:filter_direct_io+0x478/0xce0
>>>>>>> [<ffffffff888f14de>] :obdfilter:filter_commitrw_write+0x184e/0x2570
>>>>>>> [<ffffffff8005a2ea>] cache_alloc_refill+0x106/0x186
>>>>>>> [<ffffffff8003ce86>] lock_timer_base+0x1b/0x3c
>>>>>>> [<ffffffff8889eb46>] :ost:ost_brw_write+0x21b6/0x28c0
>>>>>>> [<ffffffff80088431>] default_wake_function+0x0/0xe
>>>>>>> [<ffffffff888a283e>] :ost:ost_handle+0x2a8e/0x58d8
>>>>>>> [<ffffffff88603c82>] :obdclass:class_handle2object+0xd2/0x160
>>>>>>> [<ffffffff8869f230>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
>>>>>>> [<ffffffff8869cde5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
>>>>>>> [<ffffffff886a4a3b>] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270
>>>>>>> [<ffffffff80060f29>] thread_return+0x0/0xeb
>>>>>>> [<ffffffff8006b6c9>] do_gettimeofday+0x50/0x92
>>>>>>> [<ffffffff8855d056>] :libcfs:lcw_update_time+0x16/0x100
>>>>>>> [<ffffffff8003ce86>] lock_timer_base+0x1b/0x3c
>>>>>>> [<ffffffff886a747c>] :ptlrpc:ptlrpc_main+0x7dc/0x950
>>>>>>> [<ffffffff80088431>] default_wake_function+0x0/0xe
>>>>>>> [<ffffffff8005bfb1>] child_rip+0xa/0x11
>>>>>>> [<ffffffff886a6ca0>] :ptlrpc:ptlrpc_main+0x0/0x950
>>>>>>> [<ffffffff8005bfa7>] child_rip+0x0/0x11
>>>>>>> Code: 0f 0b 68 50 5c 86 88 c2 36 05 0f 1f 80 00 00 00 00 48 8b 43
>>>>>>> RIP  [<ffffffff8885a01f>] :ldiskfs:ldiskfs_mb_use_best_found+0xef/0x520
>>>>>>> RSP <ffff81040e4dd320>
>>>>>>> <0>Kernel panic - not syncing: Fatal exception
>>>>>>> On Jun 2, 2008, at 3:40 AM, Alex Zhuravlev wrote:
>>>>>>>> what about dmesg output? it's unlikely the lustre debug log can
>>>>>>>> help here, as the problem seems to be very internal to ldiskfs
>>>>>>>> (the mballoc piece of it).
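>>>>>>>>
>>>>>>>> for reference, here is a rough sketch of what that code does, modeled
>>>>>>>> on the upstream ext4 mballoc version of the function (names and the
>>>>>>>> exact line in the 1.6.4.3 ldiskfs sources may differ): the RIP in your
>>>>>>>> trace points into ldiskfs_mb_use_best_found(), which sanity-checks the
>>>>>>>> extent the allocator picked before marking it used in the buddy bitmap:
>>>>>>>>
>>>>>>>> static void ldiskfs_mb_use_best_found(struct ldiskfs_allocation_context *ac,
>>>>>>>>                                       struct ldiskfs_buddy *e3b)
>>>>>>>> {
>>>>>>>>         /* the chosen extent must belong to the buddy group that is
>>>>>>>>          * currently loaded, and the search must not be finished yet */
>>>>>>>>         BUG_ON(ac->ac_b_ex.fe_group != e3b->bd_group);
>>>>>>>>         BUG_ON(ac->ac_status == AC_STATUS_FOUND);
>>>>>>>>
>>>>>>>>         /* clip the best-found extent to the original request and mark
>>>>>>>>          * it used; mb_mark_used() performs further BUG_ON() checks on
>>>>>>>>          * the extent length and group */
>>>>>>>>         ac->ac_b_ex.fe_len = min(ac->ac_b_ex.fe_len, ac->ac_g_ex.fe_len);
>>>>>>>>         mb_mark_used(e3b, &ac->ac_b_ex);
>>>>>>>>         ...
>>>>>>>> }
>>>>>>>>
>>>>>>>> one of those assertions firing suggests the allocator's in-memory buddy
>>>>>>>> state no longer matches what it expects, which is why this looks like an
>>>>>>>> ldiskfs-internal problem rather than something at the lustre level.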
>>>>>>>>
>>>>>>>> thanks, Alex
>>>>>>>>
>>>>>>>> Aaron Knister wrote:
>>>>>>>>> I bumped up debugging, and here's (below) the last bit of debugging
>>>>>>>>> info from lustre that I have on the oss before it went belly up. My
>>>>>>>>> system is totally inoperable. Does anybody have any ideas?
>>>>>>>>>
>>>>>>>>> 00010000:00001000:4:1212157576.884909:0:6378:0:(ldlm_resource.c:865:ldlm_resource_add_lock()) About to add this lock:
>>>>>>>>> 00010000:00001000:4:1212157576.884910:0:6378:0:(ldlm_lock.c:1718:ldlm_lock_dump()) -- Lock dump: ffff81036e61f1c0/0x1fb135e1e3fd2cc6 (rc: 3) (pos:0)(pid:6378)
>>>>>>>>> 00010000:00001000:4:1212157576.884913:0:6378:0:(ldlm_lock.c:1726:ldlm_lock_dump()) Node: local
>>>>>>>>> 00010000:00001000:4:1212157576.884914:0:6378:0:(ldlm_lock.c:1735:ldlm_lock_dump()) Resource: ffff8103849211c0 (3129942/0)
>>>>>>>>> 00010000:00001000:4:1212157576.884915:0:6378:0:(ldlm_lock.c:1740:ldlm_lock_dump()) Req mode: PW, grant mode: PW, rc: 3, read: 0, write: 1 flags: 0x80004000
>>>>>>>>> 00010000:00001000:4:1212157576.884917:0:6378:0:(ldlm_lock.c:1746:ldlm_lock_dump()) Extent: 0 -> 18446744073709551615 (req 0-18446744073709551615)
>>>>>>>>> 00010000:00000040:4:1212157576.884920:0:6378:0:(ldlm_lock.c:615:ldlm_lock_decref_internal()) forcing cancel of local lock
>>>>>>>>> 00010000:00000010:4:1212157576.884922:0:6378:0:(ldlm_lockd.c:1357:ldlm_bl_to_thread()) kmalloced 'blwi': 120 at ffff81040e90a340 (tot 49135175)
>>>>>>>>> 00002000:00000040:4:1212157576.884925:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) started handle ffff8103766dfc78 (0000000000000000)
>>>>>>>>> 00002000:00000040:4:1212157576.884930:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) committing handle ffff8103766dfc78
>>>>>>>>> 00002000:00000040:4:1212157576.884931:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) started handle ffff8103766dfc78 (0000000000000000)
>>>>>>>>> 00000020:00000040:4:1212157576.884957:0:5557:0:(lustre_handles.c:121:class_handle_unhash_nolock()) removing object ffff81036e61f1c0 with handle 0x1fb135e1e3fd2cc6 from hash
>>>>>>>>> 00000100:00000010:4:1212157576.884960:0:5557:0:(client.c:394:ptlrpc_prep_set()) kmalloced 'set': 104 at ffff8104012d38c0 (tot 49135279)
>>>>>>>>> 00000100:00000010:4:1212157576.884962:0:5557:0:(client.c:457:ptlrpc_set_destroy()) kfreed 'set': 104 at ffff8104012d38c0 (tot 49135175).
>>>>>>>>> 00010000:00000040:4:1212157576.884964:0:5557:0:(ldlm_resource.c:818:ldlm_resource_putref()) putref res: ffff8103849211c0 count: 0
>>>>>>>>> 00010000:00000010:4:1212157576.884969:0:5557:0:(ldlm_resource.c:828:ldlm_resource_putref()) kfreed 'res->lr_lvb_data': 40 at ffff810379ded880 (tot 49135135).
>>>>>>>>> 00010000:00000010:4:1212157576.885000:0:5557:0:(ldlm_resource.c:829:ldlm_resource_putref()) slab-freed 'res': 224 at ffff8103849211c0 (tot 49135135).
>>>>>>>>> 00010000:00000010:4:1212157576.885002:0:5557:0:(ldlm_lockd.c:1657:ldlm_bl_thread_main()) kfreed 'blwi': 120 at ffff81040e90a340 (tot 49134791).
>>>>>>>>> 00002000:00000040:4:1212157576.885623:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) committing handle ffff8103766dfc78
>>>>>>>>> 00002000:00000002:4:1212157576.885625:0:6378:0:(filter.c:148:f_dput()) putting 3129942: ffff8103599cea98, count = 0
>>>>>>>>> 00002000:00080000:4:1212157576.885627:0:6378:0:(filter.c:2689:filter_destroy_precreated()) crew4-OST0001: after destroy: set last_objids[0] = 3129941
>>>>>>>>> 00002000:00000002:4:1212157576.885630:0:6378:0:(filter.c:607:filter_update_last_objid()) crew4-OST0001: server last_objid for group 0: 3129941
>>>>>>>>> 00002000:00000010:4:1212157576.912615:0:6485:0:(fsfilt-ldiskfs.c:747:fsfilt_ldiskfs_cb_func()) slab-freed 'fcb': 56 at ffff810371404920 (tot 49134335).
>>>>>>>>> 00010000:00000040:4:1212157576.912669:0:6378:0:(ldlm_lib.c:1556:target_committed_to_req()) last_committed 17896268, xid 3841
>>>>>>>>> 00000100:00000040:4:1212157576.912674:0:6378:0:(connection.c:191:ptlrpc_connection_addref()) connection=ffff8103fbe9e2c0 refcount 10 to 172.18.0.10@o2ib
>>>>>>>>> 00000100:00000040:4:1212157576.912678:0:6378:0:(niobuf.c:46:ptl_send_buf()) conn=ffff8103fbe9e2c0 id 12345-172.18.0.10@o2ib
>>>>>>>>> 00000400:00000010:4:1212157576.912680:0:6378:0:(lib-lnet.h:247:lnet_md_alloc()) kmalloced 'md': 136 at ffff81040cb6cb80 (tot 9568949).
>>>>>>>>> 00000400:00000010:4:1212157576.912683:0:6378:0:(lib-lnet.h:295:lnet_msg_alloc()) kmalloced 'msg': 336 at ffff8104285e1e00 (tot 9569285).
>>>>>>>>> 00000100:00000040:4:1212157576.912693:0:6378:0:(connection.c:150:ptlrpc_put_connection()) connection=ffff8103fbe9e2c0 refcount 9 to 172.18.0.10@o2ib
>>>>>>>>> 00000100:00000040:4:1212157576.912695:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) RPC PUTting export ffff8103848e9000 : new rpc_count 0
>>>>>>>>> 00000100:00000040:4:1212157576.912697:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) PUTting export ffff8103848e9000 : new refcount 4
>>>>>>>>> 00000100:00000040:4:1212157576.912699:0:6378:0:(service.c:652:ptlrpc_server_handle_request()) PUTting export ffff8103848e9000 : new refcount 3
>>>>>>>>> 00000400:00000010:4:1212157576.912741:0:5351:0:(lib-lnet.h:269:lnet_md_free()) kfreed 'md': 136 at ffff81040cb6cb80 (tot 9569149).
>>>>>>>>> 00000400:00000010:4:1212157576.912744:0:5351:0:(lib-lnet.h:312:lnet_msg_free()) kfreed 'msg': 336 at ffff8104285e1e00 (tot 9568813).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, May 28, 2008 at 8:03 PM, Aaron Knister
>>>>>>>>> <aaron.knister at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Thank you very much for looking into this. I've attached my dmesg
>>>>>>>>> to the bug. I looked at line number 1334, which the panic seems to
>>>>>>>>> reference, but I can't figure out what it's doing.
>>>>>>>>>
>>>>>>>>> On Wed, May 28, 2008 at 4:54 PM, Alex Zhuravlev
>>>>>>>>> <Alex.Zhuravlev at sun.com> wrote:
>>>>>>>>>
>>>>>>>>>     Aaron Knister wrote:
>>>>>>>>>
>>>>>>>>>         I'm seeing this bug (14465) under heavy load on my OSSes.
>>>>>>>>>         If I reboot the MDS it appears to help... any ideas? What's
>>>>>>>>>         the status on this bug?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>     could you attach your dmesg to the bug? as for the status - I'm
>>>>>>>>>     still not able to reproduce this, nor have I found a possible
>>>>>>>>>     cause, sorry.
>>>>>>>>>
>>>>>>>>>     thanks, Alex
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 



