[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found

Alex Zhuravlev Alex.Zhuravlev at sun.com
Mon Jun 2 00:40:55 PDT 2008


What about the dmesg output? It's unlikely the Lustre debug log can help here, as the problem
seems to be internal to ldiskfs (the mballoc piece of it).
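
For reference, this is the buddy-allocator path: once a request settles on its "best found"
extent, ldiskfs_mb_use_best_found() commits it by marking those blocks as used in the block
group, and the kernel BUG()s if the extent it is handed is inconsistent. A rough userspace
sketch of that kind of invariant check (all names and checks below are illustrative only, not
the actual ldiskfs source):

    /*
     * Userspace illustration only -- NOT the ldiskfs source.  mballoc keeps a
     * "best found" extent (start block + length) for each allocation request
     * and marks those blocks as used in the block group's bitmap; the real
     * code BUG()s if the extent it is about to commit is empty or overruns
     * the group.  All names here are made up for this sketch.
     */
    #include <assert.h>
    #include <stdio.h>

    #define BLOCKS_PER_GROUP 128        /* tiny group, just for the example */

    struct best_extent {
        unsigned int start;             /* first block within the group */
        unsigned int len;               /* number of blocks */
    };

    static unsigned char group_bitmap[BLOCKS_PER_GROUP / 8];

    static void set_block_used(unsigned int nr)
    {
        group_bitmap[nr / 8] |= 1u << (nr % 8);
    }

    /* Roughly what "use best found" must guarantee before marking blocks. */
    static void use_best_found(const struct best_extent *ex)
    {
        /* The sort of invariants whose violation would show up as a
         * kernel BUG in the "use best found" step. */
        assert(ex->len > 0);
        assert(ex->start + ex->len <= BLOCKS_PER_GROUP);

        for (unsigned int i = 0; i < ex->len; i++)
            set_block_used(ex->start + i);
    }

    int main(void)
    {
        struct best_extent ex = { .start = 10, .len = 4 };

        use_best_found(&ex);
        printf("marked blocks %u..%u as used\n", ex.start, ex.start + ex.len - 1);
        return 0;
    }

If a check like that fires, it is presumably the call trace in dmesg, not the Lustre debug log,
that tells us which caller fed the allocator a bad extent.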

thanks, Alex

Aaron Knister wrote:
> I bumped up debugging, and here (below) is the last bit of debugging 
> info from Lustre that I have on the OSS before it went belly up. My 
> system is totally inoperable. Does anybody have any ideas?
> 
> 00010000:00001000:4:1212157576.884909:0:6378:0:(ldlm_resource.c:865:ldlm_resource_add_lock()) 
> About to add this lock:
> 00010000:00001000:4:1212157576.884910:0:6378:0:(ldlm_lock.c:1718:ldlm_lock_dump())  
> -- Lock dump: ffff81036e61f1c0/0x1fb135e1e3fd2cc6 (rc: 3) (pos: 0) (pid: 
> 6378)
> 00010000:00001000:4:1212157576.884913:0:6378:0:(ldlm_lock.c:1726:ldlm_lock_dump())   
> Node: local
> 00010000:00001000:4:1212157576.884914:0:6378:0:(ldlm_lock.c:1735:ldlm_lock_dump())   
> Resource: ffff8103849211c0 (3129942/0)
> 00010000:00001000:4:1212157576.884915:0:6378:0:(ldlm_lock.c:1740:ldlm_lock_dump())   
> Req mode: PW, grant mode: PW, rc: 3, read: 0, write: 1 flags: 0x80004000
> 00010000:00001000:4:1212157576.884917:0:6378:0:(ldlm_lock.c:1746:ldlm_lock_dump())   
> Extent: 0 -> 18446744073709551615 (req 0-18446744073709551615)
> 00010000:00000040:4:1212157576.884920:0:6378:0:(ldlm_lock.c:615:ldlm_lock_decref_internal()) 
> forcing cancel of local lock
> 00010000:00000010:4:1212157576.884922:0:6378:0:(ldlm_lockd.c:1357:ldlm_bl_to_thread()) 
> kmalloced 'blwi': 120 at ffff81040e90a340 (tot 49135175)
> 00002000:00000040:4:1212157576.884925:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) 
> started handle ffff8103766dfc78 (0000000000000000)
> 00002000:00000040:4:1212157576.884930:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) 
> committing handle ffff8103766dfc78
> 00002000:00000040:4:1212157576.884931:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) 
> started handle ffff8103766dfc78 (0000000000000000)
> 00000020:00000040:4:1212157576.884957:0:5557:0:(lustre_handles.c:121:class_handle_unhash_nolock()) 
> removing object ffff81036e61f1c0 with handle 0x1fb135e1e3fd2cc6 from hash
> 00000100:00000010:4:1212157576.884960:0:5557:0:(client.c:394:ptlrpc_prep_set()) 
> kmalloced 'set': 104 at ffff8104012d38c0 (tot 49135279)
> 00000100:00000010:4:1212157576.884962:0:5557:0:(client.c:457:ptlrpc_set_destroy()) 
> kfreed 'set': 104 at ffff8104012d38c0 (tot 49135175).
> 00010000:00000040:4:1212157576.884964:0:5557:0:(ldlm_resource.c:818:ldlm_resource_putref()) 
> putref res: ffff8103849211c0 count: 0
> 00010000:00000010:4:1212157576.884969:0:5557:0:(ldlm_resource.c:828:ldlm_resource_putref()) 
> kfreed 'res->lr_lvb_data': 40 at ffff810379ded880 (tot 49135135).
> 00010000:00000010:4:1212157576.885000:0:5557:0:(ldlm_resource.c:829:ldlm_resource_putref()) 
> slab-freed 'res': 224 at ffff8103849211c0 (tot 49135135).
> 00010000:00000010:4:1212157576.885002:0:5557:0:(ldlm_lockd.c:1657:ldlm_bl_thread_main()) 
> kfreed 'blwi': 120 at ffff81040e90a340 (tot 49134791).
> 00002000:00000040:4:1212157576.885623:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) 
> committing handle ffff8103766dfc78
> 00002000:00000002:4:1212157576.885625:0:6378:0:(filter.c:148:f_dput()) 
> putting 3129942: ffff8103599cea98, count = 0
> 00002000:00080000:4:1212157576.885627:0:6378:0:(filter.c:2689:filter_destroy_precreated()) 
> crew4-OST0001: after destroy: set last_objids[0] = 3129941
> 00002000:00000002:4:1212157576.885630:0:6378:0:(filter.c:607:filter_update_last_objid()) 
> crew4-OST0001: server last_objid for group 0: 3129941
> 00002000:00000010:4:1212157576.912615:0:6485:0:(fsfilt-ldiskfs.c:747:fsfilt_ldiskfs_cb_func()) 
> slab-freed 'fcb': 56 at ffff810371404920 (tot 49134335).
> 00010000:00000040:4:1212157576.912669:0:6378:0:(ldlm_lib.c:1556:target_committed_to_req()) 
> last_committed 17896268, xid 3841
> 00000100:00000040:4:1212157576.912674:0:6378:0:(connection.c:191:ptlrpc_connection_addref()) 
> connection=ffff8103fbe9e2c0 refcount 10 to 172.18.0.10@o2ib
> 00000100:00000040:4:1212157576.912678:0:6378:0:(niobuf.c:46:ptl_send_buf()) 
> conn=ffff8103fbe9e2c0 id 12345-172.18.0.10@o2ib
> 00000400:00000010:4:1212157576.912680:0:6378:0:(lib-lnet.h:247:lnet_md_alloc()) 
> kmalloced 'md': 136 at ffff81040cb6cb80 (tot 9568949).
> 00000400:00000010:4:1212157576.912683:0:6378:0:(lib-lnet.h:295:lnet_msg_alloc()) 
> kmalloced 'msg': 336 at ffff8104285e1e00 (tot 9569285).
> 00000100:00000040:4:1212157576.912693:0:6378:0:(connection.c:150:ptlrpc_put_connection()) 
> connection=ffff8103fbe9e2c0 refcount 9 to 172.18.0.10@o2ib
> 00000100:00000040:4:1212157576.912695:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) 
> RPC PUTting export ffff8103848e9000 : new rpc_count 0
> 00000100:00000040:4:1212157576.912697:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) 
> PUTting export ffff8103848e9000 : new refcount 4
> 00000100:00000040:4:1212157576.912699:0:6378:0:(service.c:652:ptlrpc_server_handle_request()) 
> PUTting export ffff8103848e9000 : new refcount 3
> 00000400:00000010:4:1212157576.912741:0:5351:0:(lib-lnet.h:269:lnet_md_free()) 
> kfreed 'md': 136 at ffff81040cb6cb80 (tot 9569149).
> 00000400:00000010:4:1212157576.912744:0:5351:0:(lib-lnet.h:312:lnet_msg_free()) 
> kfreed 'msg': 336 at ffff8104285e1e00 (tot 9568813).
> 
> 
> On Wed, May 28, 2008 at 8:03 PM, Aaron Knister <aaron.knister at gmail.com> wrote:
> 
>     Thank you very much for looking into this. I've attached my dmesg to
>     the bug. I looked at line number 1334, which the panic seems to
>     reference, but I can't figure out what it's doing.
> 
>     On Wed, May 28, 2008 at 4:54 PM, Alex Zhuravlev
>     <Alex.Zhuravlev at sun.com> wrote:
> 
>         Aaron Knister wrote:
> 
>             I'm seeing this bug (14465) under heavy load on my OSSes. If
>             I reboot the MDS it appears to help...any ideas? What's the
>             status on this bug?
> 
> 
>         Could you attach your dmesg to the bug? As for the status, I'm still
>         not able to reproduce this, nor have I found a possible cause, sorry.
> 
>         thanks, Alex
> 
> 
> 



