[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found

Aaron Knister aaron.knister at gmail.com
Fri May 30 08:29:14 PDT 2008


I bumped up debugging, and below is the last bit of debug output from Lustre
that I have on the OSS before it went belly up. My system is totally
inoperable. Does anybody have any ideas?
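
For anyone trying to reproduce this: raising the debug mask and dumping the
kernel debug buffer can be done roughly as shown below. The parameter names
are from the 1.6-era lnet sysctl/proc interface and may differ between Lustre
versions, so treat this as a sketch rather than the exact commands I ran.

    # enable all debug flags (assumes the lnet sysctl interface)
    sysctl -w lnet.debug=-1
    # enlarge the in-kernel debug buffer (in MB) so messages aren't dropped under load
    sysctl -w lnet.debug_mb=256
    # ...reproduce the workload, then dump and clear the debug buffer...
    lctl dk /tmp/lustre-debug.log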

00010000:00001000:4:1212157576.884909:0:6378:0:(ldlm_resource.c:865:ldlm_resource_add_lock())
About to add this lock:
00010000:00001000:4:1212157576.884910:0:6378:0:(ldlm_lock.c:1718:ldlm_lock_dump())
-- Lock dump: ffff81036e61f1c0/0x1fb135e1e3fd2cc6 (rc: 3) (pos: 0) (pid:
6378)
00010000:00001000:4:1212157576.884913:0:6378:0:(ldlm_lock.c:1726:ldlm_lock_dump())
Node: local
00010000:00001000:4:1212157576.884914:0:6378:0:(ldlm_lock.c:1735:ldlm_lock_dump())
Resource: ffff8103849211c0 (3129942/0)
00010000:00001000:4:1212157576.884915:0:6378:0:(ldlm_lock.c:1740:ldlm_lock_dump())
Req mode: PW, grant mode: PW, rc: 3, read: 0, write: 1 flags: 0x80004000
00010000:00001000:4:1212157576.884917:0:6378:0:(ldlm_lock.c:1746:ldlm_lock_dump())
Extent: 0 -> 18446744073709551615 (req 0-18446744073709551615)
00010000:00000040:4:1212157576.884920:0:6378:0:(ldlm_lock.c:615:ldlm_lock_decref_internal())
forcing cancel of local lock
00010000:00000010:4:1212157576.884922:0:6378:0:(ldlm_lockd.c:1357:ldlm_bl_to_thread())
kmalloced 'blwi': 120 at ffff81040e90a340 (tot 49135175)
00002000:00000040:4:1212157576.884925:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log())
started handle ffff8103766dfc78 (0000000000000000)
00002000:00000040:4:1212157576.884930:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit())
committing handle ffff8103766dfc78
00002000:00000040:4:1212157576.884931:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log())
started handle ffff8103766dfc78 (0000000000000000)
00000020:00000040:4:1212157576.884957:0:5557:0:(lustre_handles.c:121:class_handle_unhash_nolock())
removing object ffff81036e61f1c0 with handle 0x1fb135e1e3fd2cc6 from hash
00000100:00000010:4:1212157576.884960:0:5557:0:(client.c:394:ptlrpc_prep_set())
kmalloced 'set': 104 at ffff8104012d38c0 (tot 49135279)
00000100:00000010:4:1212157576.884962:0:5557:0:(client.c:457:ptlrpc_set_destroy())
kfreed 'set': 104 at ffff8104012d38c0 (tot 49135175).
00010000:00000040:4:1212157576.884964:0:5557:0:(ldlm_resource.c:818:ldlm_resource_putref())
putref res: ffff8103849211c0 count: 0
00010000:00000010:4:1212157576.884969:0:5557:0:(ldlm_resource.c:828:ldlm_resource_putref())
kfreed 'res->lr_lvb_data': 40 at ffff810379ded880 (tot 49135135).
00010000:00000010:4:1212157576.885000:0:5557:0:(ldlm_resource.c:829:ldlm_resource_putref())
slab-freed 'res': 224 at ffff8103849211c0 (tot 49135135).
00010000:00000010:4:1212157576.885002:0:5557:0:(ldlm_lockd.c:1657:ldlm_bl_thread_main())
kfreed 'blwi': 120 at ffff81040e90a340 (tot 49134791).
00002000:00000040:4:1212157576.885623:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit())
committing handle ffff8103766dfc78
00002000:00000002:4:1212157576.885625:0:6378:0:(filter.c:148:f_dput())
putting 3129942: ffff8103599cea98, count = 0
00002000:00080000:4:1212157576.885627:0:6378:0:(filter.c:2689:filter_destroy_precreated())
crew4-OST0001: after destroy: set last_objids[0] = 3129941
00002000:00000002:4:1212157576.885630:0:6378:0:(filter.c:607:filter_update_last_objid())
crew4-OST0001: server last_objid for group 0: 3129941
00002000:00000010:4:1212157576.912615:0:6485:0:(fsfilt-ldiskfs.c:747:fsfilt_ldiskfs_cb_func())
slab-freed 'fcb': 56 at ffff810371404920 (tot 49134335).
00010000:00000040:4:1212157576.912669:0:6378:0:(ldlm_lib.c:1556:target_committed_to_req())
last_committed 17896268, xid 3841
00000100:00000040:4:1212157576.912674:0:6378:0:(connection.c:191:ptlrpc_connection_addref())
connection=ffff8103fbe9e2c0 refcount 10 to 172.18.0.10@o2ib
00000100:00000040:4:1212157576.912678:0:6378:0:(niobuf.c:46:ptl_send_buf())
conn=ffff8103fbe9e2c0 id 12345-172.18.0.10@o2ib
00000400:00000010:4:1212157576.912680:0:6378:0:(lib-lnet.h:247:lnet_md_alloc())
kmalloced 'md': 136 at ffff81040cb6cb80 (tot 9568949).
00000400:00000010:4:1212157576.912683:0:6378:0:(lib-lnet.h:295:lnet_msg_alloc())
kmalloced 'msg': 336 at ffff8104285e1e00 (tot 9569285).
00000100:00000040:4:1212157576.912693:0:6378:0:(connection.c:150:ptlrpc_put_connection())
connection=ffff8103fbe9e2c0 refcount 9 to 172.18.0.10@o2ib
00000100:00000040:4:1212157576.912695:0:6378:0:(service.c:648:ptlrpc_server_handle_request())
RPC PUTting export ffff8103848e9000 : new rpc_count 0
00000100:00000040:4:1212157576.912697:0:6378:0:(service.c:648:ptlrpc_server_handle_request())
PUTting export ffff8103848e9000 : new refcount 4
00000100:00000040:4:1212157576.912699:0:6378:0:(service.c:652:ptlrpc_server_handle_request())
PUTting export ffff8103848e9000 : new refcount 3
00000400:00000010:4:1212157576.912741:0:5351:0:(lib-lnet.h:269:lnet_md_free())
kfreed 'md': 136 at ffff81040cb6cb80 (tot 9569149).
00000400:00000010:4:1212157576.912744:0:5351:0:(lib-lnet.h:312:lnet_msg_free())
kfreed 'msg': 336 at ffff8104285e1e00 (tot 9568813).


On Wed, May 28, 2008 at 8:03 PM, Aaron Knister <aaron.knister at gmail.com>
wrote:

> Thank you very much for looking into this. I've attached my dmesg to the
> bug. I looked at line number 1334, which the panic seems to reference, but I
> can't figure out what it's doing.
>
> On Wed, May 28, 2008 at 4:54 PM, Alex Zhuravlev <Alex.Zhuravlev at sun.com>
> wrote:
>
>> Aaron Knister wrote:
>>
>>> I'm seeing this bug (14465) under heavy load on my OSSes. If I reboot the
>>> MDS it appears to help... any ideas? What's the status on this bug?
>>>
>>
>> Could you attach your dmesg to the bug? As for the status - I'm still not
>> able to reproduce this, nor have I found a possible cause, sorry.
>>
>> thanks, Alex
>>
>>
>

