I bumped up debugging, and here's  (below) the last bit of debugging info from lustre that I have on the oss before it went belly up. My system is totally inoperable. Does anybody have any ideas?<br><br>00010000:00001000:4:1212157576.884909:0:6378:0:(ldlm_resource.c:865:ldlm_resource_add_lock()) About to add this lock:<br>

00010000:00001000:4:1212157576.884910:0:6378:0:(ldlm_lock.c:1718:ldlm_lock_dump())  -- Lock dump: ffff81036e61f1c0/0x1fb135e1e3fd2cc6 (rc: 3) (pos: 0) (pid: 6378)<br>00010000:00001000:4:1212157576.884913:0:6378:0:(ldlm_lock.c:1726:ldlm_lock_dump())   Node: local<br>

00010000:00001000:4:1212157576.884914:0:6378:0:(ldlm_lock.c:1735:ldlm_lock_dump())   Resource: ffff8103849211c0 (3129942/0)<br>00010000:00001000:4:1212157576.884915:0:6378:0:(ldlm_lock.c:1740:ldlm_lock_dump())   Req mode: PW, grant mode: PW, rc: 3, read: 0, write: 1 flags: 0x80004000<br>

00010000:00001000:4:1212157576.884917:0:6378:0:(ldlm_lock.c:1746:ldlm_lock_dump())   Extent: 0 -> 18446744073709551615 (req 0-18446744073709551615)<br>00010000:00000040:4:1212157576.884920:0:6378:0:(ldlm_lock.c:615:ldlm_lock_decref_internal()) forcing cancel of local lock<br>

00010000:00000010:4:1212157576.884922:0:6378:0:(ldlm_lockd.c:1357:ldlm_bl_to_thread()) kmalloced 'blwi': 120 at ffff81040e90a340 (tot 49135175)<br>00002000:00000040:4:1212157576.884925:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) started handle ffff8103766dfc78 (0000000000000000)<br>

00002000:00000040:4:1212157576.884930:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) committing handle ffff8103766dfc78<br>00002000:00000040:4:1212157576.884931:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) started handle ffff8103766dfc78 (0000000000000000)<br>

00000020:00000040:4:1212157576.884957:0:5557:0:(lustre_handles.c:121:class_handle_unhash_nolock()) removing object ffff81036e61f1c0 with handle 0x1fb135e1e3fd2cc6 from hash<br>00000100:00000010:4:1212157576.884960:0:5557:0:(client.c:394:ptlrpc_prep_set()) kmalloced 'set': 104 at ffff8104012d38c0 (tot 49135279)<br>

00000100:00000010:4:1212157576.884962:0:5557:0:(client.c:457:ptlrpc_set_destroy()) kfreed 'set': 104 at ffff8104012d38c0 (tot 49135175).<br>00010000:00000040:4:1212157576.884964:0:5557:0:(ldlm_resource.c:818:ldlm_resource_putref()) putref res: ffff8103849211c0 count: 0<br>

00010000:00000010:4:1212157576.884969:0:5557:0:(ldlm_resource.c:828:ldlm_resource_putref()) kfreed 'res->lr_lvb_data': 40 at ffff810379ded880 (tot 49135135).<br>00010000:00000010:4:1212157576.885000:0:5557:0:(ldlm_resource.c:829:ldlm_resource_putref()) slab-freed 'res': 224 at ffff8103849211c0 (tot 49135135).<br>

00010000:00000010:4:1212157576.885002:0:5557:0:(ldlm_lockd.c:1657:ldlm_bl_thread_main()) kfreed 'blwi': 120 at ffff81040e90a340 (tot 49134791).<br>00002000:00000040:4:1212157576.885623:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) committing handle ffff8103766dfc78<br>

00002000:00000002:4:1212157576.885625:0:6378:0:(filter.c:148:f_dput()) putting 3129942: ffff8103599cea98, count = 0<br>00002000:00080000:4:1212157576.885627:0:6378:0:(filter.c:2689:filter_destroy_precreated()) crew4-OST0001: after destroy: set last_objids[0] = 3129941<br>

00002000:00000002:4:1212157576.885630:0:6378:0:(filter.c:607:filter_update_last_objid()) crew4-OST0001: server last_objid for group 0: 3129941<br>00002000:00000010:4:1212157576.912615:0:6485:0:(fsfilt-ldiskfs.c:747:fsfilt_ldiskfs_cb_func()) slab-freed 'fcb': 56 at ffff810371404920 (tot 49134335).<br>

00010000:00000040:4:1212157576.912669:0:6378:0:(ldlm_lib.c:1556:target_committed_to_req()) last_committed 17896268, xid 3841<br>00000100:00000040:4:1212157576.912674:0:6378:0:(connection.c:191:ptlrpc_connection_addref()) connection=ffff8103fbe9e2c0 refcount 10 to <a href="http://172.18.0.10">172.18.0.10</a>@o2ib<br>

00000100:00000040:4:1212157576.912678:0:6378:0:(niobuf.c:46:ptl_send_buf()) conn=ffff8103fbe9e2c0 id 12345-172.18.0.10@o2ib<br>00000400:00000010:4:1212157576.912680:0:6378:0:(lib-lnet.h:247:lnet_md_alloc()) kmalloced 'md': 136 at ffff81040cb6cb80 (tot 9568949).<br>

00000400:00000010:4:1212157576.912683:0:6378:0:(lib-lnet.h:295:lnet_msg_alloc()) kmalloced 'msg': 336 at ffff8104285e1e00 (tot 9569285).<br>00000100:00000040:4:1212157576.912693:0:6378:0:(connection.c:150:ptlrpc_put_connection()) connection=ffff8103fbe9e2c0 refcount 9 to <a href="http://172.18.0.10">172.18.0.10</a>@o2ib<br>

00000100:00000040:4:1212157576.912695:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) RPC PUTting export ffff8103848e9000 : new rpc_count 0<br>00000100:00000040:4:1212157576.912697:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) PUTting export ffff8103848e9000 : new refcount 4<br>

00000100:00000040:4:1212157576.912699:0:6378:0:(service.c:652:ptlrpc_server_handle_request()) PUTting export ffff8103848e9000 : new refcount 3<br>00000400:00000010:4:1212157576.912741:0:5351:0:(lib-lnet.h:269:lnet_md_free()) kfreed 'md': 136 at ffff81040cb6cb80 (tot 9569149).<br>

00000400:00000010:4:1212157576.912744:0:5351:0:(lib-lnet.h:312:lnet_msg_free()) kfreed 'msg': 336 at ffff8104285e1e00 (tot 9568813).<br><br><br><div class="gmail_quote">On Wed, May 28, 2008 at 8:03 PM, Aaron Knister <<a href="mailto:aaron.knister@gmail.com">aaron.knister@gmail.com</a>> wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d">Thank you very much for looking into this. I've attached my dmesg to

the bug. i looked at line number 1334 which the panic seems to

reference. i can't figure out what its doing though<br><br></div><div class="gmail_quote"><div class="Ih2E3d">On Wed, May 28, 2008 at 4:54 PM, Alex Zhuravlev <<a href="mailto:Alex.Zhuravlev@sun.com" target="_blank">Alex.Zhuravlev@sun.com</a>> wrote:<br>


</div><div><div></div><div class="Wj3C7c"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div>Aaron Knister wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

I'm seeing this bug (14465) under heavy load on my OSSes. If I reboot the MDS it appears to help...any ideas? What's the status on this bug?<br>

</blockquote>

<br></div>

could you attach your dmesg to the bug? as for the status - I'm still not<br>

able to reproduce this, neither I found possible cause, sorry.<br>

<br>

thanks, Alex<br>

<br>

</blockquote></div></div></div><br>

</blockquote></div><br>