<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META content="text/html; charset=us-ascii" http-equiv=Content-Type>
<META name=GENERATOR content="MSHTML 8.00.6001.18852">
<STYLE type="text/css">BLOCKQUOTE {margin-top: 0px; margin-bottom: 0px; margin-left: 2em}</STYLE></HEAD>
<BODY style="MARGIN: 10px; FONT-FAMILY: verdana; FONT-SIZE: 10pt">
<DIV><FONT face=Verdana>Hi, dear list,</FONT></DIV>
<DIV> </DIV>
<DIV>Our login nodes have been crashing frequently these days. The symptom is
that we cannot access the nodes remotely, but we can still reach them from the
local terminal. We ran the following command:</DIV>
<DIV>ps -ef | grep ldlm</DIV>
<DIV> </DIV>
<DIV>
<DIV>root 3823 1 0 10:57 ? 00:00:00 [ldlm_bl_00]</DIV>
<DIV>root 3824 1 0 10:57 ? 00:00:00 [ldlm_bl_01]</DIV>
<DIV>root 3825 1 0 10:57 ? 00:00:00 [ldlm_bl_02]</DIV>
<DIV>root 3826 1 0 10:57 ? 00:00:00 [ldlm_bl_03]</DIV>
<DIV>root 3827 1 0 10:57 ? 00:00:00 [ldlm_bl_04]</DIV>
<DIV>root 3828 1 0 10:57 ? 00:00:00 [ldlm_bl_05]</DIV>
<DIV>root 3829 1 0 10:57 ? 00:00:00 [ldlm_bl_06]</DIV>
<DIV>root 3830 1 0 10:57 ? 00:00:00 [ldlm_bl_07]</DIV>
<DIV>root 3831 1 0 10:57 ? 00:00:00 [ldlm_cn_00]</DIV>
<DIV>root 3832 1 0 10:57 ? 00:00:00 [ldlm_cn_01]</DIV>
<DIV>root 3834 1 0 10:57 ? 00:00:00 [ldlm_cn_02]</DIV>
<DIV>root 3835 1 0 10:57 ? 00:00:00 [ldlm_cn_03]</DIV>
<DIV>root 3836 1 0 10:57 ? 00:00:00 [ldlm_cn_04]</DIV>
<DIV>root 3837 1 0 10:57 ? 00:00:00 [ldlm_cn_05]</DIV>
<DIV>root 3838 1 0 10:57 ? 00:00:00 [ldlm_cn_06]</DIV>
<DIV>root 3839 1 0 10:57 ? 00:00:00 [ldlm_cn_07]</DIV>
<DIV>root 3840 1 0 10:57 ? 00:00:00 [ldlm_cb_00]</DIV>
<DIV>root 3841 1 0 10:57 ? 00:00:00 [ldlm_cb_01]</DIV>
<DIV>root 3842 1 0 10:57 ? 00:00:00 [ldlm_cb_02]</DIV>
<DIV>root 3843 1 0 10:57 ? 00:00:00 [ldlm_cb_03]</DIV>
<DIV>root 3844 1 0 10:57 ? 00:00:00 [ldlm_cb_04]</DIV>
<DIV>root 3845 1 0 10:57 ? 00:00:00 [ldlm_cb_05]</DIV>
<DIV>root 3846 1 0 10:57 ? 00:00:00 [ldlm_cb_06]</DIV>
<DIV>root 3847 1 0 10:57 ? 00:00:00 [ldlm_cb_07]</DIV></DIV>
<DIV>.</DIV>
<DIV>.</DIV>
<DIV>.</DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV>We can see hundreds of ldlm processes. As a result, the load average is
too high (155.0 165.0 145.0) for the nodes to work normally. Having no better
idea, we have to restart the nodes. Around the same time we see the log
messages quoted below:</DIV>
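<DIV> </DIV>
<DIV>As a rough check (a sketch, not part of the original report), the ldlm thread count can be verified like this; an inline sample stands in for the live ps output so the command is self-contained:</DIV>

```shell
# Hypothetical sketch: count ldlm kernel threads.
# On a live node you would pipe `ps -ef` directly; the inline sample
# below just makes the command runnable anywhere.
sample='root 3823 1 0 10:57 ? 00:00:00 [ldlm_bl_00]
root 3831 1 0 10:57 ? 00:00:00 [ldlm_cn_00]
root 3840 1 0 10:57 ? 00:00:00 [ldlm_cb_00]'
printf '%s\n' "$sample" | grep -c '\[ldlm'
# prints 3 for this sample; on the affected nodes it was in the hundreds
```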
<DIV>The filesystem versions:</DIV>
<DIV> </DIV>
<DIV>Server: lustre 1.6.6</DIV>
<DIV>Client: lustre 1.6.5</DIV>
<DIV> </DIV>
<DIV>Has anyone else run into the same problem? </DIV>
<DIV>I would appreciate any help!</DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV>
<DIV>Nov 6 09:24:10 lxslc22 kernel: LustreError: 30586:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway</DIV>
<DIV>Nov 6 09:24:10 lxslc22 kernel: LustreError: 30586:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 10 previous similar messages</DIV>
<DIV>Nov 6 09:24:10 lxslc22 kernel: LustreError: 30586:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11</DIV>
<DIV>Nov 6 09:24:10 lxslc22 kernel: LustreError: 30586:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 10 previous similar messages</DIV>
<DIV>Nov 6 09:24:10 lxslc22 kernel: LustreError: 30586:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11</DIV>
<DIV>Nov 6 09:24:10 lxslc22 kernel: LustreError: 30586:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 10 previous similar messages</DIV>
<DIV>Nov 6 09:24:14 lxslc22 hm[4390]: Server went down, finding new server.</DIV>
<DIV>Nov 6 09:24:49 lxslc22 last message repeated 7 times</DIV>
<DIV>Nov 6 09:25:04 lxslc22 last message repeated 3 times</DIV>
<DIV>Nov 6 09:25:08 lxslc22 kernel: Lustre: Request x1111842 sent from MGC192.168.50.32@tcp to NID 192.168.50.32@tcp 500s ago has timed out (limit 500s).</DIV>
<DIV>Nov 6 09:25:08 lxslc22 kernel: Lustre: Skipped 29 previous similar messages</DIV>
<DIV>Nov 6 09:25:09 lxslc22 hm[4390]: Server went down, finding new server.</DIV>
<DIV>Nov 6 09:25:44 lxslc22 last message repeated 7 times</DIV>
<DIV>Nov 6 09:26:49 lxslc22 last message repeated 13 times</DIV>
<DIV>Nov 6 09:27:09 lxslc22 last message repeated 4 times</DIV>
<DIV>Nov 6 09:27:13 lxslc22 kernel: Lustre: 3728:0:(import.c:395:import_select_connection()) besfs-MDT0000-mdc-f7e14200: tried all connections, increasing latency to 51s</DIV>
<DIV>Nov 6 09:27:13 lxslc22 kernel: Lustre: 3728:0:(import.c:395:import_select_connection()) Skipped 17 previous similar messages</DIV>
<DIV>Nov 6 09:27:14 lxslc22 hm[4390]: Server went down, finding new server.</DIV>
<DIV>Nov 6 09:27:49 lxslc22 last message repeated 7 times</DIV>
<DIV>Nov 6 09:28:54 lxslc22 last message repeated 13 times</DIV>
<DIV>Nov 6 09:29:59 lxslc22 last message repeated 13 times</DIV>
<DIV>Nov 6 09:31:04 lxslc22 last message repeated 13 times</DIV>
<DIV>Nov 6 09:32:09 lxslc22 last message repeated 13 times</DIV>
<DIV>Nov 6 09:33:14 lxslc22 last message repeated 13 times</DIV>
<DIV>Nov 6 09:34:19 lxslc22 last message repeated 13 times</DIV>
<DIV>Nov 6 09:34:54 lxslc22 last message repeated 7 times</DIV>
<DIV>Nov 6 09:34:55 lxslc22 kernel: LustreError: 30521:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway</DIV>
<DIV>Nov 6 09:34:55 lxslc22 kernel: LustreError: 30521:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 10 previous similar messages</DIV>
<DIV>Nov 6 09:34:55 lxslc22 kernel: LustreError: 30521:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11</DIV>
<DIV>Nov 6 09:34:55 lxslc22 kernel: LustreError: 30521:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 10 previous similar messages</DIV>
<DIV>
<DIV>Nov 6 10:02:38 lxslc22 kernel: Lustre: Request x1113069 sent from besfs-OST0008-osc-f7e14200 to NID 192.168.50.40@tcp 500s ago has timed out (limit 500s).</DIV>
<DIV>Nov 6 10:02:38 lxslc22 kernel: Lustre: Skipped 18 previous similar messages</DIV>
<DIV>Nov 6 10:02:39 lxslc22 hm[4390]: Server went down, finding new server.</DIV>
<DIV>Nov 6 10:03:14 lxslc22 last message repeated 7 times</DIV>
<DIV>Nov 6 10:04:19 lxslc22 last message repeated 13 times</DIV>
<DIV>Nov 6 10:04:39 lxslc22 last message repeated 4 times</DIV>
<DIV>Nov 6 10:04:43 lxslc22 kernel: Lustre: 3728:0:(import.c:395:import_select_connection()) besfs-MDT0000-mdc-f7e14200: tried all connections, increasing latency to 51s</DIV>
<DIV>Nov 6 10:04:43 lxslc22 kernel: Lustre: 3728:0:(import.c:395:import_select_connection()) Skipped 34 previous similar messages</DIV>
<DIV>Nov 6 10:04:44 lxslc22 hm[4390]: Server went down, finding new server.</DIV>
<DIV>Nov 6 10:05:19 lxslc22 last message repeated 7 times</DIV>
<DIV>Nov 6 10:06:24 lxslc22 last message repeated 13 times</DIV>
<DIV>Nov 6 10:06:44 lxslc22 last message repeated 4 times</DIV>
<DIV>Nov 6 10:06:46 lxslc22 kernel: LustreError: 30582:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway</DIV>
<DIV>Nov 6 10:06:46 lxslc22 kernel: LustreError: 30582:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 9 previous similar messages</DIV>
<DIV>Nov 6 10:06:46 lxslc22 kernel: LustreError: 30582:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11</DIV>
<DIV>Nov 6 10:06:46 lxslc22 kernel: LustreError: 30582:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 9 previous similar messages</DIV>
<DIV> </DIV>
<DIV> </DIV>
<DIV>Thanks,</DIV>
<DIV>Sarea</DIV></DIV></DIV>
<DIV><FONT face=Verdana></FONT> </DIV>
<DIV align=left><FONT color=#c0c0c0 face=Verdana>2009-11-06 </FONT></DIV><FONT
face=Verdana>
<HR style="WIDTH: 122px; HEIGHT: 2px" align=left SIZE=2>
<DIV><FONT color=#c0c0c0 face=Verdana><SPAN>huangql</SPAN>
</FONT></DIV></FONT></BODY></HTML>