[Lustre-discuss] lustre error about ldlm process

huangql huangql at ihep.ac.cn
Thu Nov 5 19:25:56 PST 2009


Hi, dear list,

Our login nodes have been crashing frequently these days. The symptom is that we cannot access the nodes remotely, but we can still log in from the local console. We run the command:
ps -ef | grep ldlm

root      3823     1  0 10:57 ?        00:00:00 [ldlm_bl_00]
root      3824     1  0 10:57 ?        00:00:00 [ldlm_bl_01]
root      3825     1  0 10:57 ?        00:00:00 [ldlm_bl_02]
root      3826     1  0 10:57 ?        00:00:00 [ldlm_bl_03]
root      3827     1  0 10:57 ?        00:00:00 [ldlm_bl_04]
root      3828     1  0 10:57 ?        00:00:00 [ldlm_bl_05]
root      3829     1  0 10:57 ?        00:00:00 [ldlm_bl_06]
root      3830     1  0 10:57 ?        00:00:00 [ldlm_bl_07]
root      3831     1  0 10:57 ?        00:00:00 [ldlm_cn_00]
root      3832     1  0 10:57 ?        00:00:00 [ldlm_cn_01]
root      3834     1  0 10:57 ?        00:00:00 [ldlm_cn_02]
root      3835     1  0 10:57 ?        00:00:00 [ldlm_cn_03]
root      3836     1  0 10:57 ?        00:00:00 [ldlm_cn_04]
root      3837     1  0 10:57 ?        00:00:00 [ldlm_cn_05]
root      3838     1  0 10:57 ?        00:00:00 [ldlm_cn_06]
root      3839     1  0 10:57 ?        00:00:00 [ldlm_cn_07]
root      3840     1  0 10:57 ?        00:00:00 [ldlm_cb_00]
root      3841     1  0 10:57 ?        00:00:00 [ldlm_cb_01]
root      3842     1  0 10:57 ?        00:00:00 [ldlm_cb_02]
root      3843     1  0 10:57 ?        00:00:00 [ldlm_cb_03]
root      3844     1  0 10:57 ?        00:00:00 [ldlm_cb_04]
root      3845     1  0 10:57 ?        00:00:00 [ldlm_cb_05]
root      3846     1  0 10:57 ?        00:00:00 [ldlm_cb_06]
root      3847     1  0 10:57 ?        00:00:00 [ldlm_cb_07]
...


We can see many ldlm kernel threads, up to several hundred in total. As a result, the load average is far too high (155.0 165.0 145.0) for the node to work normally. We have no idea what causes this and have to reboot the nodes. The kernel log from one of the affected nodes is included below, after the version information.
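
In case it helps with the diagnosis, this is roughly how we count the ldlm threads and look at the lock counts on an affected client. The /proc paths are written from memory for 1.6, so please treat them as approximate:

ps -ef | grep -c '\[ldlm'                            # rough count of ldlm kernel threads
cat /proc/fs/lustre/ldlm/namespaces/*/lock_count     # locks currently held, per namespace
cat /proc/fs/lustre/ldlm/namespaces/*/lru_size       # lock LRU size, per namespace
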
Our filesystem versions are:

Server: Lustre 1.6.6
Client: Lustre 1.6.5

Has anyone else encountered the same problem?
Any help would be much appreciated!


Nov  6 09:24:10 lxslc22 kernel: LustreError: 30586:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Nov  6 09:24:10 lxslc22 kernel: LustreError: 30586:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 10 previous similar messages
Nov  6 09:24:10 lxslc22 kernel: LustreError: 30586:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Nov  6 09:24:10 lxslc22 kernel: LustreError: 30586:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 10 previous similar messages
Nov  6 09:24:14 lxslc22 hm[4390]: Server went down, finding new server.
Nov  6 09:24:49 lxslc22 last message repeated 7 times
Nov  6 09:25:04 lxslc22 last message repeated 3 times
Nov  6 09:25:08 lxslc22 kernel: Lustre: Request x1111842 sent from MGC192.168.50.32 at tcp to NID 192.168.50.32 at tcp 500s ago has timed out (limit 500s).
Nov  6 09:25:08 lxslc22 kernel: Lustre: Skipped 29 previous similar messages
Nov  6 09:25:09 lxslc22 hm[4390]: Server went down, finding new server.
Nov  6 09:25:44 lxslc22 last message repeated 7 times
Nov  6 09:26:49 lxslc22 last message repeated 13 times
Nov  6 09:27:09 lxslc22 last message repeated 4 times
Nov  6 09:27:13 lxslc22 kernel: Lustre: 3728:0:(import.c:395:import_select_connection()) besfs-MDT0000-mdc-f7e14200: tried all connections, increasing latency to 51s
Nov  6 09:27:13 lxslc22 kernel: Lustre: 3728:0:(import.c:395:import_select_connection()) Skipped 17 previous similar messages
Nov  6 09:27:14 lxslc22 hm[4390]: Server went down, finding new server.
Nov  6 09:27:49 lxslc22 last message repeated 7 times
Nov  6 09:28:54 lxslc22 last message repeated 13 times
Nov  6 09:29:59 lxslc22 last message repeated 13 times
Nov  6 09:31:04 lxslc22 last message repeated 13 times
Nov  6 09:32:09 lxslc22 last message repeated 13 times
Nov  6 09:33:14 lxslc22 last message repeated 13 times
Nov  6 09:34:19 lxslc22 last message repeated 13 times
Nov  6 09:34:54 lxslc22 last message repeated 7 times
Nov  6 09:34:55 lxslc22 kernel: LustreError: 30521:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Nov  6 09:34:55 lxslc22 kernel: LustreError: 30521:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 10 previous similar messages
Nov  6 09:34:55 lxslc22 kernel: LustreError: 30521:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Nov  6 09:34:55 lxslc22 kernel: LustreError: 30521:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 10 previous similar messages
Nov  6 10:02:38 lxslc22 kernel: Lustre: Request x1113069 sent from besfs-OST0008-osc-f7e14200 to NID 192.168.50.40 at tcp 500s ago has timed out (limit 500s).
Nov  6 10:02:38 lxslc22 kernel: Lustre: Skipped 18 previous similar messages
Nov  6 10:02:39 lxslc22 hm[4390]: Server went down, finding new server.
Nov  6 10:03:14 lxslc22 last message repeated 7 times
Nov  6 10:04:19 lxslc22 last message repeated 13 times
Nov  6 10:04:39 lxslc22 last message repeated 4 times
Nov  6 10:04:43 lxslc22 kernel: Lustre: 3728:0:(import.c:395:import_select_connection()) besfs-MDT0000-mdc-f7e14200: tried all connections, increasing latency to 51s
Nov  6 10:04:43 lxslc22 kernel: Lustre: 3728:0:(import.c:395:import_select_connection()) Skipped 34 previous similar messages
Nov  6 10:04:44 lxslc22 hm[4390]: Server went down, finding new server.
Nov  6 10:05:19 lxslc22 last message repeated 7 times
Nov  6 10:06:24 lxslc22 last message repeated 13 times
Nov  6 10:06:44 lxslc22 last message repeated 4 times
Nov  6 10:06:46 lxslc22 kernel: LustreError: 30582:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Nov  6 10:06:46 lxslc22 kernel: LustreError: 30582:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 9 previous similar messages
Nov  6 10:06:46 lxslc22 kernel: LustreError: 30582:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Nov  6 10:06:46 lxslc22 kernel: LustreError: 30582:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 9 previous similar messages
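
For reference, this is how we would check from the client whether the servers shown in the timeout messages above are still reachable (assuming lctl is available on the client; the NIDs are taken from our own log):

lctl ping 192.168.50.32@tcp     # MGS/MDS NID from the timeout messages
lctl ping 192.168.50.40@tcp     # OST NID from the timeout messages
lctl dl                         # list configured devices and their state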


Thanks,
Sarea

2009-11-06 



huangql 