[Lustre-discuss] help

Brian O'Connor briano at sgi.com
Fri Sep 30 00:09:47 PDT 2011


Hello Ashok

Is the cluster hanging or otherwise behaving badly? The logs below show
that the client lost its connection to 10.148.0.106 for 10 seconds or so.
It should have recovered OK.
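
As a rough way to check that figure, here is a minimal Python sketch that measures the gap between the first "was lost" message and the last "Connection restored" message. It assumes the bracketed kernel timestamp is in seconds and that each log message sits on a single line (the mail client wrapped them in the quote below). The function name outage_seconds is made up for illustration, and the sample lines are three messages copied from the quoted log.

```python
import re

# Kernel log lines carry a bracketed timestamp, e.g. [343138.837233].
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def outage_seconds(lines):
    """Seconds from the first 'was lost' to the last 'Connection restored'.

    Returns None if either message is missing from the input.
    """
    lost = restored = None
    for line in lines:
        match = STAMP.search(line)
        if not match:
            continue
        stamp = float(match.group(1))
        if "was lost" in line:
            if lost is None:
                lost = stamp  # keep only the first loss
        elif "Connection restored" in line:
            restored = stamp  # keep the most recent restore
    if lost is None or restored is None:
        return None
    return restored - lost


# Three messages from the quoted log, unwrapped onto single lines.
sample = [
    "Sep 30 08:40:23 service0 kernel: [343138.837233] Lustre: "
    "lustre-OST0008-osc-ffff880b272cf800: Connection to service "
    "lustre-OST0008 via nid 10.148.0.106@o2ib was lost; in progress "
    "operations using this service will wait for recovery to complete.",
    "Sep 30 08:40:36 service0 kernel: [343151.594742] Lustre: "
    "lustre-OST0000-osc-ffff880b272cf800: Connection restored to service "
    "lustre-OST0000 using nid 10.148.0.106@o2ib.",
    "Sep 30 08:40:37 service0 kernel: [343152.842631] Lustre: "
    "lustre-OST0003-osc-ffff880b272cf800: Connection restored to service "
    "lustre-OST0003 using nid 10.148.0.106@o2ib.",
]

print(f"outage: {outage_seconds(sample):.1f} s")
```

Run against the full log rather than the sample, this would also show whether the outage repeats over time, which is worth mentioning when reporting back to the list.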

If you want further help from the list, you need to add more detail about
the cluster, i.e. a general description of the number of OSSs/OSTs and
clients, the version of Lustre, etc., and a description of what is
actually going wrong, e.g. hanging, offline, etc.

The first thing is to check the infrastructure; in this case, you should
check your IB network for errors.
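
One possible way to do a first pass on the IB side is a small Python sketch that reads the per-port error counters Linux exposes under /sys/class/infiniband and reports any that are non-zero. This assumes your HCA driver populates those sysfs counter files. The choice of counters and the function name nonzero_ib_errors are illustrative assumptions, not a complete fabric check, and it only looks at the local node's ports, not the switches.

```python
from pathlib import Path

# A few standard error counters found under
# /sys/class/infiniband/<device>/ports/<port>/counters/ (assumed present).
ERROR_COUNTERS = (
    "symbol_error",
    "link_error_recovery",
    "link_downed",
    "port_rcv_errors",
    "port_xmit_discards",
)


def nonzero_ib_errors(root="/sys/class/infiniband"):
    """Return {"<device>/port<N>": {counter: value}} for non-zero counters."""
    found = {}
    for counter_dir in sorted(Path(root).glob("*/ports/*/counters")):
        port_dir = counter_dir.parent            # .../<device>/ports/<N>
        device = port_dir.parent.parent.name     # <device>
        key = f"{device}/port{port_dir.name}"
        for name in ERROR_COUNTERS:
            try:
                value = int((counter_dir / name).read_text().strip())
            except (OSError, ValueError):
                continue  # counter missing or unreadable on this port
            if value:
                found.setdefault(key, {})[name] = value
    return found


# On a node without InfiniBand hardware this simply prints nothing.
for port, counters in nonzero_ib_errors().items():
    print(port, counters)
```

Counters are cumulative, so a non-zero value only matters if it keeps growing; comparing two runs a few minutes apart on both the client and the OSS at 10.148.0.106 would be more telling than a single reading.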



On 30-September-2011 2:39 PM, Ashok nulguda wrote:
> Dear All,
>
> I am seeing Lustre errors on my HPC cluster, as shown below. Can anyone
> please help me resolve this problem?
> Thanks in Advance.
> Sep 30 08:40:23 service0 kernel: [343138.837222] Lustre: 
> 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 1 previous 
> similar message
> Sep 30 08:40:23 service0 kernel: [343138.837233] Lustre: 
> lustre-OST0008-osc-ffff880b272cf800: Connection to service 
> lustre-OST0008 via nid 10.148.0.106@o2ib was lost; in progress 
> operations using this service will wait for recovery to complete.
> Sep 30 08:40:24 service0 kernel: [343139.837260] Lustre: 
> 8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request 
> x1380984193067288 sent from lustre-OST0006-osc-ffff880b272cf800 to NID 
> 10.148.0.106@o2ib 7s ago has timed out (7s prior to deadline).
> Sep 30 08:40:24 service0 kernel: [343139.837263]   
> req@ffff880a5f800c00 x1380984193067288/t0 
> o3->lustre-OST0006_UUID@10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl 
> 1317352224 ref 2 fl Rpc:/0/0 rc 0/0
> Sep 30 08:40:24 service0 kernel: [343139.837269] Lustre: 
> 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 38 previous 
> similar messages
> Sep 30 08:40:24 service0 kernel: [343140.129284] LustreError: 
> 9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from 
> cancel RPC: canceling anyway
> Sep 30 08:40:24 service0 kernel: [343140.129290] LustreError: 
> 9983:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Skipped 1 previous 
> similar message
> Sep 30 08:40:24 service0 kernel: [343140.129295] LustreError: 
> 9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) 
> ldlm_cli_cancel_list: -11
> Sep 30 08:40:24 service0 kernel: [343140.129299] LustreError: 
> 9983:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) Skipped 1 previous 
> similar message
> Sep 30 08:40:25 service0 kernel: [343140.837308] Lustre: 
> 8300:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request 
> x1380984193067299 sent from lustre-OST0010-osc-ffff880b272cf800 to NID 
> 10.148.0.106@o2ib 7s ago has timed out (7s prior to deadline).
> Sep 30 08:40:25 service0 kernel: [343140.837311]   
> req@ffff880a557c4400 x1380984193067299/t0 
> o3->lustre-OST0010_UUID@10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl 
> 1317352225 ref 2 fl Rpc:/0/0 rc 0/0
> Sep 30 08:40:25 service0 kernel: [343140.837316] Lustre: 
> 8300:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 4 previous 
> similar messages
> Sep 30 08:40:26 service0 kernel: [343141.245365] LustreError: 
> 30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from 
> cancel RPC: canceling anyway
> Sep 30 08:40:26 service0 kernel: [343141.245371] LustreError: 
> 22729:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) 
> ldlm_cli_cancel_list: -11
> Sep 30 08:40:26 service0 kernel: [343141.245378] LustreError: 
> 30978:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Skipped 1 previous 
> similar message
> Sep 30 08:40:33 service0 kernel: [343148.245683] Lustre: 
> 22725:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request 
> x1380984193067302 sent from lustre-OST0004-osc-ffff880b272cf800 to NID 
> 10.148.0.106@o2ib 14s ago has timed out (14s prior to deadline).
> Sep 30 08:40:33 service0 kernel: [343148.245686]   
> req@ffff8805c879e800 x1380984193067302/t0 
> o103->lustre-OST0004_UUID@10.148.0.106@o2ib:17/18 lens 296/384 e 0 to 
> 1 dl 1317352233 ref 1 fl Rpc:N/0/0 rc 0/0
> Sep 30 08:40:33 service0 kernel: [343148.245692] Lustre: 
> 22725:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 2 previous 
> similar messages
> Sep 30 08:40:33 service0 kernel: [343148.245708] LustreError: 
> 22725:0:(ldlm_request.c:1025:ldlm_cli_cancel_req()) Got rc -11 from 
> cancel RPC: canceling anyway
> Sep 30 08:40:33 service0 kernel: [343148.245714] LustreError: 
> 22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) 
> ldlm_cli_cancel_list: -11
> Sep 30 08:40:33 service0 kernel: [343148.245717] LustreError: 
> 22725:0:(ldlm_request.c:1587:ldlm_cli_cancel_list()) Skipped 1 
> previous similar message
> Sep 30 08:40:36 service0 kernel: [343151.548005] LustreError: 11-0: an 
> error occurred while communicating with 10.148.0.106@o2ib. The 
> ost_connect operation failed with -16
> Sep 30 08:40:36 service0 kernel: [343151.548008] LustreError: Skipped 
> 1 previous similar message
> Sep 30 08:40:36 service0 kernel: [343151.548024] LustreError: 167-0: 
> This client was evicted by lustre-OST000b; in progress operations 
> using this service will fail.
> Sep 30 08:40:36 service0 kernel: [343151.548250] LustreError: 
> 30452:0:(llite_mmap.c:210:ll_tree_unlock()) couldn't unlock -5
> Sep 30 08:40:36 service0 kernel: [343151.550210] LustreError: 
> 8300:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  
> req@ffff88049528c400 x1380984193067406/t0 
> o3->lustre-OST000b_UUID@10.148.0.106@o2ib:6/4 lens 448/592 e 0 to 1 dl 
> 0 ref 2 fl Rpc:/0/0 rc 0/0
> Sep 30 08:40:36 service0 kernel: [343151.594742] Lustre: 
> lustre-OST0000-osc-ffff880b272cf800: Connection restored to service 
> lustre-OST0000 using nid 10.148.0.106@o2ib.
> Sep 30 08:40:36 service0 kernel: [343151.837203] Lustre: 
> lustre-OST0006-osc-ffff880b272cf800: Connection restored to service 
> lustre-OST0006 using nid 10.148.0.106@o2ib.
> Sep 30 08:40:37 service0 kernel: [343152.842631] Lustre: 
> lustre-OST0003-osc-ffff880b272cf800: Connection restored to service 
> lustre-OST0003 using nid 10.148.0.106@o2ib.
> Sep 30 08:40:37 service0 kernel: [343152.842636] Lustre: Skipped 3 
> previous similar messages
>
>
> Thanks and Regards
> Ashok
>
> -- 
> *Ashok Nulguda
> *
> *TATA ELXSI LTD*
> *Mb : +91 9689945767
> *
> *Email :ashokn at tataelxsi.co.in <mailto:tshrikant at tataelxsi.co.in>*


-- 
Brian O'Connor
-------------------------------------------------
SGI Consulting
Email: briano at sgi.com, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA http://www.sgi.com/support/services
-------------------------------------------------