[Lustre-discuss] Recovery Problem

Andreas Dilger andreas.dilger at oracle.com
Wed May 19 08:07:19 PDT 2010


More important is to include the crash message from the client and the  
version of Lustre you are using.

Cheers, Andreas

On 2010-05-19, at 6:34, Stefano Elmopi <stefano.elmopi at sociale.it>  
wrote:

>
>
> Hi,
>
> I have a small problem, which is surely due to my limited knowledge
> of the subject.
> I have a Lustre file system with one MGS/MDS node, two OSS nodes, and
> one client.
> I start copying a large file to Lustre and, while the copy is in progress,
> I restart the OSS node that is handling the writes to the file
> system.
> The copy process goes into the -stalled- state, and when the OSS node
> is back up
> I expected the copy process to resume normally, but instead it crashes.
> This is the log from the MGS node:
>
> May 19 13:43:43 mdt01prdpom kernel: Lustre: 3827:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230433 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 17s ago has timed out (17s prior to deadline).
> May 19 13:43:43 mdt01prdpom kernel:   req@ffff81012e11e400 x1336168048230433/t0 o400->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 192/384 e 0 to 1 dl 1274269423 ref 1 fl Rpc:N/0/0 rc 0/0
> May 19 13:43:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection to service lustre01-OST0000 via nid 172.16.100.121@tcp was lost; in progress operations using this service will wait for recovery to complete.
> May 19 13:44:09 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230435 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 26s ago has timed out (26s prior to deadline).
> May 19 13:44:09 mdt01prdpom kernel:   req@ffff81012e5f2000 x1336168048230435/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269449 ref 1 fl Rpc:N/0/0 rc 0/0
> May 19 13:44:37 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 2s
> May 19 13:44:37 mdt01prdpom kernel: LustreError: 3828:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.16.100.121@tcp: -113
> May 19 13:44:37 mdt01prdpom kernel: LustreError: 3828:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req@ffff81012d3e5800 x1336168048230437/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269504 ref 2 fl Rpc:N/0/0 rc 0/0
> May 19 13:44:37 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230437 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 0s ago has failed due to network error (27s prior to deadline).
> May 19 13:44:37 mdt01prdpom kernel:   req@ffff81012d3e5800 x1336168048230437/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269504 ref 1 fl Rpc:N/0/0 rc 0/0
> May 19 13:45:33 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 3s
> May 19 13:45:33 mdt01prdpom kernel: LustreError: 3828:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.16.100.121@tcp: -113
> May 19 13:45:33 mdt01prdpom kernel: LustreError: 3828:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req@ffff81012e11e400 x1336168048230441/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269561 ref 2 fl Rpc:N/0/0 rc 0/0
> May 19 13:45:33 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230441 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 0s ago has failed due to network error (28s prior to deadline).
> May 19 13:45:33 mdt01prdpom kernel:   req@ffff81012e11e400 x1336168048230441/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269561 ref 1 fl Rpc:N/0/0 rc 0/0
> May 19 13:46:31 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 4s
> May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client was evicted by lustre01-OST0000; in progress operations using this service will fail.
> May 19 13:46:31 mdt01prdpom kernel: Lustre: 4099:0:(quota_master.c:1716:mds_quota_recovery()) Only 0/2 OSTs are active, abort quota recovery
> May 19 13:46:31 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection restored to service lustre01-OST0000 using nid 172.16.100.121@tcp.
> May 19 13:46:31 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000: lustre01-OST0000_UUID now active, resetting orphans
>
> Is this a timeout problem?
> How can I change the timeout?
>
> Thanks !!!
>
>
>
> Ing. Stefano Elmopi
> Gruppo Darco - Resp. ICT Sistemi
> Via Ostiense 131/L Corpo B, 00154 Roma
>
> cell. 3466147165
> tel.  0657060500
> email:stefano.elmopi at sociale.it
>
> "Pursuant to and for the purposes of the law on the protection of
> personal privacy (D.lgs n. 196/2003), this email is intended solely
> for the persons indicated above, and the information it contains is
> to be considered strictly confidential. Reading, copying, using, or
> disclosing the contents of this email without authorization is
> prohibited. If you have received this message in error, please
> return it to the sender. Thank you."
>
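When reading a log like the one quoted above, it can help to reduce it to a timeline of the key events (timeouts, network errors, eviction, reconnection). The following is only a minimal sketch: the function name `classify`, the labels in `EVENTS`, and the regular expression are assumptions made for this illustration, not part of any Lustre tool. The sample lines are taken from the quoted log, with the mail line-wrapping removed and the NID written as `172.16.100.121@tcp`.

```python
import re

# Syslog prefix plus the Lustre severity tag, e.g.
# "May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client ..."
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<level>Lustre(?:Error)?):\s+(?P<msg>.*)$"
)

# Hypothetical labels for this sketch; each needle is a substring
# that appears verbatim in the quoted log messages.
EVENTS = {
    "timeout": "has timed out",
    "network_error": "failed due to network error",
    "connection_lost": "was lost",
    "evicted": "was evicted by",
    "restored": "Connection restored",
}


def classify(lines):
    """Yield (timestamp, label) for each line that matches a known event."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # continuation lines such as "req@..." are skipped
        for label, needle in EVENTS.items():
            if needle in m.group("msg"):
                yield m.group("stamp"), label
                break


sample = [
    "May 19 13:43:43 mdt01prdpom kernel: Lustre: 3827:0:(client.c:1463:"
    "ptlrpc_expire_one_request()) @@@ Request x1336168048230433 sent from "
    "lustre01-OST0000-osc to NID 172.16.100.121@tcp 17s ago has timed out "
    "(17s prior to deadline).",
    "May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client "
    "was evicted by lustre01-OST0000; in progress operations using this "
    "service will fail.",
    "May 19 13:46:31 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: "
    "Connection restored to service lustre01-OST0000 using nid "
    "172.16.100.121@tcp.",
]

events = list(classify(sample))
for stamp, label in events:
    print(stamp, label)
```

Run against the full log, a timeline like this would show the eviction and the reconnection sharing the same timestamp (13:46:31), which is the kind of detail worth pointing out alongside the client-side crash message.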