[Lustre-discuss] Recovery Problem

Stefano Elmopi stefano.elmopi at sociale.it
Wed May 19 05:34:17 PDT 2010



Hi,

I have a small problem, which is surely down to my limited knowledge of
the subject.
I have a Lustre file system with one MGS/MDS node, two OSS nodes and one
client.
I start copying a large file to Lustre and, while the copy is running,
I reboot the OSS node that is handling the writes to the file system.
The copy process goes into the -stalled- state. When the OSS node comes
back up, I expected the copy to resume normally, but instead it crashes.
This is the log from the MGS node:

May 19 13:43:43 mdt01prdpom kernel: Lustre: 3827:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230433 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 17s ago has timed out (17s prior to deadline).
May 19 13:43:43 mdt01prdpom kernel:   req@ffff81012e11e400 x1336168048230433/t0 o400->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 192/384 e 0 to 1 dl 1274269423 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:43:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection to service lustre01-OST0000 via nid 172.16.100.121@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 19 13:44:09 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230435 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 26s ago has timed out (26s prior to deadline).
May 19 13:44:09 mdt01prdpom kernel:   req@ffff81012e5f2000 x1336168048230435/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269449 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:44:37 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 2s
May 19 13:44:37 mdt01prdpom kernel: LustreError: 3828:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.16.100.121@tcp: -113
May 19 13:44:37 mdt01prdpom kernel: LustreError: 3828:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req@ffff81012d3e5800 x1336168048230437/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269504 ref 2 fl Rpc:N/0/0 rc 0/0
May 19 13:44:37 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230437 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 0s ago has failed due to network error (27s prior to deadline).
May 19 13:44:37 mdt01prdpom kernel:   req@ffff81012d3e5800 x1336168048230437/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269504 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:45:33 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 3s
May 19 13:45:33 mdt01prdpom kernel: LustreError: 3828:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.16.100.121@tcp: -113
May 19 13:45:33 mdt01prdpom kernel: LustreError: 3828:0:(events.c:66:request_out_callback()) @@@ type 4, status -113  req@ffff81012e11e400 x1336168048230441/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269561 ref 2 fl Rpc:N/0/0 rc 0/0
May 19 13:45:33 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230441 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 0s ago has failed due to network error (28s prior to deadline).
May 19 13:45:33 mdt01prdpom kernel:   req@ffff81012e11e400 x1336168048230441/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269561 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:46:31 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 4s
May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client was evicted by lustre01-OST0000; in progress operations using this service will fail.
May 19 13:46:31 mdt01prdpom kernel: Lustre: 4099:0:(quota_master.c:1716:mds_quota_recovery()) Only 0/2 OSTs are active, abort quota recovery
May 19 13:46:31 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection restored to service lustre01-OST0000 using nid 172.16.100.121@tcp.
May 19 13:46:31 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000: lustre01-OST0000_UUID now active, resetting orphans
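
To make the timing in the log easier to see, here is a minimal Python sketch that pulls out the three key events (connection lost, eviction, connection restored) and measures the gap between the first and the last. The three sample lines are copied from the log above, joined onto one line each with the archive's " at " written back as "@". The names LOG, EVENTS, parse_time and timeline are only illustrative, and the year 2010 is an assumption taken from the date of this message, since syslog timestamps carry no year.

```python
import re
from datetime import datetime

# Three lines copied from the log above, each joined onto a single line.
LOG = """\
May 19 13:43:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection to service lustre01-OST0000 via nid 172.16.100.121@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client was evicted by lustre01-OST0000; in progress operations using this service will fail.
May 19 13:46:31 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection restored to service lustre01-OST0000 using nid 172.16.100.121@tcp.
"""

# Hypothetical event names mapped to substrings that appear in the log text.
EVENTS = {
    "lost": "Connection to service",
    "evicted": "was evicted by",
    "restored": "Connection restored",
}


def parse_time(line, year=2010):
    """Parse the leading 'May 19 13:43:43' syslog stamp (year is assumed)."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def timeline(text):
    """Return the first timestamp seen for each event in EVENTS."""
    found = {}
    for line in text.splitlines():
        for name, marker in EVENTS.items():
            if marker in line and name not in found:
                found[name] = parse_time(line)
    return found


events = timeline(LOG)
outage = (events["restored"] - events["lost"]).total_seconds()
print(f"seconds between 'lost' and 'restored': {outage:.0f}")
```

On these lines the eviction and the restored connection carry the same timestamp, so the eviction is reported at the moment the OST becomes reachable again, not while it is down.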

Is this a timeout problem?
How can I change the timeout?

Thanks!



Ing. Stefano Elmopi
Gruppo Darco - Resp. ICT Sistemi
Via Ostiense 131/L Corpo B, 00154 Roma

cell. 3466147165
tel.  0657060500
email:stefano.elmopi at sociale.it

"Pursuant to and for the purposes of the law on the protection of personal
privacy (Legislative Decree no. 196/2003), this e-mail is intended solely
for the persons indicated above, and the information it contains is to be
considered strictly confidential. Reading, copying, using or disseminating
the contents of this e-mail without authorization is prohibited. If you
have received this message in error, please return it to the sender.
Thank you."
