[Lustre-discuss] Recovery Problem
Stefano Elmopi
stefano.elmopi at sociale.it
Wed May 19 05:34:17 PDT 2010
Hi,
I have a small problem, which is probably due to my limited knowledge
of the subject.
I have a Lustre file system with one MGS/MDS node, two OSS nodes and
one client.
I start copying a large file to Lustre and, while the copy is in
progress, I restart the OSS node that is handling the writes to the
file system.
The copy process goes into the -stalled- state. When the OSS node
comes back up, I expected the copy to resume normally, but instead it
crashes.
This is the log from the MGS node:
May 19 13:43:43 mdt01prdpom kernel: Lustre: 3827:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230433 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 17s ago has timed out (17s prior to deadline).
May 19 13:43:43 mdt01prdpom kernel: req@ffff81012e11e400 x1336168048230433/t0 o400->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 192/384 e 0 to 1 dl 1274269423 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:43:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection to service lustre01-OST0000 via nid 172.16.100.121@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 19 13:44:09 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230435 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 26s ago has timed out (26s prior to deadline).
May 19 13:44:09 mdt01prdpom kernel: req@ffff81012e5f2000 x1336168048230435/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269449 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:44:37 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 2s
May 19 13:44:37 mdt01prdpom kernel: LustreError: 3828:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.16.100.121@tcp: -113
May 19 13:44:37 mdt01prdpom kernel: LustreError: 3828:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req@ffff81012d3e5800 x1336168048230437/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269504 ref 2 fl Rpc:N/0/0 rc 0/0
May 19 13:44:37 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230437 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 0s ago has failed due to network error (27s prior to deadline).
May 19 13:44:37 mdt01prdpom kernel: req@ffff81012d3e5800 x1336168048230437/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269504 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:45:33 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 3s
May 19 13:45:33 mdt01prdpom kernel: LustreError: 3828:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.16.100.121@tcp: -113
May 19 13:45:33 mdt01prdpom kernel: LustreError: 3828:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 req@ffff81012e11e400 x1336168048230441/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269561 ref 2 fl Rpc:N/0/0 rc 0/0
May 19 13:45:33 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230441 sent from lustre01-OST0000-osc to NID 172.16.100.121@tcp 0s ago has failed due to network error (28s prior to deadline).
May 19 13:45:33 mdt01prdpom kernel: req@ffff81012e11e400 x1336168048230441/t0 o8->lustre01-OST0000_UUID@172.16.100.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269561 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:46:31 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 4s
May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client was evicted by lustre01-OST0000; in progress operations using this service will fail.
May 19 13:46:31 mdt01prdpom kernel: Lustre: 4099:0:(quota_master.c:1716:mds_quota_recovery()) Only 0/2 OSTs are active, abort quota recovery
May 19 13:46:31 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection restored to service lustre01-OST0000 using nid 172.16.100.121@tcp.
May 19 13:46:31 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000: lustre01-OST0000_UUID now active, resetting orphans
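
To make the timeline in this log easier to read, here is a minimal Python sketch that picks out the three events that seem to matter (connection lost, client evicted, connection restored) and computes how long the OST was unreachable. The sample lines are copied from the log above, with wrapped lines rejoined and the archive's " at " written back as "@"; the helper name summarize() and the regular expressions are my own assumptions and not part of any Lustre tool.

```python
import re

# Sample lines taken from the log above (wrapped lines rejoined).
SAMPLE = [
    "May 19 13:43:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: "
    "Connection to service lustre01-OST0000 via nid 172.16.100.121@tcp was "
    "lost; in progress operations using this service will wait for recovery "
    "to complete.",
    "May 19 13:44:37 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:"
    "import_select_connection()) lustre01-OST0000-osc: tried all "
    "connections, increasing latency to 2s",
    "May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client "
    "was evicted by lustre01-OST0000; in progress operations using this "
    "service will fail.",
    "May 19 13:46:31 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: "
    "Connection restored to service lustre01-OST0000 using nid "
    "172.16.100.121@tcp.",
]

# Hypothetical patterns, written only against the messages seen in this log.
EVENTS = [
    ("lost", re.compile(r"Connection to service (\S+) via nid \S+ was lost")),
    ("evicted", re.compile(r"This client was evicted by ([^;\s]+)")),
    ("restored", re.compile(r"Connection restored to service (\S+) using nid")),
]

# Syslog timestamp prefix, e.g. "May 19 13:43:43".
STAMP = re.compile(r"^\w{3}\s+\d+\s+(\d\d):(\d\d):(\d\d)")


def summarize(lines):
    """Return (seconds_since_midnight, event_name, service) tuples."""
    found = []
    for line in lines:
        stamp = STAMP.match(line)
        if not stamp:
            continue
        hours, minutes, seconds = map(int, stamp.groups())
        when = hours * 3600 + minutes * 60 + seconds
        for name, pattern in EVENTS:
            hit = pattern.search(line)
            if hit:
                found.append((when, name, hit.group(1)))
    return found


events = summarize(SAMPLE)
lost_at = next(t for t, name, _ in events if name == "lost")
restored_at = next(t for t, name, _ in events if name == "restored")
outage_seconds = restored_at - lost_at

for when, name, service in events:
    print(f"{when - lost_at:4d}s after loss: {name} ({service})")
print(f"OST unreachable for about {outage_seconds}s")
```

The sketch only compares times within a single day, which is enough for this excerpt.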
Is this a timeout problem?
How can I change the timeout?
Thanks!
Ing. Stefano Elmopi
Gruppo Darco - Resp. ICT Sistemi
Via Ostiense 131/L Corpo B, 00154 Roma
cell. 3466147165
tel. 0657060500
email:stefano.elmopi at sociale.it
"Pursuant to and for the purposes of the law on the protection of
personal privacy (Legislative Decree no. 196/2003), this e-mail is
intended solely for the persons indicated above, and the information
it contains is to be considered strictly confidential. It is forbidden
to read, copy, use or disseminate the content of this e-mail without
authorization. If you have received this message in error, please
return it to the sender. Thank you"