[Lustre-discuss] Aborting recovery

Thomas Roth t.roth at gsi.de
Fri Mar 6 11:09:25 PST 2009


Brian J. Murrell wrote:
> On Fri, 2009-03-06 at 10:45 +0100, Thomas Roth wrote:
>> Thanks Brian.
> 
> NP.
> 
>> What I meant: the average batch job that wants to read from or write to
>> Lustre will abort if a file cannot be accessed. The reason doesn't
>> matter to the jobs or the user.
> 
> That may be so, but what I am saying is that when a lustre client wants
> to perform an i/o operation on behalf of an application running on that
> machine and the target it wants to do the i/o with is down, the lustre
> client will wait and block the application's i/o indefinitely.
> 
> That means that unless the application has some kind of timer in it so
> that it can abort the read(2)/write(2), it will wait forever as the
> read(2) or write(2) system call that it issued will simply wait for the
> lustre client to complete -- forever, if the target that the lustre
> client wanted to do the i/o with never comes back.
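(For other readers following this: the point, as I understand it, is that
unless the application itself arms some kind of timer around its
read(2)/write(2), the blocked system call never returns. A minimal sketch of
such a timer - purely hypothetical, not taken from our users' jobs; the
function name and the "no SA_RESTART" choice are my own:)

/* Hypothetical sketch: abort a blocking read(2) with an application-side
 * alarm timer.  Without SA_RESTART the interrupted read() returns -1/EINTR
 * instead of blocking forever while the OST is unreachable. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void on_alarm(int sig) { (void)sig; /* only here to interrupt the syscall */ }

ssize_t read_with_timeout(int fd, void *buf, size_t len, unsigned int seconds)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;           /* deliberately no SA_RESTART */
    sigaction(SIGALRM, &sa, NULL);

    alarm(seconds);                     /* arm the timer */
    ssize_t n = read(fd, buf, len);     /* would otherwise block indefinitely */
    int saved_errno = errno;
    alarm(0);                           /* disarm the timer again */

    if (n < 0 && saved_errno == EINTR)
        fprintf(stderr, "read timed out after %u s\n", seconds);
    errno = saved_errno;
    return n;
}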

But this is not what our users observe. Even on an otherwise perfectly
working system, they report I/O errors on access to some files. These
errors are distributed randomly over time, servers, clients and jobs, and
the files are always there and undamaged when checked afterwards.
I can usually see something happening in the logs of the OST and the
client: the OST starts with "timeout on bulk PUT after 6+0s", which it at
first handles by "ignoring bulk IO comm error" in the hope that the
"client will retry". But then the OST says "Request x49132126 took longer
than estimated (7+5s); client may timeout." Of course, I might be
connecting error messages that are actually independent of each other.
The client's log is easier to read: "Request ... has timed out
(limit 7s)", "Connection to service was lost; in progress operations
using this service will fail", and finally "Connection restored to service".

I guess I should append the full lines to this mail.
Anyhow, during these minutes the user's job reports "SysError: error
reading from file..." - and aborts.

The network connection between client and OSS is admittedly rather buggy,
but interruptions should only last on the order of seconds. So if the
client were just blocking the application's I/O, it should get over these
smoothly? The connection is restored one second after the first error
message, so both sides can see each other again almost immediately.
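
If such a one-second outage really does surface as an I/O error in the
application, I suppose a job could in principle retry the read instead of
aborting. A hypothetical sketch - function name and retry policy are my own,
not something our users actually run:

/* Hypothetical workaround sketch: retry a read that failed with EIO,
 * on the assumption that the OST connection is usually restored within
 * a second or two (as in the logs below). */
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

ssize_t pread_retry(int fd, void *buf, size_t len, off_t off, int attempts)
{
    ssize_t n = -1;
    for (int i = 0; i < attempts; i++) {
        n = pread(fd, buf, len, off);   /* re-issue from the same offset */
        if (n >= 0 || errno != EIO)
            return n;                   /* success, or an unrelated error */
        sleep(2);                       /* wait for "Connection restored" */
    }
    return n;                           /* still EIO after all retries */
}

Whether that would be reasonable for batch jobs is another question; it only
papers over the eviction.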

-------------------------------------------
-------------------------------------------
kern.log on the OSS:
-------------------------------------------
Mar  6 11:23:49 OSS100 kernel: [14066606.171264] LustreError: 15145:0:(ost_handler.c:868:ost_brw_read()) @@@ timeout on bulk PUT after 6+0s  req@ffff81021fa8d938 x93767754/t0 o3->a4affbd5-7a9c-218a-b01f-5d9a73e6e3c3@NET_0x200008cb56e9e_UUID:0/0 lens 384/336 e 0 to 0 dl 1236335029 ref 1 fl Interpret:/0/0 rc 0/0
Mar  6 11:23:50 OSS100 kernel: [14066607.615287] Lustre: 14771:0:(ldlm_lib.c:525:target_handle_reconnect()) gsilust-OST0027: 1011add7-13d1-0efc-65e3-14f94eab88c4 reconnecting
Mar  6 11:23:50 OSS100 kernel: [14066607.616000] Lustre: 15112:0:(ost_handler.c:1270:ost_brw_write()) gsilust-OST0027: ignoring bulk IO comm error with 6bb22b4d-3509-6c5d-ab95-85377cf88ade@NET_0x200008cb56a8a_UUID id 12345-CLIENT_A@tcp - client will retry
Mar  6 11:23:50 OSS100 kernel: [14066607.616105] Lustre: 15112:0:(service.c:1064:ptlrpc_server_handle_request()) @@@ Request x98164002 took longer than estimated (7+4s); client may timeout. req@ffff8101557cbad0 x98164002/t0 o4->6bb22b4d-3509-6c5d-ab95-85377cf88ade@NET_0x200008cb56a8a_UUID:0/0 lens 384/352 e 0 to 0 dl 1236335026 ref 1 fl Complete:/0/0 rc 0/0
Mar  6 11:23:50 OSS100 kernel: [14066607.616492] LustreError: 21161:0:(service.c:568:ptlrpc_check_req()) @@@ DROPPING req from old connection 1 < 2  req@ffff81015fc066e0 x520348/t0 o13->gsilust-mdtlov_UUID@NET_0x200008cb57036_UUID:0/0 lens 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0
Mar  6 11:23:50 OSS100 kernel: [14066607.616591] Lustre: 15145:0:(ost_handler.c:925:ost_brw_read()) gsilust-OST0027: ignoring bulk IO comm error with a4affbd5-7a9c-218a-b01f-5d9a73e6e3c3@NET_0x200008cb56e9e_UUID id 12345-CLIENT_B@tcp - client will retry
Mar  6 11:23:50 OSS100 kernel: [14066607.623479] Lustre: 15094:0:(service.c:1064:ptlrpc_server_handle_request()) @@@ Request x49132126 took longer than estimated (7+5s); client may timeout. req@ffff810152edd2b0 x49132126/t0 o3->fe639162-9537-9ab5-36f5-fba5c0a43171@NET_0x200008cb572f6_UUID:0/0 lens 384/336 e 0 to 0 dl 1236335025 ref 1 fl Complete:/0/0 rc 0/0
Mar  6 11:24:24 OSS100 kernel: [14066641.871733] Lustre: 14753:0:(ldlm_lib.c:525:target_handle_reconnect()) gsilust-OST0027: a4affbd5-7a9c-218a-b01f-5d9a73e6e3c3 reconnecting

-------------------------------------------
-------------------------------------------
kern.log on CLIENT_A:
-------------------------------------------

Mar  6 11:24:24 CLIENT_A kernel: Lustre: Request x93767754 sent from gsilust-OST0027-osc-ffff810808e68c00 to NID OSS100@tcp 41s ago has timed out (limit 7s).
Mar  6 11:24:24 CLIENT_A kernel: LustreError: 166-1: gsilust-OST0027-osc-ffff810808e68c00: Connection to service gsilust-OST0027 via nid OSS100@tcp was lost; in progress operations using this service will fail.
Mar  6 11:24:24 CLIENT_A kernel: LustreError: 167-0: This client was evicted by gsilust-OST0027; in progress operations using this service will fail.
Mar  6 11:24:25 CLIENT_A kernel: Lustre: gsilust-OST0027-osc-ffff810808e68c00: Connection restored to service gsilust-OST0027 using nid OSS100@tcp.

--------------------------------------------------------------------------

Regards,
Thomas



