[Lustre-discuss] ksocknal_process_receive() Error -14 / Error -14 on read from ...

Gerd busker at busker.org
Thu Mar 12 08:29:40 PDT 2009


Hi,

We have a Lustre 1.6.6 installation using InfiniBand-attached DDN OST
storage and OSSes connected to the network with 10GE adapters.  When
running iozone with ~40 1GE-attached clients, we see the following on the
clients:

Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff8100a01c4000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff810050164000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff81031b920000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff81032192a000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff81001b20c000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff810128406000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff81018c6c2000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff810067fce000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff8102a7c62000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 
req@ffff81037f08b000 x35161916/t0
o4->test1-OST0008_UUID@172.23.125.14@tcp:6/4 lens 384/480 e 0 to 100 dl
1236869066 ref 3 fl Rpc:/0/0 rc 0/0
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:66:request_out_callback()) Skipped 11 previous similar
messages
Mar 12 14:42:46 com01-06 kernel: Lustre: Request x35161916 sent from
test1-OST0008-osc-ffff810324e8e000 to NID 172.23.125.14@tcp 0s ago has
timed out (limit 100s).
Mar 12 14:42:46 com01-06 kernel: Lustre: Skipped 8 previous similar messages
Mar 12 14:42:46 com01-06 kernel: Lustre:
test1-OST0008-osc-ffff810324e8e000: Connection to service test1-OST0008
via nid 172.23.125.14@tcp was lost; in progress operations using this
service will wait for recovery to complete.




And this on the OSS:

Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5469:0:(socklnd_cb.c:1291:ksocknal_process_receive()) [ffff81001f6fc000]
Error -14 on read from 12345-172.23.98.133@tcp ip 172.23.98.133:1021
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5469:0:(socklnd_cb.c:1291:ksocknal_process_receive()) Skipped 5 previous
similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(socklnd.c:1631:ksocknal_destroy_conn()) Completing partial
receive from 12345-172.23.98.133@tcp, ip 172.23.98.133:1021, with error
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(socklnd.c:1631:ksocknal_destroy_conn()) Skipped 4 previous
similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5,
desc ffff810049430000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
6699:0:(ost_handler.c:1153:ost_brw_write()) @@@ network error on bulk
GET 0(1048576)  req@ffff81007792dc50 x35161902/t0
o4->8ec45cac-9f38-63c9-eb19-b4bad0242b73@NET_0x20000ac176285_UUID:0/0
lens 384/352 e 0 to 0 dl 1236869066 ref 1 fl Interpret:/0/0 rc 0/0
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
6699:0:(ost_handler.c:1153:ost_brw_write()) Skipped 4 previous similar
messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5,
desc ffff8100528b2000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre:
6680:0:(ost_handler.c:1284:ost_brw_write()) test1-OST0010: ignoring bulk
IO comm error with
bfb4f76d-1090-a175-89cd-7f51df10cc68@NET_0x20000ac17628d_UUID id
12345-172.23.98.141@tcp - client will retry
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre:
6680:0:(ost_handler.c:1284:ost_brw_write()) Skipped 85 previous similar
messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5,
desc ffff8100633fa000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5,
desc ffff81007ea56000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5,
desc ffff8100690ea000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5,
desc ffff810044aa0000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre:
6509:0:(ldlm_lib.c:538:target_handle_reconnect()) test1-OST0008:
8ec45cac-9f38-63c9-eb19-b4bad0242b73 reconnecting
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre:
6509:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 8 previous
similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre:
6509:0:(ldlm_lib.c:773:target_handle_connect()) test1-OST0008: refuse
reconnection from
8ec45cac-9f38-63c9-eb19-b4bad0242b73@172.23.98.133@tcp to
0xffff810023258000; still busy with 12 active RPCs
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre:
6509:0:(ldlm_lib.c:773:target_handle_connect()) Skipped 5 previous
similar messages



What could explain this behaviour?  What does Error -14 mean?
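
For reference, if the status values in these messages are negated Linux
errno codes (that is my assumption; the logs themselves do not say so),
then 5 would be EIO and 14 would be EFAULT.  A minimal Python sketch for
looking such values up follows; the helper name decode_status is made up
for illustration and is not part of Lustre:

```python
import errno
import os


def decode_status(status: int) -> str:
    """Map a negated errno (e.g. -14) to 'NAME (message)'.

    Assumes the value is a negated errno code, as kernel code
    conventionally returns; this is an assumption about the logs.
    """
    code = abs(status)
    # errno.errorcode maps numeric codes to symbolic names.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the platform's human-readable message.
    return f"{name} ({os.strerror(code)})"


if __name__ == "__main__":
    # The two status values that appear in the logs above.
    for status in (-5, -14):
        print(status, decode_status(status))
```

This only names the codes; it does not explain why the socket read on the
OSS would fail with that status.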

Gerd.



