[Lustre-discuss] Clients losing connection to an OSS.

Christopher J. Walker C.J.Walker at qmul.ac.uk
Thu Jul 8 10:32:39 PDT 2010


With 1.8.3 clients and 1.8.3 OSSs, a couple of my nodes seem to have 
lost connection to an OSS. If I do lfs df, I get the following:


lustre_0-OST0028_UUID: Resource temporarily unavailable
lustre_0-OST0029_UUID: Resource temporarily unavailable
lustre_0-OST002a_UUID: Resource temporarily unavailable
lustre_0-OST002b_UUID: Resource temporarily unavailable
lustre_0-OST002c_UUID  6486115712  3882764932  2603348732  59% /mnt/lustre_0[OST:44]
lustre_0-OST002d_UUID  6486115712  3797895540  2688209196  58% /mnt/lustre_0[OST:45]
lustre_0-OST002e_UUID  6486115712  3717364684  2768740788  57% /mnt/lustre_0[OST:46]
lustre_0-OST002f_UUID  6486115712  3535928996  2950180572  54% /mnt/lustre_0[OST:47]
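
For comparing which targets are affected across several nodes, a minimal sketch along these lines could pick the failing entries out of saved lfs df output. It is not part of any Lustre tooling; the function name unavailable_targets is hypothetical, and it assumes error lines have the "<target>_UUID: <message>" shape seen above.

```python
import re

# Assumed shape of an error line, taken from the output above:
#   lustre_0-OST0028_UUID: Resource temporarily unavailable
ERROR_LINE = re.compile(r"^(?P<target>\S+_UUID)\s*:\s*(?P<reason>.+?)\s*$")


def unavailable_targets(lfs_df_output):
    """Return (target, reason) pairs for targets that lfs df reported an error for."""
    found = []
    for line in lfs_df_output.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            found.append((match.group("target"), match.group("reason")))
    return found


if __name__ == "__main__":
    sample = (
        "lustre_0-OST002b_UUID: Resource temporarily unavailable\n"
        "lustre_0-OST002c_UUID  6486115712  3882764932  2603348732  59% "
        "/mnt/lustre_0[OST:44]\n"
    )
    for target, reason in unavailable_targets(sample):
        print(target, "->", reason)
```

Healthy targets (lines with usage figures) are skipped, so the result lists only the OSTs the client cannot currently reach.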

This has happened on several machines. Rebooting them seems to cure it.

There are a large number of error messages in the client logs, e.g.:

Jul  7 18:22:14 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340150774596107 sent from lustre_0-OST0028-osc-ffff81021f55a400 to NID 10.1.4.121@tcp 21s ago has timed out (21s prior to deadline).
Jul  7 18:22:14 cn458 kernel:   req@ffff8100841ed000 x1340150774596107/t0 o8->lustre_0-OST0028_UUID@10.1.4.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1278523334 ref 2 fl Rpc:N/0/0 rc 0/0
Jul  7 18:22:14 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 52 previous similar messages
Jul  7 18:23:06 cn458 kernel: Lustre: 3816:0:(import.c:517:import_select_connection()) lustre_0-OST0004-osc-ffff81021f55a400: tried all connections, increasing latency to 19s
Jul  7 18:23:06 cn458 kernel: Lustre: 3816:0:(import.c:517:import_select_connection()) Skipped 58 previous similar messages
Jul  7 18:26:48 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340150774596722 sent from lustre_0-OST0028-osc-ffff81021f55a400 to NID 10.1.4.121@tcp 30s ago has timed out (30s prior to deadline).
Jul  7 18:26:48 cn458 kernel:   req@ffff8101e00d1800 x1340150774596722/t0 o8->lustre_0-OST0028_UUID@10.1.4.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1278523608 ref 2 fl Rpc:N/0/0 rc 0/0
Jul  7 18:26:48 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 95 previous similar messages
Jul  7 18:28:22 cn458 kernel: Lustre: 3816:0:(import.c:517:import_select_connection()) lustre_0-OST0028-osc-ffff81021f55a400: tried all connections, increasing latency to 25s
Jul  7 18:28:22 cn458 kernel: Lustre: 3816:0:(import.c:517:import_select_connection()) Skipped 84 previous similar messages
Jul  7 18:35:35 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340150774597865 sent from lustre_0-OST0028-osc-ffff81021f55a400 to NID 10.1.4.121@tcp 30s ago has timed out (30s prior to deadline).
Jul  7 18:35:35 cn458 kernel:   req@ffff8101d66d6800 x1340150774597865/t0 o8->lustre_0-OST0028_UUID@10.1.4.121@tcp:28/4 lens 368/584 e 0 to 1 dl 1278524135 ref 2 fl Rpc:N/0/0 rc 0/0
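
To see whether the timeouts all point at the same server, a rough sketch like the following could tally the "has timed out" messages per import and NID. It assumes each syslog entry sits on one line as in the raw log file, with the NID written as 10.1.4.121@tcp; the name count_timeouts is hypothetical and the pattern is based only on the messages above.

```python
import re
from collections import Counter

# Pattern derived from the ptlrpc_expire_one_request() messages above.
TIMED_OUT = re.compile(
    r"Request x(?P<xid>\d+) sent from (?P<imp>\S+) to NID (?P<nid>\S+) "
    r"(?P<age>\d+)s ago has timed out"
)


def count_timeouts(log_lines):
    """Count logged request timeouts per (import name, server NID).

    "Skipped N previous similar messages" lines are not added in,
    so the totals are a lower bound on the real number of timeouts.
    """
    counts = Counter()
    for line in log_lines:
        match = TIMED_OUT.search(line)
        if match:
            counts[(match.group("imp"), match.group("nid"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (log path is an example): python count_timeouts.py < /var/log/messages
    for (imp, nid), n in count_timeouts(sys.stdin).most_common():
        print(n, imp, nid)
```

If every entry names the same NID, that would suggest the problem lies with one OSS or the path to it rather than with the clients in general.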



Is there a known problem? What information would help debug this?

Chris

PS: The clients are on bonded 1GigE and the servers on 10GigE (if that makes a difference).


