[Lustre-discuss] Clients losing connection to an OSS.
Christopher J. Walker
C.J.Walker at qmul.ac.uk
Thu Jul 8 10:32:39 PDT 2010
With 1.8.3 clients and 1.8.3 OSSs, a couple of my nodes seem to have
lost their connection to an OSS. If I run lfs df, I get the following:
lustre_0-OST0028_UUID: Resource temporarily unavailable
lustre_0-OST0029_UUID: Resource temporarily unavailable
lustre_0-OST002a_UUID: Resource temporarily unavailable
lustre_0-OST002b_UUID: Resource temporarily unavailable
lustre_0-OST002c_UUID 6486115712 3882764932 2603348732 59% /mnt/lustre_0[OST:44]
lustre_0-OST002d_UUID 6486115712 3797895540 2688209196 58% /mnt/lustre_0[OST:45]
lustre_0-OST002e_UUID 6486115712 3717364684 2768740788 57% /mnt/lustre_0[OST:46]
lustre_0-OST002f_UUID 6486115712 3535928996 2950180572 54% /mnt/lustre_0[OST:47]
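To check many clients for the same symptom, the unavailable OSTs can be picked out of the lfs df output with a short script. The following is only a minimal sketch: the function name unavailable_osts and the line pattern are assumptions based on the output quoted above, not part of any Lustre tool.

```python
import re

# Assumed pattern, taken from the output quoted above, e.g.
#   lustre_0-OST0028_UUID: Resource temporarily unavailable
# The four hex digits after "OST" are the OST index.
UNAVAILABLE = re.compile(
    r"^(\S+-OST([0-9a-fA-F]{4})_UUID)\s*:\s*Resource temporarily unavailable$"
)


def unavailable_osts(lfs_df_output):
    """Return (uuid, index) pairs for OSTs that lfs df reports as unavailable."""
    found = []
    for line in lfs_df_output.splitlines():
        match = UNAVAILABLE.match(line.strip())
        if match:
            # Convert the hex index (e.g. "0028") to an integer (40).
            found.append((match.group(1), int(match.group(2), 16)))
    return found


SAMPLE = """\
lustre_0-OST0028_UUID: Resource temporarily unavailable
lustre_0-OST0029_UUID: Resource temporarily unavailable
lustre_0-OST002c_UUID 6486115712 3882764932 2603348732 59%
/mnt/lustre_0[OST:44]
"""

if __name__ == "__main__":
    for uuid, index in unavailable_osts(SAMPLE):
        print(uuid, index)
```

The text passed in would be whatever lfs df printed on the client; healthy OST rows and the wrapped mount-point lines are simply ignored.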
This has happened on several machines. Rebooting them seems to cure it.
There are a large number of error messages in the logs, for example:
Jul 7 18:22:14 cn458 kernel: Lustre:
3815:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
x1340150774596107 sent from lustre_0-OST0028-osc-ffff81021f55a400 to NID
10.1.4.121@tcp 21s ago has timed out (21s prior to deadline).
Jul 7 18:22:14 cn458 kernel: req@ffff8100841ed000
x1340150774596107/t0 o8->lustre_0-OST0028_UUID@10.1.4.121@tcp:28/4 lens
368/584 e 0 to 1 dl 1278523334 ref 2 fl Rpc:N/0/0 rc 0/0
Jul 7 18:22:14 cn458 kernel: Lustre:
3815:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 52 previous
similar messages
Jul 7 18:23:06 cn458 kernel: Lustre:
3816:0:(import.c:517:import_select_connection())
lustre_0-OST0004-osc-ffff81021f55a400: tried all connections, increasing
latency to 19s
Jul 7 18:23:06 cn458 kernel: Lustre:
3816:0:(import.c:517:import_select_connection()) Skipped 58 previous
similar messages
Jul 7 18:26:48 cn458 kernel: Lustre:
3815:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
x1340150774596722 sent from lustre_0-OST0028-osc-ffff81021f55a400 to NID
10.1.4.121@tcp 30s ago has timed out (30s prior to deadline).
Jul 7 18:26:48 cn458 kernel: req@ffff8101e00d1800
x1340150774596722/t0 o8->lustre_0-OST0028_UUID@10.1.4.121@tcp:28/4 lens
368/584 e 0 to 1 dl 1278523608 ref 2 fl Rpc:N/0/0 rc 0/0
Jul 7 18:26:48 cn458 kernel: Lustre:
3815:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 95 previous
similar messages
Jul 7 18:28:22 cn458 kernel: Lustre:
3816:0:(import.c:517:import_select_connection())
lustre_0-OST0028-osc-ffff81021f55a400: tried all connections, increasing
latency to 25s
Jul 7 18:28:22 cn458 kernel: Lustre:
3816:0:(import.c:517:import_select_connection()) Skipped 84 previous
similar messages
Jul 7 18:35:35 cn458 kernel: Lustre:
3815:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
x1340150774597865 sent from lustre_0-OST0028-osc-ffff81021f55a400 to NID
10.1.4.121@tcp 30s ago has timed out (30s prior to deadline).
Jul 7 18:35:35 cn458 kernel: req@ffff8101d66d6800
x1340150774597865/t0 o8->lustre_0-OST0028_UUID@10.1.4.121@tcp:28/4 lens
368/584 e 0 to 1 dl 1278524135 ref 2 fl Rpc:N/0/0 rc 0/0
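Since most of these messages are suppressed as "similar", it can help to tally the timeouts that do appear, per import and server NID. The following is a rough sketch only: the function name count_timeouts is hypothetical, and the regular expression is an assumption based on the ptlrpc_expire_one_request() lines quoted above. It accepts the NID written either as 10.1.4.121@tcp or with the archive's " at " substitution, and it copes with lines wrapped by the mailer.

```python
import re
from collections import Counter

# Assumed message shape, taken from the log excerpt above:
#   ... sent from <import> to NID <addr>@<net> <N>s ago has timed out ...
TIMEOUT = re.compile(
    r"sent from (\S+) to NID (\S+?)(?:@| at )(\S+) (\d+)s ago has timed out"
)


def count_timeouts(log_text):
    """Count timed-out requests per (import, NID) pair in a log excerpt."""
    # Collapse all whitespace so messages wrapped across lines match.
    flat = " ".join(log_text.split())
    counts = Counter()
    for import_name, addr, net, _seconds in TIMEOUT.findall(flat):
        counts[(import_name, f"{addr}@{net}")] += 1
    return counts


SAMPLE_LOG = """\
Jul 7 18:22:14 cn458 kernel: Lustre:
3815:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
x1340150774596107 sent from lustre_0-OST0028-osc-ffff81021f55a400 to NID
10.1.4.121 at tcp 21s ago has timed out (21s prior to deadline).
Jul 7 18:26:48 cn458 kernel: Lustre:
3815:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
x1340150774596722 sent from lustre_0-OST0028-osc-ffff81021f55a400 to NID
10.1.4.121@tcp 30s ago has timed out (30s prior to deadline).
"""

if __name__ == "__main__":
    for (import_name, nid), n in count_timeouts(SAMPLE_LOG).items():
        print(import_name, nid, n)
```

Note that this counts only the messages actually logged; the "Skipped N previous similar messages" lines are not added to the totals.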
Is there a known problem? What information would help debug this?
Chris
PS: The clients are on bonded 1GigE and the servers are on 10GigE (if that makes a difference).