[Lustre-discuss] Lustre client lockups

Kurt Dillen kdillen at gmail.com
Tue Nov 4 09:06:31 PST 2008


Hello all,

We have a serious problem with Lustre.  For the past few days we have
been seeing lockups on the client side.  Not all clients are
affected.

We are running kernel 2.6.16-54-0.2.5_lustre.1.6.4.3smp.

Statahead has already been disabled on the systems.

Some more information about the environment:

- The Lustre clients are all VMware virtual systems
- The Lustre farm also consists entirely of VMware virtual systems

The errors I see are the following:

LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e5dca000
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e519e000
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e4e0a000
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e86b1bc0
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e79fe5c0
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e70a88c0
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e7081280
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100e6d6d5c0
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225816920, 100s ago)  req@ffff8100e7e2ba00 x17940/t0
o4->lustre-OST0005_UUID@172.16.0.29@tcp:28 lens 384/352 ref 2 fl Rpc:/
0/0 rc 0/-22
Lustre: lustre-OST0005-osc-ffff8100e8551800: Connection to service
lustre-OST0005 via nid 172.16.0.29@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0005-osc-ffff8100e8551800: Connection restored to
service lustre-OST0005 using nid 172.16.0.29@tcp.
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225816924, 100s ago)  req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/0/0 rc 0/-22
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) Skipped
2 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection to service
lustre-MDT0000 via nid 172.16.0.22@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection restored to
service lustre-MDT0000 using nid 172.16.0.22@tcp.
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225816953, 100s ago)  req@ffff8100e7e2d800 x20560/t0
o4->lustre-OST0006_UUID@172.16.0.30@tcp:28 lens 384/352 ref 2 fl Rpc:/
0/0 rc 0/-22
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection to service
lustre-OST0006 via nid 172.16.0.30@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection restored to
service lustre-OST0006 using nid 172.16.0.30@tcp.
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817024, 100s ago)  req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection to service
lustre-MDT0000 via nid 172.16.0.22@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection restored to
service lustre-MDT0000 using nid 172.16.0.22@tcp.
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817053, 100s ago)  req@ffff8100e7e2d800 x20724/t0
o4->lustre-OST0006_UUID@172.16.0.30@tcp:28 lens 384/352 ref 2 fl Rpc:/
2/0 rc -11/-22
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection to service
lustre-OST0006 via nid 172.16.0.30@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection restored to
service lustre-OST0006 using nid 172.16.0.30@tcp.
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817124, 100s ago)  req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection to service
lustre-MDT0000 via nid 172.16.0.22@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection restored to
service lustre-MDT0000 using nid 172.16.0.22@tcp.
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817153, 100s ago)  req@ffff8100e7e2d800 x20767/t0
o4->lustre-OST0006_UUID@172.16.0.30@tcp:28 lens 384/352 ref 2 fl Rpc:/
2/0 rc -11/-22
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection to service
lustre-OST0006 via nid 172.16.0.30@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection restored to
service lustre-OST0006 using nid 172.16.0.30@tcp.
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817224, 100s ago)  req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection to service
lustre-MDT0000 via nid 172.16.0.22@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection restored to
service lustre-MDT0000 using nid 172.16.0.22@tcp.
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817324, 100s ago)  req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) Skipped
1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection to service
lustre-MDT0000 via nid 172.16.0.22@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: Skipped 1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection restored to
service lustre-MDT0000 using nid 172.16.0.22@tcp.
Lustre: Skipped 1 previous similar message
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817424, 100s ago)  req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) Skipped
1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection to service
lustre-MDT0000 via nid 172.16.0.22@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: Skipped 1 previous similar message
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection restored to
service lustre-MDT0000 using nid 172.16.0.22@tcp.
Lustre: Skipped 1 previous similar message
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817553, 100s ago)  req@ffff8100e7e2d800 x20952/t0
o4->lustre-OST0006_UUID@172.16.0.30@tcp:28 lens 384/352 ref 2 fl Rpc:/
2/0 rc -11/-22
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) Skipped
2 previous similar messages
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection to service
lustre-OST0006 via nid 172.16.0.30@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: Skipped 2 previous similar messages
Lustre: lustre-OST0006-osc-ffff8100e8551800: Connection restored to
service lustre-OST0006 using nid 172.16.0.30@tcp.
Lustre: Skipped 2 previous similar messages
LustreError: 3420:0:(events.c:134:client_bulk_callback()) event type
0, status -5, desc ffff8100efba6800
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225817824, 99s ago)  req@ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID@172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/2/0 rc -11/-22
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) Skipped
4 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection to service
lustre-MDT0000 via nid 172.16.0.22@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: Skipped 4 previous similar messages
Lustre: lustre-MDT0000-mdc-ffff8100e8551800: Connection restored to
service lustre-MDT0000 using nid 172.16.0.22@tcp.
Lustre: Skipped 4 previous similar messages
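
In case it helps anyone reading the log, here is a rough Python sketch
that tallies the request timeouts per target and converts the "sent at"
values to UTC.  It is not a Lustre tool; the names summarize_timeouts,
describe_errno and SAMPLE are made up for this example, and the regular
expression assumes the message layout shown above (it accepts both "@"
and the " at " substitution that some mail archives apply).  The errno
names are the standard Linux values for the codes seen in the log.

```python
import re
from collections import OrderedDict
from datetime import datetime, timezone

# Linux errno values seen in the log (status -5, rc -11/-22).
LINUX_ERRNO = {5: "EIO", 11: "EAGAIN", 22: "EINVAL"}

# Assumed layout of a ptlrpc_expire_one_request() message once the
# wrapped lines are joined; tolerant of "@" written as " at ".
TIMEOUT_RE = re.compile(
    r"timeout \(sent at (?P<sent>\d+), (?P<age>\d+)s ago\)"
    r".*?o(?P<opc>\d+)->(?P<target>[\w-]+)_UUID(?:@| at )"
    r"(?P<nid>[\d.]+@\w+)"
)


def describe_errno(code):
    """Map a negative kernel return code to its Linux errno name."""
    return LINUX_ERRNO.get(abs(code), "errno %d" % abs(code))


def summarize_timeouts(log_text):
    """Return {target: {"count", "nid", "opc", "first_sent"}}."""
    # Undo the mail client's line wrapping before matching.
    flat = " ".join(log_text.split())
    summary = OrderedDict()
    for m in TIMEOUT_RE.finditer(flat):
        sent = datetime.fromtimestamp(int(m.group("sent")), tz=timezone.utc)
        entry = summary.setdefault(
            m.group("target"),
            {"count": 0, "nid": m.group("nid"),
             "opc": int(m.group("opc")), "first_sent": sent},
        )
        entry["count"] += 1
    return summary


# Two messages copied from the log above, wrapping included.
SAMPLE = """\
LustreError: 3428:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225816920, 100s ago)  req at ffff8100e7e2ba00 x17940/t0
o4->lustre-OST0005_UUID at 172.16.0.29@tcp:28 lens 384/352 ref 2 fl Rpc:/
0/0 rc 0/-22
LustreError: 3602:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1225816924, 100s ago)  req at ffff8100e64b3a00 x19702/t0
o36->lustre-MDT0000_UUID at 172.16.0.22@tcp:12 lens 1544/296 ref 1 fl
Rpc:/0/0 rc 0/-22
"""

if __name__ == "__main__":
    for target, info in summarize_timeouts(SAMPLE).items():
        print(target, info["nid"], "opcode", info["opc"],
              "timeouts:", info["count"],
              "first sent:", info["first_sent"].isoformat())
    for code in (-5, -11, -22):
        print(code, describe_errno(code))
```

Feeding it the full output of dmesg or /var/log/messages instead of
SAMPLE should show which targets time out most often, which may help
narrow down whether one server or the network path is involved.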

Could somebody help me out?

Thanks in advance.

Kurt


