[Lustre-discuss] 1.6.4.1 - active client evicted

Niklas Edmundsson <Niklas.Edmundsson@hpc2n.umu.se>
Thu Jan 10 04:35:43 PST 2008


Hi!

We've started to poke and prod at Lustre 1.6.4.1, and it seems to
mostly work (we haven't had it oops on us yet like the earlier
1.6 versions did).

However, we had a weird incident where an active client (it was
copying 4 GB files and running ls at the time) got evicted by the MDS
and all of the OSTs. After a while the logs indicated that it had
recovered its connection to the MDT, but not to the OSTs ...

lctl ping between the client and the MDS/OSSes still worked.
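
For reference, that amounts to LNET pings of roughly this form (NIDs
taken from the logs below; I didn't keep the exact shell history, so
treat this as a reconstruction):

----------------------------8<------------------------
# Client-side LNET reachability checks; all of these returned without error.
lctl ping 130.239.78.233@tcp    # MGS/MDT
lctl ping 130.239.78.235@tcp    # OSS serving hpfs-OST0007
lctl ping 130.239.78.236@tcp    # OSS serving hpfs-OST0002
lctl ping 130.239.78.239@tcp    # OSS serving hpfs-OST0000
----------------------------8<------------------------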

After a while I got tired of waiting for it to recover and did
umount -f, which also hung.

So, what should happen in situations like this? What can I do to debug 
it? Has anyone else seen this?
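
In case it helps, here's roughly what I figure I can collect from the
client the next time this happens (a sketch, assuming the stock 1.6
lctl/lfs tools and /proc layout; corrections welcome):

----------------------------8<------------------------
# Which targets the client itself still considers active (one line per MDC/OSC):
lfs check servers

# OBD device list and states as the client sees them:
lctl dl
cat /proc/fs/lustre/devices

# Dump the kernel debug buffer covering the eviction window to a file:
lctl dk /tmp/lustre-debug.$(date +%s)
----------------------------8<------------------------

If there are specific debug flags worth enabling before trying to
reproduce this, please say so.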

130.239.78.238 is the client, 130.239.78.233 is the MGS/MDT.

Logs from the client:
----------------------------8<------------------------
Jan 10 12:40:38 LustreError: 11-0: an error occurred while communicating with 130.239.78.233 at tcp. The ldlm_enqueue operation failed with -107
Jan 10 12:40:38 LustreError: Skipped 1 previous similar message
Jan 10 12:40:38 Lustre: hpfs-MDT0000-mdc-ffff8100016d2c00: Connection to service hpfs-MDT0000 via nid 130.239.78.233 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 10 12:40:38 LustreError: 167-0: This client was evicted by hpfs-MDT0000; in progress operations using this service will fail.
Jan 10 12:40:38 LustreError: 7975:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
Jan 10 12:40:38 LustreError: 7975:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req at ffff8100c5298800 x649493/t0 o101->hpfs-MDT0000_UUID at 130.239.78.233@tcp:12 lens 432/912 ref 1 fl Rpc:/0/0 rc 0/0
Jan 10 12:40:38 LustreError: 7975:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
Jan 10 12:41:40 LustreError: 7979:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req at ffff81005ee12e00 x649521/t0 o101->hpfs-MDT0000_UUID at 130.239.78.233@tcp:12 lens 512/912 ref 1 fl Rpc:P/0/0 rc 0/0
Jan 10 12:41:40 LustreError: 7979:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
Jan 10 12:42:18 LustreError: 7976:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req at ffff810043ad4a00 x649532/t0 o35->hpfs-MDT0000_UUID at 130.239.78.233@tcp:12 lens 296/784 ref 1 fl Rpc:/0/0 rc 0/0
Jan 10 12:42:18 LustreError: 7976:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Jan 10 12:42:18 LustreError: 7976:0:(file.c:98:ll_close_inode_openhandle()) inode 8421378 mdc close failed: rc = -108
Jan 10 12:42:18 LustreError: 7976:0:(file.c:98:ll_close_inode_openhandle()) inode 8421383 mdc close failed: rc = -108
Jan 10 12:42:18 Lustre: hpfs-MDT0000-mdc-ffff8100016d2c00: Connection restored to service hpfs-MDT0000 using nid 130.239.78.233 at tcp.
Jan 10 12:42:48 LustreError: 11-0: an error occurred while communicating with 130.239.78.235 at tcp. The ost_setattr operation failed with -107
Jan 10 12:42:48 Lustre: hpfs-OST0007-osc-ffff8100016d2c00: Connection to service hpfs-OST0007 via nid 130.239.78.235 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 10 12:42:48 LustreError: 167-0: This client was evicted by hpfs-OST0007; in progress operations using this service will fail.
Jan 10 12:42:48 LustreError: 7999:0:(llite_lib.c:1480:ll_setattr_raw()) obd_setattr_async fails: rc=-107
Jan 10 12:42:48 LustreError: 7999:0:(llite_lib.c:1480:ll_setattr_raw()) obd_setattr_async fails: rc=-5
Jan 10 12:42:49 LustreError: 8001:0:(llite_lib.c:1480:ll_setattr_raw()) obd_setattr_async fails: rc=-5
Jan 10 12:42:50 LustreError: 11-0: an error occurred while communicating with 130.239.78.239 at tcp. The ost_setattr operation failed with -107
Jan 10 12:42:50 Lustre: hpfs-OST0000-osc-ffff8100016d2c00: Connection to service hpfs-OST0000 via nid 130.239.78.239 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 10 12:42:50 LustreError: 167-0: This client was evicted by hpfs-OST0000; in progress operations using this service will fail.
Jan 10 12:42:50 LustreError: 8002:0:(llite_lib.c:1480:ll_setattr_raw()) obd_setattr_async fails: rc=-107
Jan 10 12:42:50 LustreError: 8002:0:(llite_lib.c:1480:ll_setattr_raw()) Skipped 1 previous similar message
Jan 10 12:42:52 LustreError: 11-0: an error occurred while communicating with 130.239.78.236 at tcp. The ost_setattr operation failed with -107
Jan 10 12:42:52 Lustre: hpfs-OST0002-osc-ffff8100016d2c00: Connection to service hpfs-OST0002 via nid 130.239.78.236 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 10 12:42:52 LustreError: 167-0: This client was evicted by hpfs-OST0002; in progress operations using this service will fail.
Jan 10 12:50:41 LustreError: 11-0: an error occurred while communicating with 130.239.78.233 at tcp. The ldlm_enqueue operation failed with -107
Jan 10 12:50:41 Lustre: hpfs-MDT0000-mdc-ffff8100016d2c00: Connection to service hpfs-MDT0000 via nid 130.239.78.233 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 10 12:50:41 LustreError: 167-0: This client was evicted by hpfs-MDT0000; in progress operations using this service will fail.
Jan 10 12:50:41 LustreError: 8017:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
Jan 10 12:50:41 LustreError: 8017:0:(mdc_locks.c:424:mdc_finish_enqueue()) Skipped 1 previous similar message
Jan 10 12:50:41 LustreError: 8017:0:(file.c:2513:ll_inode_revalidate_fini()) failure -5 inode 8421377
Jan 10 12:50:41 LustreError: 8016:0:(file.c:2513:ll_inode_revalidate_fini()) failure -5 inode 8421377
Jan 10 12:50:56 LustreError: 8019:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req at ffff8100f650e800 x649712/t0 o101->hpfs-MDT0000_UUID at 130.239.78.233@tcp:12 lens 512/912 ref 1 fl Rpc:P/0/0 rc 0/0
Jan 10 12:50:56 LustreError: 8019:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Jan 10 12:50:57 LustreError: 8019:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
Jan 10 12:50:57 LustreError: 8019:0:(mdc_locks.c:424:mdc_finish_enqueue()) Skipped 1 previous similar message
Jan 10 12:50:59 LustreError: 8022:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req at ffff8100f650e800 x649715/t0 o101->hpfs-MDT0000_UUID at 130.239.78.233@tcp:12 lens 512/912 ref 1 fl Rpc:P/0/0 rc 0/0
Jan 10 12:50:59 LustreError: 8022:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
Jan 10 12:52:21 Lustre: hpfs-MDT0000-mdc-ffff8100016d2c00: Connection restored to service hpfs-MDT0000 using nid 130.239.78.233 at tcp.
Jan 10 12:57:46 Lustre: setting import hpfs-MDT0000_UUID INACTIVE by administrator request
Jan 10 12:59:26 Lustre: setting import hpfs-OST0000_UUID INACTIVE by administrator request
----------------------------8<------------------------

Logs from the MGS/MDT:
----------------------------8<------------------------
Jan 10 12:20:31 Lustre: MGS: haven't heard from client 01a8bcfc-fd98-90a9-6aeb-7c331a658b2e (at 130.239.78.238 at tcp) in 233 seconds. I think it's dead, and I am evicting it.
Jan 10 12:21:11 Lustre: hpfs-MDT0000: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238 at tcp) in 271 seconds. I think it's dead, and I am evicting it.
Jan 10 12:40:38 LustreError: 27332:0:(handler.c:1502:mds_handle()) operation 101 on unconnected MDS from 12345-130.239.78.238 at tcp
Jan 10 12:40:38 LustreError: 27332:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req at ffff8100e20ffe00 x649491/t0 o101-><?>@<?>:-1 lens 512/0 ref 0 fl Interpret:/0/0 rc -107/0
Jan 10 12:47:26 Lustre: hpfs-MDT0000: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238 at tcp) in 274 seconds. I think it's dead, and I am evicting it.
Jan 10 12:50:41 LustreError: 27322:0:(handler.c:1502:mds_handle()) operation 101 on unconnected MDS from 12345-130.239.78.238 at tcp
Jan 10 12:50:41 LustreError: 27322:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req at ffff8101fbdaf000 x649703/t0 o101-><?>@<?>:-1 lens 432/0 ref 0 fl Interpret:/0/0 rc -107/0
Jan 10 12:57:26 Lustre: hpfs-MDT0000: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238 at tcp) in 251 seconds. I think it's dead, and I am evicting it.
----------------------------8<------------------------

Logs from one of the OSSes (it serves two OSTs); the logs are similar
on all OSSes:
----------------------------8<------------------------
Jan 10 12:20:46 Lustre: hpfs-OST0002: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238 at tcp) in 246 seconds. I think it's dead, and I am evicting it.
Jan 10 12:20:56 Lustre: hpfs-OST0003: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238 at tcp) in 256 seconds. I think it's dead, and I am evicting it.
Jan 10 12:42:52 LustreError: 6665:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req at ffff810081af2400 x649565/t0 o2-><?>@<?>:-1 lens 336/0 ref 0 fl Interpret:/0/0 rc -107/0
Jan 10 12:47:01 Lustre: hpfs-OST0002: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238 at tcp) in 249 seconds. I think it's dead, and I am evicting it.
----------------------------8<------------------------


/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se     |    nikke@hpc2n.umu.se
---------------------------------------------------------------------------
  The magic of Windows - turn a 486-50 into a 4MHz XT...
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=