[Lustre-discuss] 1.6.4.1 - active client evicted

Niklas Edmundsson Niklas.Edmundsson at hpc2n.umu.se
Thu Jan 10 06:45:35 PST 2008


On Thu, 10 Jan 2008, Niklas Edmundsson wrote:

> However, we had this weird incident where an active client (it was
> copying 4GB files and running ls at the time) got evicted by the MDS
> and all OSTs. After a while the logs indicate that it recovered the
> connection to the MDT, but not to the OSTs ...
>
> lctl ping between the client and mds/oss's still worked.
>
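
FWIW, this is roughly how I checked connectivity; a minimal sketch, the
NID list is ours, adjust it for your site:

----------------------------8<------------------------
# ping every server NID over LNET from the evicted client
for nid in 130.239.78.233@tcp 130.239.78.235@tcp \
           130.239.78.236@tcp 130.239.78.239@tcp; do
    lctl ping $nid || echo "no LNET reply from $nid"
done
----------------------------8<------------------------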

> We tried umount -f, which also hung.

After I rebooted the client it recovered. However, I got this rather
suspicious log message on the MDS (with similar messages on the OSSes):
Jan 10 14:11:43 Lustre: 27208:0:(router.c:167:lnet_notify()) Ignoring prediction from 130.239.78.233@tcp of 130.239.78.238@tcp down 4829687047 seconds in the future
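
That interval can't be real; a quick back-of-the-envelope check:

----------------------------8<------------------------
# 4829687047 seconds expressed in years -- about 153, so the peer-down
# timestamp handed to lnet_notify() is clearly garbage, not a prediction
echo $((4829687047 / (3600 * 24 * 365)))
----------------------------8<------------------------

My guess is that the "last alive" time passed to lnet_notify() is
uninitialised or in the wrong units, but I haven't dug into the code.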



> So, what should happen in situations like this? What can I do to debug
> it? Has anyone else seen this?
>
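
Next time it happens I intend to capture a debug log on the client.
A sketch; the paths are for 1.6.x, double-check them on your setup:

----------------------------8<------------------------
# widen the Lustre/LNET debug mask to everything
echo -1 > /proc/sys/lnet/debug
# drop whatever is already in the debug ring buffer
lctl clear
# ... reproduce the eviction/hang ...
# dump the debug ring buffer to a file for inspection
lctl dk /tmp/lustre-client.dk
----------------------------8<------------------------
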
> 130.239.78.238 is the client, 130.239.78.233 is the MGS/MDT.
>
> Logs from the client:
> ----------------------------8<------------------------
> Jan 10 12:40:38 LustreError: 11-0: an error occurred while communicating with 130.239.78.233@tcp. The ldlm_enqueue operation failed with -107
> Jan 10 12:40:38 LustreError: Skipped 1 previous similar message
> Jan 10 12:40:38 Lustre: hpfs-MDT0000-mdc-ffff8100016d2c00: Connection to service hpfs-MDT0000 via nid 130.239.78.233@tcp was lost; in progress operations using this service will wait for recovery to complete.
> Jan 10 12:40:38 LustreError: 167-0: This client was evicted by hpfs-MDT0000; in progress operations using this service will fail.
> Jan 10 12:40:38 LustreError: 7975:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
> Jan 10 12:40:38 LustreError: 7975:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8100c5298800 x649493/t0 o101->hpfs-MDT0000_UUID@130.239.78.233@tcp:12 lens 432/912 ref 1 fl Rpc:/0/0 rc 0/0
> Jan 10 12:40:38 LustreError: 7975:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
> Jan 10 12:41:40 LustreError: 7979:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81005ee12e00 x649521/t0 o101->hpfs-MDT0000_UUID@130.239.78.233@tcp:12 lens 512/912 ref 1 fl Rpc:P/0/0 rc 0/0
> Jan 10 12:41:40 LustreError: 7979:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
> Jan 10 12:42:18 LustreError: 7976:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff810043ad4a00 x649532/t0 o35->hpfs-MDT0000_UUID@130.239.78.233@tcp:12 lens 296/784 ref 1 fl Rpc:/0/0 rc 0/0
> Jan 10 12:42:18 LustreError: 7976:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 1 previous similar message
> Jan 10 12:42:18 LustreError: 7976:0:(file.c:98:ll_close_inode_openhandle()) inode 8421378 mdc close failed: rc = -108
> Jan 10 12:42:18 LustreError: 7976:0:(file.c:98:ll_close_inode_openhandle()) inode 8421383 mdc close failed: rc = -108
> Jan 10 12:42:18 Lustre: hpfs-MDT0000-mdc-ffff8100016d2c00: Connection restored to service hpfs-MDT0000 using nid 130.239.78.233@tcp.
> Jan 10 12:42:48 LustreError: 11-0: an error occurred while communicating with 130.239.78.235@tcp. The ost_setattr operation failed with -107
> Jan 10 12:42:48 Lustre: hpfs-OST0007-osc-ffff8100016d2c00: Connection to service hpfs-OST0007 via nid 130.239.78.235@tcp was lost; in progress operations using this service will wait for recovery to complete.
> Jan 10 12:42:48 LustreError: 167-0: This client was evicted by hpfs-OST0007; in progress operations using this service will fail.
> Jan 10 12:42:48 LustreError: 7999:0:(llite_lib.c:1480:ll_setattr_raw()) obd_setattr_async fails: rc=-107
> Jan 10 12:42:48 LustreError: 7999:0:(llite_lib.c:1480:ll_setattr_raw()) obd_setattr_async fails: rc=-5
> Jan 10 12:42:49 LustreError: 8001:0:(llite_lib.c:1480:ll_setattr_raw()) obd_setattr_async fails: rc=-5
> Jan 10 12:42:50 LustreError: 11-0: an error occurred while communicating with 130.239.78.239@tcp. The ost_setattr operation failed with -107
> Jan 10 12:42:50 Lustre: hpfs-OST0000-osc-ffff8100016d2c00: Connection to service hpfs-OST0000 via nid 130.239.78.239@tcp was lost; in progress operations using this service will wait for recovery to complete.
> Jan 10 12:42:50 LustreError: 167-0: This client was evicted by hpfs-OST0000; in progress operations using this service will fail.
> Jan 10 12:42:50 LustreError: 8002:0:(llite_lib.c:1480:ll_setattr_raw()) obd_setattr_async fails: rc=-107
> Jan 10 12:42:50 LustreError: 8002:0:(llite_lib.c:1480:ll_setattr_raw()) Skipped 1 previous similar message
> Jan 10 12:42:52 LustreError: 11-0: an error occurred while communicating with 130.239.78.236@tcp. The ost_setattr operation failed with -107
> Jan 10 12:42:52 Lustre: hpfs-OST0002-osc-ffff8100016d2c00: Connection to service hpfs-OST0002 via nid 130.239.78.236@tcp was lost; in progress operations using this service will wait for recovery to complete.
> Jan 10 12:42:52 LustreError: 167-0: This client was evicted by hpfs-OST0002; in progress operations using this service will fail.
> Jan 10 12:50:41 LustreError: 11-0: an error occurred while communicating with 130.239.78.233@tcp. The ldlm_enqueue operation failed with -107
> Jan 10 12:50:41 Lustre: hpfs-MDT0000-mdc-ffff8100016d2c00: Connection to service hpfs-MDT0000 via nid 130.239.78.233@tcp was lost; in progress operations using this service will wait for recovery to complete.
> Jan 10 12:50:41 LustreError: 167-0: This client was evicted by hpfs-MDT0000; in progress operations using this service will fail.
> Jan 10 12:50:41 LustreError: 8017:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
> Jan 10 12:50:41 LustreError: 8017:0:(mdc_locks.c:424:mdc_finish_enqueue()) Skipped 1 previous similar message
> Jan 10 12:50:41 LustreError: 8017:0:(file.c:2513:ll_inode_revalidate_fini()) failure -5 inode 8421377
> Jan 10 12:50:41 LustreError: 8016:0:(file.c:2513:ll_inode_revalidate_fini()) failure -5 inode 8421377
> Jan 10 12:50:56 LustreError: 8019:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8100f650e800 x649712/t0 o101->hpfs-MDT0000_UUID@130.239.78.233@tcp:12 lens 512/912 ref 1 fl Rpc:P/0/0 rc 0/0
> Jan 10 12:50:56 LustreError: 8019:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 1 previous similar message
> Jan 10 12:50:57 LustreError: 8019:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
> Jan 10 12:50:57 LustreError: 8019:0:(mdc_locks.c:424:mdc_finish_enqueue()) Skipped 1 previous similar message
> Jan 10 12:50:59 LustreError: 8022:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff8100f650e800 x649715/t0 o101->hpfs-MDT0000_UUID@130.239.78.233@tcp:12 lens 512/912 ref 1 fl Rpc:P/0/0 rc 0/0
> Jan 10 12:50:59 LustreError: 8022:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
> Jan 10 12:52:21 Lustre: hpfs-MDT0000-mdc-ffff8100016d2c00: Connection restored to service hpfs-MDT0000 using nid 130.239.78.233@tcp.
> Jan 10 12:57:46 Lustre: setting import hpfs-MDT0000_UUID INACTIVE by administrator request
> Jan 10 12:59:26 Lustre: setting import hpfs-OST0000_UUID INACTIVE by administrator request
> ----------------------------8<------------------------
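
The two INACTIVE lines at the end are presumably from my manual attempts
to get rid of the wedged imports before the umount -f. In principle the
reverse should also work; a sketch, the device number is hypothetical,
take the real one from lctl dl:

----------------------------8<------------------------
# list OBD devices to find the wedged MDC/OSC import
lctl dl
# reactivate it by device number (7 is just a placeholder)
lctl --device 7 activate
----------------------------8<------------------------

I never got to verify whether that unwedges a stuck import, since I
ended up rebooting the box.
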
>
> Logs from the MGS/MDT:
> ----------------------------8<------------------------
> Jan 10 12:20:31 Lustre: MGS: haven't heard from client 01a8bcfc-fd98-90a9-6aeb-7c331a658b2e (at 130.239.78.238@tcp) in 233 seconds. I think it's dead, and I am evicting it.
> Jan 10 12:21:11 Lustre: hpfs-MDT0000: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238@tcp) in 271 seconds. I think it's dead, and I am evicting it.
> Jan 10 12:40:38 LustreError: 27332:0:(handler.c:1502:mds_handle()) operation 101 on unconnected MDS from 12345-130.239.78.238@tcp
> Jan 10 12:40:38 LustreError: 27332:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req@ffff8100e20ffe00 x649491/t0 o101-><?>@<?>:-1 lens 512/0 ref 0 fl Interpret:/0/0 rc -107/0
> Jan 10 12:47:26 Lustre: hpfs-MDT0000: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238@tcp) in 274 seconds. I think it's dead, and I am evicting it.
> Jan 10 12:50:41 LustreError: 27322:0:(handler.c:1502:mds_handle()) operation 101 on unconnected MDS from 12345-130.239.78.238@tcp
> Jan 10 12:50:41 LustreError: 27322:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req@ffff8101fbdaf000 x649703/t0 o101-><?>@<?>:-1 lens 432/0 ref 0 fl Interpret:/0/0 rc -107/0
> Jan 10 12:57:26 Lustre: hpfs-MDT0000: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238@tcp) in 251 seconds. I think it's dead, and I am evicting it.
> ----------------------------8<------------------------
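
As an aside, the 233-274 second "haven't heard from client" windows are
tied to obd_timeout. If this turns out to be a transient network problem
rather than a Lustre bug, raising the timeout might mask it; a sketch
for 1.6.x, the value 300 is arbitrary, verify the proc path first:

----------------------------8<------------------------
# show the current obd_timeout on a server node
cat /proc/sys/lustre/timeout
# bump it (runtime only; pick a value suited to your network)
echo 300 > /proc/sys/lustre/timeout
----------------------------8<------------------------
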
>
> Logs from one of the OSSes (it serves two OSTs). The logs are similar
> on all OSSes:
> ----------------------------8<------------------------
> Jan 10 12:20:46 Lustre: hpfs-OST0002: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238@tcp) in 246 seconds. I think it's dead, and I am evicting it.
> Jan 10 12:20:56 Lustre: hpfs-OST0003: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238@tcp) in 256 seconds. I think it's dead, and I am evicting it.
> Jan 10 12:42:52 LustreError: 6665:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107)  req@ffff810081af2400 x649565/t0 o2-><?>@<?>:-1 lens 336/0 ref 0 fl Interpret:/0/0 rc -107/0
> Jan 10 12:47:01 Lustre: hpfs-OST0002: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238@tcp) in 249 seconds. I think it's dead, and I am evicting it.
> ----------------------------8<------------------------
>
>
> /Nikke
>


/Nikke
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
  Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se     |    nikke at hpc2n.umu.se
---------------------------------------------------------------------------
  Jimmy Hoffa is buried here ------> X
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=



