[lustre-discuss] Repeatable ldlm_enqueue error

Raj rajgautam at gmail.com
Wed Oct 30 19:19:41 PDT 2019


Raj,
Just eyeballing your logs from the server and the client, the timestamps look
different: the client messages are stamped around 14:56 while the server
messages are stamped around 14:59. Are the clocks out of sync? It is important
for both the clients and the servers to have the same time.
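
As a quick sanity check, something like the following (a rough sketch,
assuming you can ssh to the nodes; n305 and mds2 are the hostnames from your
logs, and chrony is just one option, use ntpq -p if you run ntpd instead):

    # print each node's clock with sub-second resolution, back to back
    for h in n305 mds2; do ssh $h 'echo "$(hostname): $(date +%s.%N)"'; done

    # confirm that each node is actually synchronized to an NTP source
    ssh n305 chronyc tracking
    ssh mds2 chronyc tracking

If the clocks really are minutes apart, getting NTP working on all the Lustre
clients and servers would be the first thing to fix before digging further.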

On Wed, Oct 30, 2019 at 3:37 PM Raj Ayyampalayam <ansraj at gmail.com> wrote:

> Hello,
>
> A particular job (MPI Maker genome annotation) on our cluster produces the
> errors below and fails with a "Could not open file" error.
> Server: The server is running lustre-2.10.4.
> Client: I've tried 2.10.5, 2.10.8, and 2.12.3 with the same result.
> I don't see any other servers (other MDS and OSS nodes) reporting
> communication loss to the client. The IB fabric is stable. The job runs to
> completion when using local storage on the node or NFS-mounted storage.
> The job creates a lot of I/O, but it does not increase the load on the
> Lustre servers.
>
> Client:
> Oct 22 14:56:39 n305 kernel: LustreError: 11-0:
> lustre2-MDT0000-mdc-ffff8c3f222c4800: operation ldlm_enqueue to node
> 10.55.49.215@o2ib failed: rc = -107
> Oct 22 14:56:39 n305 kernel: Lustre: lustre2-MDT0000-mdc-ffff8c3f222c4800:
> Connection to lustre2-MDT0000 (at 10.55.49.215@o2ib) was lost; in
> progress operations using this service will wait for recovery to complete
> Oct 22 14:56:39 n305 kernel: Lustre: Skipped 2 previous similar messages
> Oct 22 14:56:39 n305 kernel: LustreError: 167-0:
> lustre2-MDT0000-mdc-ffff8c3f222c4800: This client was evicted by
> lustre2-MDT0000; in progress operations using this service will fail.
> Oct 22 14:56:39 n305 kernel: LustreError:
> 125851:0:(file.c:172:ll_close_inode_openhandle())
> lustre2-clilmv-ffff8c3f222c4800: inode [0x20000ef38:0xffd6:0x0] mdc close
> failed: rc = -108
> Oct 22 14:56:39 n305 kernel: LustreError: Skipped 1 previous similar
> message
> Oct 22 14:56:40 n305 kernel: LustreError:
> 125959:0:(file.c:3644:ll_inode_revalidate_fini()) lustre2: revalidate FID
> [0x20000eedf:0xed9d:0x0] error: rc = -108
> Oct 22 14:56:40 n305 kernel: LustreError:
> 125665:0:(vvp_io.c:1474:vvp_io_init()) lustre2: refresh file layout
> [0x20000ef38:0xff55:0x0] error -108.
> Oct 22 14:56:40 n305 kernel: LustreError:
> 125883:0:(ldlm_resource.c:1100:ldlm_resource_complain())
> lustre2-MDT0000-mdc-ffff8c3f222c4800: namespace resource
> [0x20000ef38:0xff55:0x0].0x0 (ffff8bdc6823c9c0) refcount nonzero (1) after
> lock cleanup; forcing cleanup.
> Oct 22 14:56:40 n305 kernel: LustreError:
> 125883:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource:
> [0x20000ef38:0xff55:0x0].0x0 (ffff8bdc6823c9c0) refcount = 1
> Oct 22 14:56:40 n305 kernel: Lustre: lustre2-MDT0000-mdc-ffff8c3f222c4800:
> Connection restored to 10.55.49.215@o2ib (at 10.55.49.215@o2ib)
> Oct 22 14:56:40 n305 kernel: Lustre: Skipped 1 previous similar message
> Oct 22 14:56:40 n305 kernel: LustreError:
> 125959:0:(file.c:3644:ll_inode_revalidate_fini()) Skipped 2 previous
> similar messages
>
> Server:
> mds2-eno1: Oct 22 14:59:36 mds2 kernel: LustreError:
> 7182:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid
> 10.55.14.49@o2ib) failed to reply to blocking AST (req@ffff881b0e68b900
> x1635734905828112 status 0 rc -110), evict it ns: mdt-lustre2-MDT0000_UUID
> lock: ffff88187ec45e00/0x121438a5db957b5 lrc: 4/0,0 mode: PR/PR res:
> [0x20000ef38:0xffec:0x0].0x0 bits 0x20 rrc: 4 type: IBT flags:
> 0x60200400000020 nid: 10.55.14.49@o2ib remote: 0x3154abaef2786884 expref:
> 72083 pid: 7182 timeout: 16143455124 lvb_type: 0
> mds2-eno1: Oct 22 14:59:36 mds2 kernel: LustreError: 138-a:
> lustre2-MDT0000: A client on nid 10.55.14.49@o2ib was evicted due to a
> lock blocking callback time out: rc -110
> mds2-eno1: Oct 22 14:59:36 mds2 kernel: Lustre: lustre2-MDT0000:
> Connection restored to 3b42ec33-0885-6b7f-6575-9b200c4b6f55 (at
> 10.55.14.49@o2ib)
> mds2-eno1: Oct 22 14:59:37 mds2 kernel: LustreError:
> 8936:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
> req@ffff881b0e68b900 x1635734905828176/t0(0)
> o104->lustre2-MDT0000@10.55.14.49@o2ib:15/16 lens 296/224 e 0 to 0 dl 0
> ref 1 fl Rpc:/0/ffffffff rc 0/-1
>
>
> Can anyone point me in the right direction on how to debug this issue?
>
> Thanks,
> -Raj
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>