[lustre-discuss] Repeatable ldlm_enqueue error

Raj Ayyampalayam ansraj at gmail.com
Thu Oct 31 06:47:12 PDT 2019


I had the same thought, and I checked all the nodes; they all showed
exactly the same time.
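
For reference, the check was along these lines (this assumes chrony and
pdsh are available; the node list is illustrative):

  # report each node's offset from its NTP source
  pdsh -w n[001-400],mds[1-2] 'chronyc tracking | grep "System time"'

  # cruder check: print every node's wall clock at roughly the same instant
  pdsh -w n[001-400],mds[1-2] 'date +%s.%N'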

Raj

On Wed, Oct 30, 2019, 10:19 PM Raj <rajgautam at gmail.com> wrote:

> Raj,
> Just eyeballing your logs from the server and the client, it looks like
> they show different times. Are the clocks out of sync? It is important
> for the clients and the servers to have the same time.
>
> On Wed, Oct 30, 2019 at 3:37 PM Raj Ayyampalayam <ansraj at gmail.com> wrote:
>
>> Hello,
>>
>> A particular job (MPI Maker genome annotation) on our cluster produces
>> the errors below, and the job fails with a "Could not open file" error.
>> Server: the servers are running lustre-2.10.4.
>> Client: I've tried 2.10.5, 2.10.8, and 2.12.3, all with the same result.
>> None of the other servers (the other MDS and the OSS nodes) report
>> losing communication with the client, and the IB fabric is stable. The
>> job runs to completion when using local storage on the node or an
>> NFS-mounted filesystem. The job generates a lot of I/O, but it does not
>> increase the load on the Lustre servers.
>>
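>> If it is useful, this is how I have been checking connectivity and the
>> MDC import state from the client (a sketch; the device names are from
>> our setup, adjust to yours):
>>
>>   # LNet-level reachability to the MDS
>>   lctl ping 10.55.49.215@o2ib
>>
>>   # import state of the MDT connection (should be FULL when healthy)
>>   lctl get_param mdc.lustre2-MDT0000-mdc-*.import
>>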
>> Client:
>> Oct 22 14:56:39 n305 kernel: LustreError: 11-0:
>> lustre2-MDT0000-mdc-ffff8c3f222c4800: operation ldlm_enqueue to node
>> 10.55.49.215@o2ib failed: rc = -107
>> Oct 22 14:56:39 n305 kernel: Lustre:
>> lustre2-MDT0000-mdc-ffff8c3f222c4800: Connection to lustre2-MDT0000 (at
>> 10.55.49.215@o2ib) was lost; in progress operations using this service
>> will wait for recovery to complete
>> Oct 22 14:56:39 n305 kernel: Lustre: Skipped 2 previous similar messages
>> Oct 22 14:56:39 n305 kernel: LustreError: 167-0:
>> lustre2-MDT0000-mdc-ffff8c3f222c4800: This client was evicted by
>> lustre2-MDT0000; in progress operations using this service will fail.
>> Oct 22 14:56:39 n305 kernel: LustreError:
>> 125851:0:(file.c:172:ll_close_inode_openhandle())
>> lustre2-clilmv-ffff8c3f222c4800: inode [0x20000ef38:0xffd6:0x0] mdc close
>> failed: rc = -108
>> Oct 22 14:56:39 n305 kernel: LustreError: Skipped 1 previous similar
>> message
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125959:0:(file.c:3644:ll_inode_revalidate_fini()) lustre2: revalidate FID
>> [0x20000eedf:0xed9d:0x0] error: rc = -108
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125665:0:(vvp_io.c:1474:vvp_io_init()) lustre2: refresh file layout
>> [0x20000ef38:0xff55:0x0] error -108.
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125883:0:(ldlm_resource.c:1100:ldlm_resource_complain())
>> lustre2-MDT0000-mdc-ffff8c3f222c4800: namespace resource
>> [0x20000ef38:0xff55:0x0].0x0 (ffff8bdc6823c9c0) refcount nonzero (1) after
>> lock cleanup; forcing cleanup.
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125883:0:(ldlm_resource.c:1682:ldlm_resource_dump()) --- Resource:
>> [0x20000ef38:0xff55:0x0].0x0 (ffff8bdc6823c9c0) refcount = 1
>> Oct 22 14:56:40 n305 kernel: Lustre:
>> lustre2-MDT0000-mdc-ffff8c3f222c4800: Connection restored to
>> 10.55.49.215@o2ib (at 10.55.49.215@o2ib)
>> Oct 22 14:56:40 n305 kernel: Lustre: Skipped 1 previous similar message
>> Oct 22 14:56:40 n305 kernel: LustreError:
>> 125959:0:(file.c:3644:ll_inode_revalidate_fini()) Skipped 2 previous
>> similar messages
>>
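>> The return codes above map to standard errnos (-107 is ENOTCONN, -108
>> is ESHUTDOWN), i.e. fallout from the eviction rather than a root cause.
>> I can capture more debug data if that would help; I was planning
>> something along these lines on the client (the flag choice is my guess
>> at what is relevant):
>>
>>   lctl set_param debug=+dlmtrace
>>   lctl set_param debug=+rpctrace
>>   # ... reproduce the failure, then dump the kernel debug buffer
>>   lctl dk /tmp/lustre-debug.log
>>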
>> Server:
>> mds2-eno1: Oct 22 14:59:36 mds2 kernel: LustreError:
>> 7182:0:(ldlm_lockd.c:697:ldlm_handle_ast_error()) ### client (nid
>> 10.55.14.49@o2ib) failed to reply to blocking AST (req@ffff881b0e68b900
>> x1635734905828112 status 0 rc -110), evict it ns: mdt-lustre2-MDT0000_UUID
>> lock: ffff88187ec45e00/0x121438a5db957b5 lrc: 4/0,0 mode: PR/PR res:
>> [0x20000ef38:0xffec:0x0].0x0 bits 0x20 rrc: 4 type: IBT flags:
>> 0x60200400000020 nid: 10.55.14.49@o2ib remote: 0x3154abaef2786884
>> expref: 72083 pid: 7182 timeout: 16143455124 lvb_type: 0
>> mds2-eno1: Oct 22 14:59:36 mds2 kernel: LustreError: 138-a:
>> lustre2-MDT0000: A client on nid 10.55.14.49@o2ib was evicted due to a
>> lock blocking callback time out: rc -110
>> mds2-eno1: Oct 22 14:59:36 mds2 kernel: Lustre: lustre2-MDT0000:
>> Connection restored to 3b42ec33-0885-6b7f-6575-9b200c4b6f55 (at
>> 10.55.14.49@o2ib)
>> mds2-eno1: Oct 22 14:59:37 mds2 kernel: LustreError:
>> 8936:0:(client.c:1166:ptlrpc_import_delay_req()) @@@ IMP_CLOSED
>> req@ffff881b0e68b900 x1635734905828176/t0(0)
>> o104->lustre2-MDT0000@10.55.14.49@o2ib:15/16 lens 296/224 e 0 to 0 dl 0
>> ref 1 fl Rpc:/0/ffffffff rc 0/-1
>>
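>> If I am reading this right, the -110 is ETIMEDOUT: the client did not
>> answer the blocking AST before the lock callback timer expired, so the
>> MDS evicted it. I believe the timeouts in play can be read on the MDS
>> with something like this (parameter names from memory):
>>
>>   lctl get_param timeout
>>   lctl get_param at_max at_min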
>>
>> Can anyone point me in the right direction on how to debug this issue?
>>
>> Thanks,
>> -Raj