[lustre-discuss] questions about group locks / LDLM_FL_NO_TIMEOUT flag

Bertschinger, Thomas Andrew Hjorth bertschinger at lanl.gov
Fri Sep 1 06:47:32 PDT 2023


Thanks for that information! That command for listing clients with a file open will definitely be useful.

- Thomas Bertschinger
________________________________________
From: Andreas Dilger <adilger at whamcloud.com>
Sent: Wednesday, August 30, 2023 8:02 PM
To: Bertschinger, Thomas Andrew Hjorth
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] questions about group locks / LDLM_FL_NO_TIMEOUT flag

You can't directly dump the holders of a particular lock, but it is possible to dump the list of FIDs that each client has open.

  mds# lctl get_param mdt.*.exports.*.open_files | egrep "=|FID" | grep -B1 FID

That should list all client NIDs that have the given FID open.
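The command above can be scripted. Below is a minimal Python sketch that maps a target FID to the client NIDs holding it open; the parameter-line and FID-line formats it assumes are illustrative, so check them against the actual `open_files` output on your Lustre version:

```python
import re

def clients_holding_fid(lctl_output: str, target_fid: str) -> list[str]:
    """Return the client NIDs that hold target_fid open.

    Parses text of the shape produced by
    `lctl get_param mdt.*.exports.*.open_files`. The exact line
    formats below are assumptions -- adjust the regex if your
    Lustre version prints them differently.
    """
    clients = []
    current_nid = None
    for line in lctl_output.splitlines():
        # Parameter lines name the export, e.g.
        #   mdt.lustre-MDT0000.exports.10.0.0.1@tcp.open_files=
        m = re.match(r"mdt\..*\.exports\.(.+)\.open_files=", line.strip())
        if m:
            current_nid = m.group(1)
            continue
        # FID lines are assumed to look like [0x200000401:0x1:0x0]
        if current_nid and line.strip() == target_fid:
            clients.append(current_nid)
    return clients
```

Feeding it captured `lctl` output and the FID of a hung file yields the candidate clients to inspect or evict.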

It shouldn't be possible for clients to "leak" a group lock: group locks are tied to an open file handle and are dropped as soon as the file is closed, or released by the kernel, which closes any open file descriptors when the process is killed.

Cheers, Andreas

> On Aug 30, 2023, at 07:42, Bertschinger, Thomas Andrew Hjorth via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
>
> Hello,
>
> We have a few files created by a particular application where reads to those files consistently hang. The debug log on a client attempting a read() has messages like:
>
>> ldlm_completion_ast(): waiting indefinitely because of NO_TIMEOUT ...
>
> This is printed when the flag LDLM_FL_NO_TIMEOUT is true, and code comments above that flag imply that it is set for group locks. So, we've been trying to identify if the application in question uses group locks. (I have reached out to the app's developers but do not have a response yet.)
>
> If I open the file with O_NONBLOCK, any reads immediately return with error 11 / EWOULDBLOCK. This behavior is documented to occur for Lustre group locks.
>
> However, I would like to clarify whether the LDLM_FL_NO_TIMEOUT flag is true *only* when a group lock is held, or are there other circumstances where the behavior described above could occur?
>
> If this is caused by a group lock, is there an easy way to tell from server-side logs or data which client(s) hold the group lock and are blocking access? The motivation is that we believe any jobs accessing these files have long since been killed, so no nodes from those jobs should still be holding the files open. We would like to confirm or rule out that possibility by easily identifying any such clients.
>
> Advice on how to effectively debug LDLM issues would be useful beyond just this problem. In general, if there is a reliable way to start from a log entry for a lock like
>
>> ... ns: lustre-OST0000-osc-ffff9a0942c79800 lock: 000000003f3a5950/0xe54ca8d2d7b66d03 lrc: 4/1,0 mode: --/PR  ...
>
> and get information about the client(s) holding that lock and any contending locks, that would be helpful in debugging situations like this.
>
> server: 2.15.2
> client that application ran on: 2.15.0.4_rc2_cray_172_ge66844d
> client that I tested file access from: 2.15.2
>
> Thanks!
>
> - Thomas Bertschinger
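
A note on the O_NONBLOCK probe described in the quoted message: error 11 (EWOULDBLOCK, equal to EAGAIN on Linux) is standard POSIX non-blocking semantics, not something Lustre-specific. The sketch below reproduces the same error path generically with a FIFO instead of a group-locked Lustre file (an assumption made purely for illustration, since a FIFO with a connected writer but no data is the simplest way to make a non-blocking read() fail this way):

```python
import errno
import os
import tempfile

# A FIFO with a writer attached but no data queued makes a non-blocking
# read() fail with errno 11 (EAGAIN == EWOULDBLOCK on Linux), the same
# errno a read on a group-locked Lustre file returns under O_NONBLOCK.
fifo = os.path.join(tempfile.mkdtemp(), "demo.fifo")
os.mkfifo(fifo)

rfd = os.open(fifo, os.O_RDONLY | os.O_NONBLOCK)
wfd = os.open(fifo, os.O_WRONLY | os.O_NONBLOCK)  # keep a writer attached

got_ewouldblock = False
try:
    os.read(rfd, 4096)  # no data available -> cannot complete immediately
except OSError as e:
    got_ewouldblock = (e.errno == errno.EWOULDBLOCK)

print("read failed with EWOULDBLOCK:", got_ewouldblock)

os.close(rfd)
os.close(wfd)
os.remove(fifo)
```

Because the errno alone only says "would block", the open_files listing above is still needed to tell a group lock apart from other causes.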
