[lustre-discuss] unkillable process using llapi_layout_file_open
John Bauer
bauerj at iodoctors.com
Wed Apr 9 16:44:11 PDT 2025
Andreas,
Thanks for the quick reply. The client version is 2.14.0_ddn173. The
server version is also target_version: 2.14.0.173. This originally
started as the result of user input error that requested an OST that
does not exist. For my simple test case I request an OST that does not
exist, and probably never will exist. This issue is on Pleiades at
NAS/NASA, which doesn't change very much. I doubt that this is related to
an OST or MDT that may have been recently added.
The admins are checking on LU-17334.
The admins also noticed thousands of error messages:
[root@r593i4n16 ~]# dmesg -T |grep LustreError
[Wed Apr 9 15:36:22 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:23 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:23 2025] LustreError: Skipped 1709 previous similar messages
[Wed Apr 9 15:36:24 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:24 2025] LustreError: Skipped 3491 previous similar messages
[Wed Apr 9 15:36:26 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:26 2025] LustreError: Skipped 7803 previous similar messages
[Wed Apr 9 15:36:30 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:30 2025] LustreError: Skipped 14891 previous similar messages
[Wed Apr 9 15:36:38 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:38 2025] LustreError: Skipped 29887 previous similar messages
[Wed Apr 9 15:36:54 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:54 2025] LustreError: Skipped 63032 previous similar messages
[Wed Apr 9 15:37:26 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:37:26 2025] LustreError: Skipped 120772 previous similar messages
[Wed Apr 9 15:38:30 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:38:30 2025] LustreError: Skipped 238498 previous similar messages
[Wed Apr 9 15:40:38 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:40:38 2025] LustreError: Skipped 515538 previous similar messages
[Wed Apr 9 15:44:54 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:44:54 2025] LustreError: Skipped 1040417 previous similar messages
[root@r593i4n16 ~]#
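
To put a number on how often the enqueue is failing, the "Skipped N previous
similar messages" counts can be added up. The following is only a rough
sketch: the helper name summarize() and the regular expression are
illustrative and not part of any Lustre tool, and it assumes the
"dmesg -T" line format shown in the output above. The sample text is
abridged from that output (first and last full error lines, plus all of the
"Skipped" lines). Note that rc = -19 corresponds to -ENODEV on Linux.

```python
import errno
import re
from datetime import datetime

# Abridged from the dmesg output above: first and last full error lines,
# plus every "Skipped N previous similar messages" line.
SAMPLE = """\
[Wed Apr 9 15:36:22 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:36:23 2025] LustreError: Skipped 1709 previous similar messages
[Wed Apr 9 15:36:24 2025] LustreError: Skipped 3491 previous similar messages
[Wed Apr 9 15:36:26 2025] LustreError: Skipped 7803 previous similar messages
[Wed Apr 9 15:36:30 2025] LustreError: Skipped 14891 previous similar messages
[Wed Apr 9 15:36:38 2025] LustreError: Skipped 29887 previous similar messages
[Wed Apr 9 15:36:54 2025] LustreError: Skipped 63032 previous similar messages
[Wed Apr 9 15:37:26 2025] LustreError: Skipped 120772 previous similar messages
[Wed Apr 9 15:38:30 2025] LustreError: Skipped 238498 previous similar messages
[Wed Apr 9 15:40:38 2025] LustreError: Skipped 515538 previous similar messages
[Wed Apr 9 15:44:54 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr 9 15:44:54 2025] LustreError: Skipped 1040417 previous similar messages
"""

# Hypothetical pattern for "dmesg -T" LustreError lines. Group 1 is the
# timestamp without the weekday; group 2 is the skipped count, if any.
LINE_RE = re.compile(
    r"\[\w{3} (\w{3}\s+\d+ \d{2}:\d{2}:\d{2} \d{4})\] LustreError: "
    r"(?:Skipped (\d+) previous similar)?"
)


def summarize(text):
    """Return (printed_errors, skipped_errors, elapsed_seconds)."""
    printed = 0
    skipped = 0
    stamps = []
    for match in LINE_RE.finditer(text):
        # Collapse runs of spaces (dmesg pads single-digit days).
        stamp = " ".join(match.group(1).split())
        stamps.append(datetime.strptime(stamp, "%b %d %H:%M:%S %Y"))
        if match.group(2):
            skipped += int(match.group(2))
        else:
            printed += 1
    elapsed = (max(stamps) - min(stamps)).total_seconds() if stamps else 0.0
    return printed, skipped, elapsed


if __name__ == "__main__":
    printed, skipped, elapsed = summarize(SAMPLE)
    print(f"rc = -19 is -{errno.errorcode[19]}")
    print(f"printed: {printed}, skipped: {skipped}, seconds: {elapsed:.0f}")
    if elapsed:
        print(f"approx. failures per second: {skipped / elapsed:.0f}")
```

On the sample this adds up to 2036038 skipped messages over 512 seconds, or
roughly 4000 failed enqueues per second. That is closer to millions than
thousands, and it would be consistent with the process spinning in a retry
loop in the kernel, though the log alone does not prove that.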
John
On 4/9/2025 4:58 PM, Andreas Dilger wrote:
> On Apr 9, 2025, at 14:28, John Bauer via lustre-discuss
> <lustre-discuss at lists.lustre.org> wrote:
>>
>> I have created a small reproducer program (81 lines of code) that
>> results in a process that appears to hang in the kernel, accumulating
>> cpu time. The process is unresponsive to kill commands. From gdb
>> backtrace, it appears the call is stuck somewhere in fsetxattr()
>> which is called by llapi_layout_file_open(). The problem happens
>> only when a non-existent ost is added to the layout with a call to
>> llapi_layout_ost_index_set(). The call to llapi_layout_sanity(),
>> just before calling llapi_layout_file_open(), returns 0. Is this a
>> known issue?
>>
> Hard to say for sure.
>
> I suspect this is related to LU-17334, which relates to newly-added
> MDTs and OSTs in the filesystem. There were a few patches which
> recently landed in 2.16.0 (and backported) that will sleep and retry
> for a short time to handle the case where a client accesses a file or
> directory layout that references an OST or MDT that it doesn't know
> about. The assumption is that the OST/MDT is newly added and the
> configuration update hasn't quite made it to the client yet. The
> client should retry to contact the new server for some time before
> giving up and returning an error (in case the layout is actually bad).
>
> Whether this is fixed in your version depends on what the version is
> (not mentioned in your email). It may also be important what the
> server version is, which can be seen from "lctl get_param mdc.*.import
> | grep target_version", if you can access this parameter. If your
> client & server versions have the LU-17334 fixes, then this would be
> unexpected, and if older versions then I'd say it is something I'd
> rather not revisit until the known fixes are in place.
>
> Cheers, Andreas
> —
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud/DDN