[lustre-discuss] unkillable process using llapi_layout_file_open

John Bauer bauerj at iodoctors.com
Wed Apr 9 16:44:11 PDT 2025


Andreas,

Thanks for the quick reply.  The client version is 2.14.0_ddn173.  The 
server reports the same release, target_version: 2.14.0.173.  This 
originally started as the result of a user input error that requested 
an OST that does not exist.  My simple test case likewise requests an 
OST that does not exist, and probably never will.  This issue is on 
Pleiades at NAS/NASA, whose configuration doesn't change very much, so 
I doubt that this is related to an OST or MDT that was recently added.
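
Until the hang itself is understood, one possible workaround is to 
reject a bad index in user space before it ever reaches 
llapi_layout_ost_index_set().  The Python sketch below assumes the OST 
list comes from "lfs osts" output with lines shaped like 
"0: fsname-OST0000_UUID ACTIVE".  The helper names, the sample text and 
the "testfs" filesystem name are hypothetical and are not taken from 
the reproducer.

```python
import re
from typing import Set

# Hypothetical sample, assumed to resemble `lfs osts` output.
SAMPLE_LFS_OSTS = """OBDS:
0: testfs-OST0000_UUID ACTIVE
1: testfs-OST0001_UUID ACTIVE
2: testfs-OST0002_UUID INACTIVE
"""

# Matches "<index>: <fsname>-OSTxxxx_UUID ACTIVE" (assumed line format).
_OST_LINE = re.compile(r"\s*(\d+):\s+\S+-OST[0-9a-fA-F]{4}_UUID\s+ACTIVE\b")


def parse_ost_indices(lfs_osts_output: str) -> Set[int]:
    """Return the indices of OSTs listed as ACTIVE in the given text."""
    indices = set()
    for line in lfs_osts_output.splitlines():
        match = _OST_LINE.match(line)
        if match:
            indices.add(int(match.group(1)))
    return indices


def check_requested_ost(requested: int, known: Set[int]) -> None:
    """Raise ValueError if the requested OST index is not a known one."""
    if requested not in known:
        raise ValueError(
            f"OST index {requested} is not among the known OSTs {sorted(known)}"
        )
```

A caller would run check_requested_ost() on each user-supplied index 
and only build the layout if no exception is raised.  This only guards 
against input errors; it does not address the hang itself.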

The admins are checking on LU-17334.

The admins also noticed thousands of error messages like these:

[root@r593i4n16 ~]# dmesg -T |grep LustreError
[Wed Apr  9 15:36:22 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr  9 15:36:23 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr  9 15:36:23 2025] LustreError: Skipped 1709 previous similar messages
[Wed Apr  9 15:36:24 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr  9 15:36:24 2025] LustreError: Skipped 3491 previous similar messages
[Wed Apr  9 15:36:26 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr  9 15:36:26 2025] LustreError: Skipped 7803 previous similar messages
[Wed Apr  9 15:36:30 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr  9 15:36:30 2025] LustreError: Skipped 14891 previous similar messages
[Wed Apr  9 15:36:38 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr  9 15:36:38 2025] LustreError: Skipped 29887 previous similar messages
[Wed Apr  9 15:36:54 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr  9 15:36:54 2025] LustreError: Skipped 63032 previous similar messages
[Wed Apr  9 15:37:26 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr  9 15:37:26 2025] LustreError: Skipped 120772 previous similar messages
[Wed Apr  9 15:38:30 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr  9 15:38:30 2025] LustreError: Skipped 238498 previous similar messages
[Wed Apr  9 15:40:38 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr  9 15:40:38 2025] LustreError: Skipped 515538 previous similar messages
[Wed Apr  9 15:44:54 2025] LustreError: 11-0: nbp17-MDT0000-mdc-ffff963283f77000: operation ldlm_enqueue to node 10.151.27.142@o2ib failed: rc = -19
[Wed Apr  9 15:44:54 2025] LustreError: Skipped 1040417 previous similar messages
[root@r593i4n16 ~]#
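
For anyone reading along: on Linux, rc = -19 is -ENODEV ("No such 
device").  The "Skipped N previous similar messages" lines mean that 
the real number of failed enqueues is far larger than the number of 
lines printed.  A minimal Python sketch for totalling them from saved 
dmesg text follows.  The function name and the shortened sample are 
made up for illustration; only the two message patterns come from the 
log above.

```python
import errno
import re

# Shortened, hypothetical sample in the same shape as the dmesg lines above.
SAMPLE_DMESG = """\
[Wed Apr  9 15:36:22 2025] LustreError: 11-0: fs-MDT0000-mdc-x: operation ldlm_enqueue to node N failed: rc = -19
[Wed Apr  9 15:36:23 2025] LustreError: 11-0: fs-MDT0000-mdc-x: operation ldlm_enqueue to node N failed: rc = -19
[Wed Apr  9 15:36:23 2025] LustreError: Skipped 1709 previous similar messages
"""


def summarize_lustre_errors(dmesg_text):
    """Count printed failures, total the skipped ones, and name the errnos."""
    # Return codes are logged as negative errno values, e.g. "rc = -19".
    rcs = [int(rc) for rc in re.findall(r"failed: rc = (-?\d+)", dmesg_text)]
    # The kernel rate-limits repeats and reports how many it suppressed.
    skipped = [
        int(n) for n in re.findall(r"Skipped (\d+) previous similar", dmesg_text)
    ]
    names = sorted({errno.errorcode.get(-rc, "UNKNOWN") for rc in rcs})
    return {"printed": len(rcs), "skipped": sum(skipped), "errno_names": names}


if __name__ == "__main__":
    print(summarize_lustre_errors(SAMPLE_DMESG))
```

Feeding it the output of "dmesg -T | grep LustreError" gives a rough 
count of how often the client retried the enqueue while the process 
was stuck.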

John

On 4/9/2025 4:58 PM, Andreas Dilger wrote:
> On Apr 9, 2025, at 14:28, John Bauer via lustre-discuss 
> <lustre-discuss at lists.lustre.org> wrote:
>>
>> I have created a small reproducer program (81 lines of code) that 
>> results in a process that appears to hang in the kernel, accumulating 
>> CPU time.  The process is unresponsive to kill commands.  From a gdb 
>> backtrace, it appears the call is stuck somewhere in fsetxattr(), 
>> which is called by llapi_layout_file_open().  The problem happens 
>> only when a non-existent OST is added to the layout with a call to 
>> llapi_layout_ost_index_set().  The call to llapi_layout_sanity(), 
>> just before calling llapi_layout_file_open(), returns 0.  Is this a 
>> known issue?
>>
> Hard to say for sure.
>
> I suspect this is related to LU-17334, which relates to newly-added 
> MDTs and OSTs in the filesystem. There were a few patches which 
> recently landed in 2.16.0 (and backported) that will sleep and retry 
> for a short time to handle the case where a client accesses a file or 
> directory layout that references an OST or MDT that it doesn't know 
> about.  The assumption is that the OST/MDT is newly added and the 
> configuration update hasn't quite made it to the client yet.  The 
> client should retry to contact the new server for some time before 
> giving up and returning an error (in case the layout is actually bad).
>
> Whether this is fixed in your version depends on what the version is 
> (not mentioned in your email).  It may also be important what the 
> server version is, which can be seen from "lctl get_param mdc.*.import 
> | grep target_version", if you can access this parameter.  If your 
> client & server versions have the LU-17334 fixes, then this would be 
> unexpected, and if older versions then I'd say it is something I'd 
> rather not revisit until the known fixes are in place.
>
> Cheers, Andreas
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud/DDN
>
>
>
>