[lustre-discuss] [EXTERNAL] client failing off network

Michael DiDomenico mdidomenico4 at gmail.com
Mon Nov 17 10:40:52 PST 2025


I managed to get a stack trace from a different machine.
interestingly, this one is not on the newer kernel/lustre versions.
having no proof in hand, here's what i suspect is going on:

podman asks lustre for a file that includes extended attrs
the lustre client asks the filesystem for the data
the data is returned to the kernel
the data is scanned by trellix epo
the trellix kernel process silently crashes for some reason (probably
because it doesn't handle lustre or xattrs very well, if at all)
the lustre client hangs
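
to poke at this outside podman, here's a minimal probe i'd try (my own
sketch, not something from the traces; the /lustre/xattr_test path is a
placeholder).  the stack shows the task stuck under __vfs_getxattr
reading the security.capability xattr during a newfstatat, so a plain
stat()/getxattr()/listxattr() against a file on the lustre mount should
walk the same ll_xattr_* cache path:

/* xattr_probe.c - exercise the stat/xattr path seen in the traces.
 * build: gcc -o xattr_probe xattr_probe.c
 * usage: ./xattr_probe /lustre/xattr_test   (path is a placeholder)
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/xattr.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/lustre/xattr_test";
    char buf[256];
    struct stat st;
    ssize_t n;

    /* newfstatat() is the syscall podman was stuck in */
    if (stat(path, &st) != 0)
        fprintf(stderr, "stat: %s\n", strerror(errno));
    else
        printf("stat ok, size=%lld\n", (long long)st.st_size);

    /* security.capability is what get_vfs_caps_from_disk reads via __vfs_getxattr */
    n = getxattr(path, "security.capability", buf, sizeof(buf));
    if (n < 0)
        fprintf(stderr, "getxattr(security.capability): %s\n", strerror(errno));
    else
        printf("getxattr ok, %zd bytes\n", n);

    /* list all xattrs, similar to the ll_xattr_list frame in the trace */
    n = listxattr(path, buf, sizeof(buf));
    if (n < 0)
        fprintf(stderr, "listxattr: %s\n", strerror(errno));
    else
        printf("listxattr ok, %zd bytes\n", n);

    return 0;
}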

bear in mind i have no proof other than that we've seen issues with
trellix before.  i suspect this issue has lurked for a long time, but is
only now showing itself because podman makes use of extended attrs and
locks.  (none of our users knowingly do)

i haven't been able to run printk on the original machine; we have a
conference going on, so i can't muck with the machine at all.  we
unmounted lustre for the time being to get through it, and we'll circle
back afterwards
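
in the meantime, a quick way to confirm whether the trellix kernel
modules are loaded on a particular client is to scan /proc/modules for
the mfe_* names that show up in the second trace below (a hypothetical
little helper i put together for this thread, not an official tool):

/* mfe_check.c - list any mfe_* (trellix/mcafee) kernel modules loaded */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("/proc/modules", "r");
    char line[512];
    int found = 0;

    if (!fp) {
        perror("/proc/modules");
        return 1;
    }
    while (fgets(line, sizeof(line), fp)) {
        /* e.g. mfe_aac_1007193773, mfe_fileaccess_1007193773 */
        if (strncmp(line, "mfe_", 4) == 0) {
            fputs(line, stdout);
            found = 1;
        }
    }
    fclose(fp);
    if (!found)
        puts("no mfe_* modules loaded");
    return 0;
}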

this could be a red herring too, just fyi...

[Thu Nov 13 13:19:48 2025] INFO: task podman:79754 blocked for more than 122 seconds.
[Thu Nov 13 13:19:48 2025]       Tainted: P        W  OE     ------- ---  5.14.0-503.14.1.el9_5.x86_64 #1
[Thu Nov 13 13:19:48 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Nov 13 13:19:48 2025] task:podman          state:D stack:0 pid:79754 tgid:79754 ppid:59232  flags:0x00000006
[Thu Nov 13 13:19:48 2025] Call Trace:
[Thu Nov 13 13:19:48 2025]  <TASK>
[Thu Nov 13 13:19:48 2025]  __schedule+0x229/0x550
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  schedule+0x2e/0xd0
[Thu Nov 13 13:19:48 2025]  schedule_preempt_disabled+0x11/0x20
[Thu Nov 13 13:19:48 2025]  __mutex_lock.constprop.0+0x433/0x6a0
[Thu Nov 13 13:19:48 2025]  ? ___slab_alloc+0x626/0x7a0
[Thu Nov 13 13:19:48 2025]  ll_xattr_find_get_lock+0x6c/0x490 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_cache_refill+0xb6/0xb80 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_cache_get+0x286/0x4b0 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_list+0x3c5/0x7e0 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_get_common+0x184/0x4a0 [lustre]
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  __vfs_getxattr+0x50/0x70
[Thu Nov 13 13:19:48 2025]  get_vfs_caps_from_disk+0x70/0x210
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? __legitimize_path+0x27/0x60
[Thu Nov 13 13:19:48 2025]  audit_copy_inode+0x99/0xd0
[Thu Nov 13 13:19:48 2025]  filename_lookup+0x17b/0x1d0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? audit_filter_rules.constprop.0+0x2c5/0xd30
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? path_get+0x11/0x30
[Thu Nov 13 13:19:48 2025]  vfs_statx+0x8d/0x170
[Thu Nov 13 13:19:48 2025]  vfs_fstatat+0x54/0x70
[Thu Nov 13 13:19:48 2025]  __do_sys_newfstatat+0x26/0x60
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? auditd_test_task+0x3c/0x50
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? __audit_syscall_entry+0xef/0x140
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? syscall_trace_enter.constprop.0+0x126/0x1a0
[Thu Nov 13 13:19:48 2025]  do_syscall_64+0x5c/0xf0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? __count_memcg_events+0x4f/0xb0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? mm_account_fault+0x6c/0x100
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? handle_mm_fault+0x116/0x270
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? do_user_addr_fault+0x1d6/0x6a0
[Thu Nov 13 13:19:48 2025]  ? syscall_exit_to_user_mode+0x19/0x40
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? exc_page_fault+0x62/0x150
[Thu Nov 13 13:19:48 2025]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[Thu Nov 13 13:19:48 2025] RIP: 0033:0x4137ce
[Thu Nov 13 13:19:48 2025] RSP: 002b:000000c0004e0710 EFLAGS: 00000216 ORIG_RAX: 0000000000000106
[Thu Nov 13 13:19:48 2025] RAX: ffffffffffffffda RBX: ffffffffffffff9c RCX: 00000000004137ce
[Thu Nov 13 13:19:48 2025] RDX: 000000c0001321d8 RSI: 000000c0001b0120 RDI: ffffffffffffff9c
[Thu Nov 13 13:19:48 2025] RBP: 000000c0004e0750 R08: 0000000000000000 R09: 0000000000000000
[Thu Nov 13 13:19:48 2025] R10: 0000000000000100 R11: 0000000000000216 R12: 000000c0001b0120
[Thu Nov 13 13:19:48 2025] R13: 0000000000000155 R14: 000000c000002380 R15: 000000c0001321a0
[Thu Nov 13 13:19:48 2025]  </TASK>

[Thu Nov 13 13:19:48 2025] INFO: task (ostnamed):79810 blocked for more than 122 seconds.
[Thu Nov 13 13:19:48 2025]       Tainted: P        W  OE     ------- ---  5.14.0-503.14.1.el9_5.x86_64 #1
[Thu Nov 13 13:19:48 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Nov 13 13:19:48 2025] task:(ostnamed)      state:D stack:0 pid:79810 tgid:79810 ppid:1      flags:0x00000006
[Thu Nov 13 13:19:48 2025] Call Trace:
[Thu Nov 13 13:19:48 2025]  <TASK>
[Thu Nov 13 13:19:48 2025]  __schedule+0x229/0x550
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  schedule+0x2e/0xd0
[Thu Nov 13 13:19:48 2025]  schedule_preempt_disabled+0x11/0x20
[Thu Nov 13 13:19:48 2025]  __mutex_lock.constprop.0+0x433/0x6a0
[Thu Nov 13 13:19:48 2025]  ? ___slab_alloc+0x626/0x7a0
[Thu Nov 13 13:19:48 2025]  ll_xattr_find_get_lock+0x6c/0x490 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_cache_refill+0xb6/0xb80 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_cache_get+0x286/0x4b0 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_list+0x3c5/0x7e0 [lustre]
[Thu Nov 13 13:19:48 2025]  ll_xattr_get_common+0x184/0x4a0 [lustre]
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  __vfs_getxattr+0x50/0x70
[Thu Nov 13 13:19:48 2025]  get_vfs_caps_from_disk+0x70/0x210
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? __legitimize_path+0x27/0x60
[Thu Nov 13 13:19:48 2025]  audit_copy_inode+0x99/0xd0
[Thu Nov 13 13:19:48 2025]  filename_lookup+0x17b/0x1d0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? path_get+0x11/0x30
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? audit_alloc_name+0x138/0x150
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  kern_path+0x2e/0x50
[Thu Nov 13 13:19:48 2025]  mfe_aac_extract_path+0x77/0xe0 [mfe_aac_1007193773]
[Thu Nov 13 13:19:48 2025]  mfe_aac_sys_openat_64_bit+0x114/0x320 [mfe_aac_1007193773]
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? _copy_to_iter+0x17c/0x5f0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? auditd_test_task+0x3c/0x50
[Thu Nov 13 13:19:48 2025]  ? mfe_fileaccess_sys_openat_64_bit+0x2f/0x1f0 [mfe_fileaccess_1007193773]
[Thu Nov 13 13:19:48 2025]  mfe_fileaccess_sys_openat_64_bit+0x2f/0x1f0 [mfe_fileaccess_1007193773]
[Thu Nov 13 13:19:48 2025]  do_syscall_64+0x5c/0xf0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? syscall_exit_work+0x103/0x130
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? syscall_exit_to_user_mode+0x19/0x40
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? do_syscall_64+0x6b/0xf0
[Thu Nov 13 13:19:48 2025]  ? audit_reset_context.part.0.constprop.0+0xe5/0x2e0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? free_to_partial_list+0x80/0x280
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? mntput_no_expire+0x4a/0x250
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? audit_reset_context.part.0.constprop.0+0x273/0x2e0
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? syscall_exit_work+0x103/0x130
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? syscall_exit_to_user_mode+0x19/0x40
[Thu Nov 13 13:19:48 2025]  ? srso_alias_return_thunk+0x5/0xfbef5
[Thu Nov 13 13:19:48 2025]  ? do_syscall_64+0x6b/0xf0
[Thu Nov 13 13:19:48 2025]  ? sysvec_apic_timer_interrupt+0x3c/0x90
[Thu Nov 13 13:19:48 2025]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[Thu Nov 13 13:19:48 2025] RIP: 0033:0x7fb4f3efdc54
[Thu Nov 13 13:19:48 2025] RSP: 002b:00007ffc61c44c90 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
[Thu Nov 13 13:19:48 2025] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb4f3efdc54
[Thu Nov 13 13:19:48 2025] RDX: 00000000002a0000 RSI: 00005599df0a79a0 RDI: 00000000ffffff9c
[Thu Nov 13 13:19:48 2025] RBP: 00005599df0a79a0 R08: 0000000000000000 R09: 0000000000000000
[Thu Nov 13 13:19:48 2025] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000002a0000
[Thu Nov 13 13:19:48 2025] R13: 0000000000000000 R14: 0000000000001c27 R15: 00005599defce360
[Thu Nov 13 13:19:48 2025]  </TASK>


On Fri, Oct 31, 2025 at 2:42 PM John Hearns <hearnsj at gmail.com> wrote:
>
> For information,  arpwatch  can be used to alert on duplicated addresses
>
> https://en.wikipedia.org/wiki/Arpwatch
>
> On Fri, 31 Oct 2025 at 13:13, Michael DiDomenico via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
>>
>> unfortunately i don't think so.  we're pretty good about assigning
>> addresses, but still human.  i don't see any evidence of a dup'd
>> address, but i'll keep looking
>>
>> thanks
>>
>> On Thu, Oct 30, 2025 at 8:10 PM Mohr, Rick <mohrrf at ornl.gov> wrote:
>> >
>> > Michael,
>> >
>> > It might be a long shot, but is there any chance another machine has the same IP address as the one having problems?
>> >
>> > --Rick
>> >
>> >
>> >
>> > On 10/30/25, 3:09 PM, "lustre-discuss on behalf of Michael DiDomenico via lustre-discuss" wrote:
>> > our network is running 2.15.6 everywhere on rhel9.5.  we recently built a new machine using 2.15.7 on rhel9.6 and i'm seeing a strange problem.  the client is ethernet connected to ten lnet routers which bridge ethernet to infiniband.  i can mount the client just fine and read/write data, but then several hours later the client marks all the routers offline.  the only recovery is to lazy unmount, lustre_rmmod, and then restart the lustre mount.  nothing unusual comes out in the journal/dmesg logs.  to lustre it "looks" like someone pulled the network cable, but there's no evidence that this happened physically or even at the switch/software layers.  we upgraded two other machines to see if the problem replicates, but so far it hasn't.  the only significant difference between the three machines is that the one with the problem has heavy container (podman) usage; the others have zero.  i'm not sure if this is a cause or just a red herring.  any suggestions?
>> >
>> >
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

