[lustre-discuss] Kernel oops with lustre 2.15.6 on rocky 9.5 kernel 5.14.0-503.22.1.el9_5.x86_64
Andreas Dilger
adilger at ddn.com
Wed Feb 12 19:43:18 PST 2025
Even better would be to apply the patch locally on top of 2.15.6, which would allow keeping el9.5 and also confirm that this patch actually fixes the problem.
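
For illustration, a minimal sketch of what applying the fix locally could look like, assuming the patch is the Gerrit change linked from LU-18085 (the refs/changes path below is a placeholder; take the real one from the review linked in the ticket):

    # Build a patched 2.15.6 client from the Whamcloud tree.
    git clone git://git.whamcloud.com/fs/lustre-release.git
    cd lustre-release
    git checkout 2.15.6
    # Fetch and cherry-pick the LU-18085 fix from Gerrit
    # (substitute the actual refs/changes/... path from the ticket).
    git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/NN/NNNNN/N
    git cherry-pick FETCH_HEAD
    sh autogen.sh
    ./configure --disable-server
    make rpms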
Cheers, Andreas
> On Feb 12, 2025, at 17:31, Oleg Drokin <green at whamcloud.com> wrote:
>
> On Wed, 2025-02-12 at 21:26 +0000, Walter, Eric wrote:
>>
>> Hello,
>>
>>
>> We have recently upgraded a cluster to Rocky 9.5 (kernel version
>> 5.14.0-503.22.1.el9_5.x86_64). After upgrading to the lustre-2.15.6
>> client, we are seeing repeated kernel oopses / crashes when jobs
>> read or write to both of our Lustre filesystems, after about 3-4
>> hours of running. It is repeatable and results in a kernel oops
>> referencing the ldlm process of Lustre. Only our clients on Rocky
>> 9.5 are affected; no other systems are having issues.
>
> The first hit for ll_prune_negative_children on Jira leads to this
> ticket, which links to the fix:
> https://jira.whamcloud.com/browse/LU-18085
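
To confirm that other crashed clients are hitting the same path, one quick check against the saved kdump output (assuming the default /var/crash layout) might be:

    # List every saved crash whose dmesg faulted in ll_prune_negative_children.
    grep -l 'RIP: .*ll_prune_negative_children' /var/crash/*/vmcore-dmesg.txt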
>
>>
>>
>> We would normally mount over o2ib (we upgraded to Mellanox driver
>> version 24.10-1.1.4.0 for Rocky 9.5); however, our tests still
>> result in the same ldlm kernel oops when mounted over tcp.
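
(For anyone reproducing this, a sketch of the two mount variants; the MGS NIDs and filesystem name below are hypothetical:)

    # Normal mount over InfiniBand (o2ib LND):
    mount -t lustre 10.0.0.1@o2ib:/fsname /mnt/lustre
    # Same filesystem mounted over TCP for comparison:
    mount -t lustre 192.168.1.1@tcp:/fsname /mnt/lustre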
>>
>>
>> The oops-related output from vmcore-dmesg.txt is posted below.
>>
>>
>> I have looked for various known issues with 2.15.6 and can't find
>> anyone else reporting this. Any ideas on what to do besides
>> downgrading to Rocky 9.4? Has anyone else seen such a problem with
>> 9.5 and clients using v2.15.6?
>>
>>
>> [ 6267.182434] BUG: kernel NULL pointer dereference, address: 0000000000000004
>> [ 6267.182441] #PF: supervisor write access in kernel mode
>> [ 6267.182443] #PF: error_code(0x0002) - not-present page
>> [ 6267.182444] PGD 1924d7067 P4D 134554067 PUD 10ac05067 PMD 0
>> [ 6267.182449] Oops: 0002 [#1] PREEMPT SMP NOPTI
>> [ 6267.182451] CPU: 15 PID: 3599 Comm: ldlm_bl_04 Kdump: loaded Tainted: G OE ------- --- 5.14.0-503.22.1.el9_5.x86_64 #1
>> [ 6267.182454] Hardware name: Dell Inc. PowerEdge R6625/0NWPW3, BIOS 1.5.8 07/21/2023
>> [ 6267.182455] RIP: 0010:ll_prune_negative_children+0x9d/0x250 [lustre]
>> [ 6267.182483] Code: 00 00 48 85 ed 74 46 48 81 ed 98 00 00 00 74 3d 48 83 7d 30 00 75 e4 4c 8d 7d 60 4c 89 ff e8 da 20 fb cf 48 8b 85 80 00 00 00 <80> 48 04 01 8b 45 64 85 c0 0f 84 ae 00 00 00 4c 89 ff e8 ac 21 fb
>> [ 6267.182485] RSP: 0018:ff75eed96a0c7c90 EFLAGS: 00010246
>> [ 6267.182487] RAX: 0000000000000000 RBX: ff28db3ed37d92c0 RCX: 0000000000000000
>> [ 6267.182488] RDX: 0000000000000001 RSI: ff28db0fdb1e00b0 RDI: ff28db0fc22c9860
>> [ 6267.182489] RBP: ff28db0fc22c9800 R08: 0000000000000000 R09: ffffffa1dd3f0088
>> [ 6267.182489] R10: ff28db3ec76f5c00 R11: 000000000005eee0 R12: ff28db3ed37d9320
>> [ 6267.182490] R13: ff28db3ece52d528 R14: ff28db3ece52d4a0 R15: ff28db0fc22c9860
>> [ 6267.182491] FS: 0000000000000000(0000) GS:ff28db3dfebc0000(0000) knlGS:0000000000000000
>> [ 6267.182493] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 6267.182494] CR2: 0000000000000004 CR3: 0000000138eec006 CR4: 0000000000771ef0
>> [ 6267.182495] PKRU: 55555554
>> [ 6267.182495] Call Trace:
>> [ 6267.182499] <TASK>
>> [ 6267.182500] ? srso_alias_return_thunk+0x5/0xfbef5
>> [ 6267.182506] ? show_trace_log_lvl+0x26e/0x2df
>> [ 6267.182513] ? show_trace_log_lvl+0x26e/0x2df
>> [ 6267.182517] ? ll_lock_cancel_bits+0x73a/0x760 [lustre]
>> [ 6267.182535] ? __die_body.cold+0x8/0xd
>> [ 6267.182538] ? page_fault_oops+0x134/0x170
>> [ 6267.182542] ? srso_alias_return_thunk+0x5/0xfbef5
>> [ 6267.182545] ? exc_page_fault+0x62/0x150
>> [ 6267.182549] ? asm_exc_page_fault+0x22/0x30
>> [ 6267.182553] ? ll_prune_negative_children+0x9d/0x250 [lustre]
>> [ 6267.182570] ll_lock_cancel_bits+0x73a/0x760 [lustre]
>> [ 6267.182588] ll_md_blocking_ast+0x1a3/0x300 [lustre]
>> [ 6267.182606] ldlm_cancel_callback+0x7a/0x290 [ptlrpc]
>> [ 6267.182639] ? srso_alias_return_thunk+0x5/0xfbef5
>> [ 6267.182642] ldlm_cli_cancel_local+0xce/0x440 [ptlrpc]
>> [ 6267.182674] ldlm_cli_cancel+0x271/0x520 [ptlrpc]
>> [ 6267.182705] ll_md_blocking_ast+0x1cd/0x300 [lustre]
>> [ 6267.182722] ldlm_handle_bl_callback+0x105/0x3e0 [ptlrpc]
>> [ 6267.182753] ldlm_bl_thread_blwi.constprop.0+0xa7/0x340 [ptlrpc]
>> [ 6267.182782] ldlm_bl_thread_main+0x533/0x610 [ptlrpc]
>> [ 6267.182811] ? __pfx_autoremove_wake_function+0x10/0x10
>> [ 6267.182817] ? __pfx_ldlm_bl_thread_main+0x10/0x10 [ptlrpc]
>> [ 6267.182846] kthread+0xdd/0x100
>> [ 6267.182851] ? __pfx_kthread+0x10/0x10
>> [ 6267.182853] ret_from_fork+0x29/0x50
>> [ 6267.182859] </TASK>
>> [ 6267.182860] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE)
>> fid(OE) lov(OE) fld(OE) osc(OE) ptlrpc(OE) ko2iblnd(OE) obdclass(OE)
>> lnet(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver libcfs(OE)
>> nfs lockd grace fscache netfs rdma_ucm(OE) rdma_cm(OE) iw_cm(OE)
>> ib_ipoib(OE) ib_cm(OE) ib_umad(OE) sunrpc binfmt_misc vfat fat
>> amd_atl intel_rapl_msr ipmi_ssif intel_rapl_common amd64_edac
>> dell_wmi edac_mce_amd ledtrig_audio sparse_keymap rfkill kvm_amd
>> mgag200 acpi_ipmi i2c_algo_bit video drm_shmem_helper kvm ipmi_si
>> dell_smbios ipmi_devintf dcdbas drm_kms_helper dell_wmi_descriptor
>> rapl wmi_bmof pcspkr i2c_piix4 ipmi_msghandler k10temp
>> acpi_power_meter fuse drm xfs libcrc32c mlx5_ib(OE) macsec
>> ib_uverbs(OE) ib_core(OE) mlx5_core(OE) mlxfw(OE) sd_mod t10_pi
>> psample ahci mlxdevm(OE) sg libahci mlx_compat(OE) crct10dif_pclmul
>> crc32_pclmul crc32c_intel tls libata ghash_clmulni_intel tg3 ccp
>> megaraid_sas pci_hyperv_intf sp5100_tco wmi dm_mirror dm_region_hash
>> dm_log dm_mod xpmem(OE)
>> [ 6267.182922] CR2: 0000000000000004
>>
>>
>> Thanks for any help you can provide.
>>
>>
>> Eric
>>
>>
>>
>>
>>
>>
>>
>> --
>> Eric J. Walter
>> Executive Director, Research Computing
>> Information Technology
>>
>> William & Mary
>> Office: 757-221-1886
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org