[lustre-discuss] Kernel oops with lustre 2.15.6 on rocky 9.5 kernel 5.14.0-503.22.1.el9_5.x86_64
Oleg Drokin
green at whamcloud.com
Wed Feb 12 14:29:14 PST 2025
On Wed, 2025-02-12 at 21:26 +0000, Walter, Eric wrote:
>
> Hello,
>
>
> We have recently upgraded a cluster to Rocky 9.5 (kernel
> version 5.14.0-503.22.1.el9_5.x86_64). After upgrading to the
> lustre-2.15.6 client, we are seeing repeated kernel oopses / crashes
> when jobs are reading/writing to both of our lustre filesystems after
> about 3-4 hours of running. It is repeatable and results in a kernel
> oops referencing the ldlm process of lustre. Only our clients on
> Rocky 9.5 are affected; no other systems are having issues.
The first hit for ll_prune_negative_children on Jira leads to this ticket,
which links to the fix:
https://jira.whamcloud.com/browse/LU-18085
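
For anyone who wants to pull the search term out of a vmcore-dmesg.txt
automatically, here is a minimal sketch in Python. It assumes the usual
x86-64 "RIP: <segment>:<symbol>+0x<offset>/0x<size> [module]" layout seen
in the oops below; the function name faulting_symbol and the regular
expression are illustrative, not part of any Lustre or kernel tool.

```python
import re

# Matches the RIP line of an x86-64 oops, for example:
#   RIP: 0010:ll_prune_negative_children+0x9d/0x250 [lustre]
RIP_RE = re.compile(
    r"RIP:\s+[0-9a-f]{4}:(?P<symbol>[\w.]+)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def faulting_symbol(oops_text):
    """Return the symbol, offset, size and module from the first RIP line.

    Mail quoting often wraps the log and prefixes each line with "> ",
    so quote markers are stripped and the lines rejoined before matching.
    Returns None when no RIP line is found.
    """
    joined = " ".join(
        line.lstrip("> ").strip() for line in oops_text.splitlines()
    )
    match = RIP_RE.search(joined)
    if match is None:
        return None
    return {
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),
        "size": int(match.group("size"), 16),
        "module": match.group("module"),  # None for core-kernel symbols
    }


if __name__ == "__main__":
    sample = (
        "> [ 6267.182455] RIP: 0010:ll_prune_negative_children+0x9d/0x250\n"
        "> [lustre]\n"
    )
    print(faulting_symbol(sample))
```

The "symbol" field is the string to search for on Jira.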
>
>
> We would normally mount with o2ib (we upgraded to Mellanox driver
> version 24.10-1.1.4.0 for Rocky 9.5), however, our tests still result
> in the same ldlm kernel oops when mounted over tcp.
>
>
> The oops-related output from vmcore-dmesg.txt is posted below.
>
>
> I have looked for various known issues with 2.15.6 and can't find
> anyone else reporting this. Any ideas on what to do besides
> downgrade to Rocky 9.4? Has anyone else seen such a problem with 9.5
> and clients using v2.15.6?
>
>
> [ 6267.182434] BUG: kernel NULL pointer dereference, address:
> 0000000000000004
> [ 6267.182441] #PF: supervisor write access in kernel mode
> [ 6267.182443] #PF: error_code(0x0002) - not-present page
> [ 6267.182444] PGD 1924d7067 P4D 134554067 PUD 10ac05067 PMD 0
> [ 6267.182449] Oops: 0002 [#1] PREEMPT SMP NOPTI
> [ 6267.182451] CPU: 15 PID: 3599 Comm: ldlm_bl_04 Kdump: loaded
> Tainted: G OE ------- --- 5.14.0-
> 503.22.1.el9_5.x86_64 #1
> [ 6267.182454] Hardware name: Dell Inc. PowerEdge R6625/0NWPW3, BIOS
> 1.5.8 07/21/2023
> [ 6267.182455] RIP: 0010:ll_prune_negative_children+0x9d/0x250
> [lustre]
> [ 6267.182483] Code: 00 00 48 85 ed 74 46 48 81 ed 98 00 00 00 74 3d
> 48 83 7d 30 00 75 e4 4c 8d 7d 60 4c 89 ff e8 da 20 fb cf 48 8b 85 80
> 00 00 00 <80> 48 04 01 8b 45 64 85 c0 0f 84 ae 00 00 00 4c 89 ff e8
> ac 21 fb
> [ 6267.182485] RSP: 0018:ff75eed96a0c7c90 EFLAGS: 00010246
> [ 6267.182487] RAX: 0000000000000000 RBX: ff28db3ed37d92c0 RCX:
> 0000000000000000
> [ 6267.182488] RDX: 0000000000000001 RSI: ff28db0fdb1e00b0 RDI:
> ff28db0fc22c9860
> [ 6267.182489] RBP: ff28db0fc22c9800 R08: 0000000000000000 R09:
> ffffffa1dd3f0088
> [ 6267.182489] R10: ff28db3ec76f5c00 R11: 000000000005eee0 R12:
> ff28db3ed37d9320
> [ 6267.182490] R13: ff28db3ece52d528 R14: ff28db3ece52d4a0 R15:
> ff28db0fc22c9860
> [ 6267.182491] FS: 0000000000000000(0000) GS:ff28db3dfebc0000(0000)
> knlGS:0000000000000000
> [ 6267.182493] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 6267.182494] CR2: 0000000000000004 CR3: 0000000138eec006 CR4:
> 0000000000771ef0
> [ 6267.182495] PKRU: 55555554
> [ 6267.182495] Call Trace:
> [ 6267.182499] <TASK>
> [ 6267.182500] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 6267.182506] ? show_trace_log_lvl+0x26e/0x2df
> [ 6267.182513] ? show_trace_log_lvl+0x26e/0x2df
> [ 6267.182517] ? ll_lock_cancel_bits+0x73a/0x760 [lustre]
> [ 6267.182535] ? __die_body.cold+0x8/0xd
> [ 6267.182538] ? page_fault_oops+0x134/0x170
> [ 6267.182542] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 6267.182545] ? exc_page_fault+0x62/0x150
> [ 6267.182549] ? asm_exc_page_fault+0x22/0x30
> [ 6267.182553] ? ll_prune_negative_children+0x9d/0x250 [lustre]
> [ 6267.182570] ll_lock_cancel_bits+0x73a/0x760 [lustre]
> [ 6267.182588] ll_md_blocking_ast+0x1a3/0x300 [lustre]
> [ 6267.182606] ldlm_cancel_callback+0x7a/0x290 [ptlrpc]
> [ 6267.182639] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 6267.182642] ldlm_cli_cancel_local+0xce/0x440 [ptlrpc]
> [ 6267.182674] ldlm_cli_cancel+0x271/0x520 [ptlrpc]
> [ 6267.182705] ll_md_blocking_ast+0x1cd/0x300 [lustre]
> [ 6267.182722] ldlm_handle_bl_callback+0x105/0x3e0 [ptlrpc]
> [ 6267.182753] ldlm_bl_thread_blwi.constprop.0+0xa7/0x340 [ptlrpc]
> [ 6267.182782] ldlm_bl_thread_main+0x533/0x610 [ptlrpc]
> [ 6267.182811] ? __pfx_autoremove_wake_function+0x10/0x10
> [ 6267.182817] ? __pfx_ldlm_bl_thread_main+0x10/0x10 [ptlrpc]
> [ 6267.182846] kthread+0xdd/0x100
> [ 6267.182851] ? __pfx_kthread+0x10/0x10
> [ 6267.182853] ret_from_fork+0x29/0x50
> [ 6267.182859] </TASK>
> [ 6267.182860] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE)
> fid(OE) lov(OE) fld(OE) osc(OE) ptlrpc(OE) ko2iblnd(OE) obdclass(OE)
> lnet(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver libcfs(OE)
> nfs lockd grace fscache netfs rdma_ucm(OE) rdma_cm(OE) iw_cm(OE)
> ib_ipoib(OE) ib_cm(OE) ib_umad(OE) sunrpc binfmt_misc vfat fat
> amd_atl intel_rapl_msr ipmi_ssif intel_rapl_common amd64_edac
> dell_wmi edac_mce_amd ledtrig_audio sparse_keymap rfkill kvm_amd
> mgag200 acpi_ipmi i2c_algo_bit video drm_shmem_helper kvm ipmi_si
> dell_smbios ipmi_devintf dcdbas drm_kms_helper dell_wmi_descriptor
> rapl wmi_bmof pcspkr i2c_piix4 ipmi_msghandler k10temp
> acpi_power_meter fuse drm xfs libcrc32c mlx5_ib(OE) macsec
> ib_uverbs(OE) ib_core(OE) mlx5_core(OE) mlxfw(OE) sd_mod t10_pi
> psample ahci mlxdevm(OE) sg libahci mlx_compat(OE) crct10dif_pclmul
> crc32_pclmul crc32c_intel tls libata ghash_clmulni_intel tg3 ccp
> megaraid_sas pci_hyperv_intf sp5100_tco wmi dm_mirror dm_region_hash
> dm_log dm_mod xpmem(OE)
> [ 6267.182922] CR2: 0000000000000004
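
As a cross-check of the trace above, the faulting instruction can be
decoded by hand. The byte marked <80> in the Code: line sits at RIP, and
the four bytes "80 48 04 01" decode as a one-byte OR of an immediate into
memory at RAX+4. With RAX = 0 in the register dump, that is a write to
address 4, which matches CR2 and the "supervisor write access" line. The
sketch below is a hypothetical helper that handles only this one encoding
(opcode 0x80, 8-bit displacement, no prefixes, no SIB byte); it is not a
general disassembler.

```python
# x86 "group 1" operations selected by the reg field of ModRM for opcode 0x80.
GROUP1_OPS = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]
# 64-bit base registers selected by the rm field when no REX prefix is present.
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_group1_disp8(insn, regs):
    """Decode `80 /r disp8 imm8` and compute the effective address.

    insn: four bytes (opcode, ModRM, disp8, imm8).
    regs: dict of register name -> value taken from the oops register dump.
    """
    opcode, modrm, disp8, imm8 = insn
    if opcode != 0x80:
        raise ValueError("only opcode 0x80 is handled in this sketch")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only [reg + disp8] addressing without SIB is handled")
    disp = disp8 - 256 if disp8 >= 128 else disp8  # sign-extend disp8
    base = REGS64[rm]
    address = (regs[base] + disp) & (2**64 - 1)
    return GROUP1_OPS[reg], base, disp, imm8, address


if __name__ == "__main__":
    # Bytes from the Code: line starting at the <80> marker; RAX from the dump.
    op, base, disp, imm, addr = decode_group1_disp8(
        bytes([0x80, 0x48, 0x04, 0x01]), {"rax": 0x0}
    )
    print(f"{op} byte ptr [{base}+{disp:#x}], {imm:#x} -> address {addr:#x}")
```

The preceding bytes "48 8b 85 80 00 00 00" load RAX from RBP+0x80, so the
NULL value is a pointer read from the structure that RBP points at.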
>
>
> Thanks for any help you can provide.
>
>
> Eric
>
> --
> Eric J. Walter
> Executive Director, Research Computing
> Information Technology
>
> William & Mary
> Office: 757-221-1886
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org