[lustre-discuss] Kernel oops with lustre 2.15.6 on rocky 9.5 kernel 5.14.0-503.22.1.el9_5.x86_64

Walter, Eric ejwalt at wm.edu
Wed Feb 12 13:26:08 PST 2025


Hello,

We have recently upgraded a cluster to Rocky 9.5 (kernel version 5.14.0-503.22.1.el9_5.x86_64). After upgrading to the lustre-2.15.6 client, we are seeing repeated kernel oops / crashes when jobs are reading/writing to both of our Lustre filesystems after about 3-4 hours of running. It is repeatable and results in a kernel oops referencing the ldlm process of Lustre. This affects only our clients that are on Rocky 9.5; no other systems are having issues.

We would normally mount with o2ib (we upgraded to Mellanox driver version 24.10-1.1.4.0 for Rocky 9.5). However, our tests still result in the same ldlm kernel oops when mounted over tcp.

The oops-related output from vmcore-dmesg.txt is posted below.

I have looked for known issues with 2.15.6 and can't find anyone else reporting this. Any ideas on what to do besides downgrading to Rocky 9.4? Has anyone else seen such a problem with 9.5 and clients using v2.15.6?

[ 6267.182434] BUG: kernel NULL pointer dereference, address: 0000000000000004
[ 6267.182441] #PF: supervisor write access in kernel mode
[ 6267.182443] #PF: error_code(0x0002) - not-present page
[ 6267.182444] PGD 1924d7067 P4D 134554067 PUD 10ac05067 PMD 0
[ 6267.182449] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 6267.182451] CPU: 15 PID: 3599 Comm: ldlm_bl_04 Kdump: loaded Tainted: G           OE     -------  ---  5.14.0-503.22.1.el9_5.x86_64 #1
[ 6267.182454] Hardware name: Dell Inc. PowerEdge R6625/0NWPW3, BIOS 1.5.8 07/21/2023
[ 6267.182455] RIP: 0010:ll_prune_negative_children+0x9d/0x250 [lustre]
[ 6267.182483] Code: 00 00 48 85 ed 74 46 48 81 ed 98 00 00 00 74 3d 48 83 7d 30 00 75 e4 4c 8d 7d 60 4c 89 ff e8 da 20 fb cf 48 8b 85 80 00 00 00 <80> 48 04 01 8b 45 64 85 c0 0f 84 ae 00 00 00 4c 89 ff e8 ac 21 fb
[ 6267.182485] RSP: 0018:ff75eed96a0c7c90 EFLAGS: 00010246
[ 6267.182487] RAX: 0000000000000000 RBX: ff28db3ed37d92c0 RCX: 0000000000000000
[ 6267.182488] RDX: 0000000000000001 RSI: ff28db0fdb1e00b0 RDI: ff28db0fc22c9860
[ 6267.182489] RBP: ff28db0fc22c9800 R08: 0000000000000000 R09: ffffffa1dd3f0088
[ 6267.182489] R10: ff28db3ec76f5c00 R11: 000000000005eee0 R12: ff28db3ed37d9320
[ 6267.182490] R13: ff28db3ece52d528 R14: ff28db3ece52d4a0 R15: ff28db0fc22c9860
[ 6267.182491] FS:  0000000000000000(0000) GS:ff28db3dfebc0000(0000) knlGS:0000000000000000
[ 6267.182493] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6267.182494] CR2: 0000000000000004 CR3: 0000000138eec006 CR4: 0000000000771ef0
[ 6267.182495] PKRU: 55555554
[ 6267.182495] Call Trace:
[ 6267.182499]  <TASK>
[ 6267.182500]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 6267.182506]  ? show_trace_log_lvl+0x26e/0x2df
[ 6267.182513]  ? show_trace_log_lvl+0x26e/0x2df
[ 6267.182517]  ? ll_lock_cancel_bits+0x73a/0x760 [lustre]
[ 6267.182535]  ? __die_body.cold+0x8/0xd
[ 6267.182538]  ? page_fault_oops+0x134/0x170
[ 6267.182542]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 6267.182545]  ? exc_page_fault+0x62/0x150
[ 6267.182549]  ? asm_exc_page_fault+0x22/0x30
[ 6267.182553]  ? ll_prune_negative_children+0x9d/0x250 [lustre]
[ 6267.182570]  ll_lock_cancel_bits+0x73a/0x760 [lustre]
[ 6267.182588]  ll_md_blocking_ast+0x1a3/0x300 [lustre]
[ 6267.182606]  ldlm_cancel_callback+0x7a/0x290 [ptlrpc]
[ 6267.182639]  ? srso_alias_return_thunk+0x5/0xfbef5
[ 6267.182642]  ldlm_cli_cancel_local+0xce/0x440 [ptlrpc]
[ 6267.182674]  ldlm_cli_cancel+0x271/0x520 [ptlrpc]
[ 6267.182705]  ll_md_blocking_ast+0x1cd/0x300 [lustre]
[ 6267.182722]  ldlm_handle_bl_callback+0x105/0x3e0 [ptlrpc]
[ 6267.182753]  ldlm_bl_thread_blwi.constprop.0+0xa7/0x340 [ptlrpc]
[ 6267.182782]  ldlm_bl_thread_main+0x533/0x610 [ptlrpc]
[ 6267.182811]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 6267.182817]  ? __pfx_ldlm_bl_thread_main+0x10/0x10 [ptlrpc]
[ 6267.182846]  kthread+0xdd/0x100
[ 6267.182851]  ? __pfx_kthread+0x10/0x10
[ 6267.182853]  ret_from_fork+0x29/0x50
[ 6267.182859]  </TASK>
[ 6267.182860] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE) fid(OE) lov(OE) fld(OE) osc(OE) ptlrpc(OE) ko2iblnd(OE) obdclass(OE) lnet(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver libcfs(OE) nfs lockd grace fscache netfs rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) sunrpc binfmt_misc vfat fat amd_atl intel_rapl_msr ipmi_ssif intel_rapl_common amd64_edac dell_wmi edac_mce_amd ledtrig_audio sparse_keymap rfkill kvm_amd mgag200 acpi_ipmi i2c_algo_bit video drm_shmem_helper kvm ipmi_si dell_smbios ipmi_devintf dcdbas drm_kms_helper dell_wmi_descriptor rapl wmi_bmof pcspkr i2c_piix4 ipmi_msghandler k10temp acpi_power_meter fuse drm xfs libcrc32c mlx5_ib(OE) macsec ib_uverbs(OE) ib_core(OE) mlx5_core(OE) mlxfw(OE) sd_mod t10_pi psample ahci mlxdevm(OE) sg libahci mlx_compat(OE) crct10dif_pclmul crc32_pclmul crc32c_intel tls libata ghash_clmulni_intel tg3 ccp megaraid_sas pci_hyperv_intf sp5100_tco wmi dm_mirror dm_region_hash dm_log dm_mod xpmem(OE)
[ 6267.182922] CR2: 0000000000000004
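
For anyone comparing this against their own crash dumps, here is a minimal sketch of pulling the key fields (faulting address, thread name, faulting function and module) out of an oops excerpt. The helper name summarize_oops and the sample string are illustrative only; they are not part of Lustre or kdump, and the regular expressions assume the x86_64 oops line formats shown above.

```python
import re

# Hypothetical helper (not part of Lustre or kdump): extract the key
# fields from an oops excerpt such as the one in vmcore-dmesg.txt.
OOPS_SAMPLE = """\
[ 6267.182434] BUG: kernel NULL pointer dereference, address: 0000000000000004
[ 6267.182451] CPU: 15 PID: 3599 Comm: ldlm_bl_04 Kdump: loaded Tainted: G           OE     -------  ---  5.14.0-503.22.1.el9_5.x86_64 #1
[ 6267.182455] RIP: 0010:ll_prune_negative_children+0x9d/0x250 [lustre]
"""

def summarize_oops(text):
    """Return a dict with whichever of the known fields are present."""
    summary = {}

    # "BUG: <description>, address: <hex>"
    m = re.search(r"BUG: (.+?), address: ([0-9a-f]+)", text)
    if m:
        summary["bug"] = m.group(1)
        summary["address"] = int(m.group(2), 16)

    # "Comm: <thread name>"
    m = re.search(r"Comm: (\S+)", text)
    if m:
        summary["comm"] = m.group(1)

    # "RIP: <segment>:<function>+<offset>/<size> [<module>]"
    m = re.search(
        r"RIP: [0-9a-f]+:(\w+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)(?: \[(\w+)\])?",
        text,
    )
    if m:
        summary["function"] = m.group(1)
        summary["offset"] = int(m.group(2), 16)
        summary["module"] = m.group(4)

    return summary

if __name__ == "__main__":
    for key, value in summarize_oops(OOPS_SAMPLE).items():
        print(f"{key}: {value}")
```

Run over several vmcore-dmesg.txt files, a summary like this makes it easier to check whether every crash lands in the same function and thread.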

Thanks for any help you can provide.

Eric


--
Eric J. Walter
Executive Director, Research Computing
Information Technology
William & Mary
Office: 757-221-1886    