<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; color: rgb(0, 0, 0);">
<span style="font-size: 12pt;">Hi </span><span style="font-size: 11pt;">Simppa,</span></div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
This patch has already been backported to the 2.15 branch; see <a href="https://review.whamcloud.com/c/fs/lustre-release/+/57007" id="LPlnk880732" class="OWAAutoLink">
https://review.whamcloud.com/c/fs/lustre-release/+/57007</a>.</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
It is already merged into the b2_15 branch, where it is the 3<sup>rd</sup> patch on top of 2.15.6, so you should probably look at that branch:</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
<a href="https://git.whamcloud.com/?p=fs/lustre-release.git;a=shortlog;h=refs/heads/b2_15" id="LPlnk">https://git.whamcloud.com/?p=fs/lustre-release.git;a=shortlog;h=refs/heads/b2_15</a></div>
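<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
One way to confirm that the fix is on the branch is to search the commits that landed after the release tag for the ticket number. The following is only a minimal sketch, assuming an existing lustre-release clone with b2_15 fetched; the helper name fix_landed_after is hypothetical, while v2_15_6, b2_15 and LU-18085 are the tag, branch and ticket named in this thread.</div>
<pre>
```shell
# Hypothetical helper: report whether a commit mentioning a ticket
# landed on a branch after a given tag.
# Arguments: repository directory, base tag, branch, ticket id.
fix_landed_after() {
    repo=$1; base=$2; branch=$3; ticket=$4
    # List commits reachable from the branch but not from the tag,
    # keeping only those whose commit message mentions the ticket.
    hits=$(git -C "$repo" log --oneline --grep="$ticket" "$base..$branch")
    # Succeed only if at least one matching commit was found.
    [ -n "$hits" ]
}

# Example, run inside a lustre-release clone (assumed to exist already):
#   fix_landed_after . v2_15_6 origin/b2_15 LU-18085 || echo "fix not found"
```
</pre>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
Searching by ticket number rather than by commit hash is deliberate: a backported commit gets a new hash on b2_15, so the hash from master would not be found there.</div>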
<div style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 11pt; color: rgb(0, 0, 0);">
Aurélien</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> lustre-discuss <lustre-discuss-bounces@lists.lustre.org> on behalf of Äkäslompolo Simppa <simppa.akaslompolo@aalto.fi><br>
<b>Sent:</b> Thursday, February 13, 2025 07:04<br>
<b>To:</b> Andreas Dilger <adilger@ddn.com>; Oleg Drokin <green@whamcloud.com><br>
<b>Cc:</b> lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>; ejwalt@wm.edu <ejwalt@wm.edu><br>
<b>Subject:</b> Re: [lustre-discuss] Kernel oops with lustre 2.15.6 on rocky 9.5 kernel 5.14.0-503.22.1.el9_5.x86_64</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">
Hi!<br>
<br>
We have been suffering from this on RHEL 9.5 for a couple of weeks now. I finally got kernel crash dumps saved, and I also see a similar "RIP: 0010:ll_prune_negative_children".<br>
<br>
I tried applying the patch on top of commit f7948c626181cda1f72d148adc73ad499eb60307 (HEAD -> b2_15, tag: v2_15_6, tag: 2.15.6, origin/b2_15):<br>
git cherry-pick 983999bda71115595df48d614ca1aaf9b746c75f<br>
<br>
I get a conflict in lustre/llite/statahead.c<br>
<br>
<<<<<<< HEAD<br>
if (lld_is_init(*dentryp))<br>
ll_d2d(*dentryp)->lld_sa_generation = lli->lli_sa_generation;<br>
sa_put(sai, entry);<br>
spin_lock(&lli->lli_sa_lock);<br>
if (sai->sai_task)<br>
wake_up_process(sai->sai_task);<br>
spin_unlock(&lli->lli_sa_lock);<br>
=======<br>
rcu_read_lock();<br>
lld = ll_d2d(*dentryp);<br>
if (lld)<br>
lld->lld_sa_generation = lli->lli_sa_generation;<br>
rcu_read_unlock();<br>
sa_put(dir, sai, entry);<br>
>>>>>>> 983999bda7 (LU-18085 llite: use RCU to protect the dentry_data)<br>
<br>
<br>
How should we resolve this?<br>
<br>
Thanks.<br>
--<br>
- Simppa -<br>
Mr. Simppa Äkäslompolo<br>
High performance computing specialist<br>
Doctor of Science (Tech.)<br>
Aalto Scientific Computing<br>
School of Science, Aalto University, Finland<br>
<br>
+358-50-5311327<br>
<a href="https://scicomp.aalto.fi/">https://scicomp.aalto.fi/</a><br>
<br>
<br>
<br>
________________________________________<br>
From: lustre-discuss <lustre-discuss-bounces@lists.lustre.org> on behalf of Andreas Dilger <adilger@ddn.com><br>
Sent: Thursday, February 13, 2025 05:43<br>
To: Oleg Drokin<br>
Cc: lustre-discuss@lists.lustre.org; ejwalt@wm.edu<br>
Subject: Re: [lustre-discuss] Kernel oops with lustre 2.15.6 on rocky 9.5 kernel 5.14.0-503.22.1.el9_5.x86_64<br>
<br>
Even better would be to apply the patch locally on top of 2.15.6, which will allow keeping el9.5 and also confirm that this patch is actually fixing this problem.<br>
<br>
Cheers, Andreas<br>
<br>
> On Feb 12, 2025, at 17:31, Oleg Drokin <green@whamcloud.com> wrote:<br>
><br>
> On Wed, 2025-02-12 at 21:26 +0000, Walter, Eric wrote:<br>
>><br>
>> Hello,<br>
>><br>
>><br>
>> We have recently upgraded a cluster to Rocky 9.5 (kernel<br>
>> version 5.14.0-503.22.1.el9_5.x86_64). After upgrading to the<br>
>> lustre-2.15.6 client, we are seeing repeated kernel oopses / crashes when jobs<br>
>> are reading/writing to both of our lustre filesystems after about 3-4<br>
>> hours of running. It is repeatable and results in a kernel oops<br>
>> referencing the ldlm process of lustre. This affects only our clients<br>
>> that are on Rocky 9.5; no other systems are having issues.<br>
><br>
> first hit for ll_prune_negative_children on jira leads to this ticket<br>
> that links to the fix:<br>
> <a href="https://jira.whamcloud.com/browse/LU-18085">https://jira.whamcloud.com/browse/LU-18085</a><br>
><br>
>><br>
>><br>
>> We would normally mount with o2ib (we upgraded to Mellanox driver<br>
>> version 24.10-1.1.4.0 for Rocky 9.5), however, our tests still result<br>
>> in the same ldlm kernel oops when mounted over tcp.<br>
>><br>
>><br>
>> The oops-related output from vmcore-dmesg.txt is posted below.<br>
>><br>
>><br>
>> I have looked through the known issues with 2.15.6 and can't find<br>
>> anyone else reporting this. Any ideas on what to do besides<br>
>> downgrading to Rocky 9.4? Has anyone else seen such a problem with 9.5<br>
>> and clients using v2.15.6?<br>
>><br>
>><br>
>> [ 6267.182434] BUG: kernel NULL pointer dereference, address:<br>
>> 0000000000000004<br>
>> [ 6267.182441] #PF: supervisor write access in kernel mode<br>
>> [ 6267.182443] #PF: error_code(0x0002) - not-present page<br>
>> [ 6267.182444] PGD 1924d7067 P4D 134554067 PUD 10ac05067 PMD 0<br>
>> [ 6267.182449] Oops: 0002 [#1] PREEMPT SMP NOPTI<br>
>> [ 6267.182451] CPU: 15 PID: 3599 Comm: ldlm_bl_04 Kdump: loaded Tainted: G OE ------- --- 5.14.0-503.22.1.el9_5.x86_64 #1<br>
>> [ 6267.182454] Hardware name: Dell Inc. PowerEdge R6625/0NWPW3, BIOS<br>
>> 1.5.8 07/21/2023<br>
>> [ 6267.182455] RIP: 0010:ll_prune_negative_children+0x9d/0x250<br>
>> [lustre]<br>
>> [ 6267.182483] Code: 00 00 48 85 ed 74 46 48 81 ed 98 00 00 00 74 3d<br>
>> 48 83 7d 30 00 75 e4 4c 8d 7d 60 4c 89 ff e8 da 20 fb cf 48 8b 85 80<br>
>> 00 00 00 <80> 48 04 01 8b 45 64 85 c0 0f 84 ae 00 00 00 4c 89 ff e8<br>
>> ac 21 fb<br>
>> [ 6267.182485] RSP: 0018:ff75eed96a0c7c90 EFLAGS: 00010246<br>
>> [ 6267.182487] RAX: 0000000000000000 RBX: ff28db3ed37d92c0 RCX:<br>
>> 0000000000000000<br>
>> [ 6267.182488] RDX: 0000000000000001 RSI: ff28db0fdb1e00b0 RDI:<br>
>> ff28db0fc22c9860<br>
>> [ 6267.182489] RBP: ff28db0fc22c9800 R08: 0000000000000000 R09:<br>
>> ffffffa1dd3f0088<br>
>> [ 6267.182489] R10: ff28db3ec76f5c00 R11: 000000000005eee0 R12:<br>
>> ff28db3ed37d9320<br>
>> [ 6267.182490] R13: ff28db3ece52d528 R14: ff28db3ece52d4a0 R15:<br>
>> ff28db0fc22c9860<br>
>> [ 6267.182491] FS: 0000000000000000(0000) GS:ff28db3dfebc0000(0000)<br>
>> knlGS:0000000000000000<br>
>> [ 6267.182493] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033<br>
>> [ 6267.182494] CR2: 0000000000000004 CR3: 0000000138eec006 CR4:<br>
>> 0000000000771ef0<br>
>> [ 6267.182495] PKRU: 55555554<br>
>> [ 6267.182495] Call Trace:<br>
>> [ 6267.182499] <TASK><br>
>> [ 6267.182500] ? srso_alias_return_thunk+0x5/0xfbef5<br>
>> [ 6267.182506] ? show_trace_log_lvl+0x26e/0x2df<br>
>> [ 6267.182513] ? show_trace_log_lvl+0x26e/0x2df<br>
>> [ 6267.182517] ? ll_lock_cancel_bits+0x73a/0x760 [lustre]<br>
>> [ 6267.182535] ? __die_body.cold+0x8/0xd<br>
>> [ 6267.182538] ? page_fault_oops+0x134/0x170<br>
>> [ 6267.182542] ? srso_alias_return_thunk+0x5/0xfbef5<br>
>> [ 6267.182545] ? exc_page_fault+0x62/0x150<br>
>> [ 6267.182549] ? asm_exc_page_fault+0x22/0x30<br>
>> [ 6267.182553] ? ll_prune_negative_children+0x9d/0x250 [lustre]<br>
>> [ 6267.182570] ll_lock_cancel_bits+0x73a/0x760 [lustre]<br>
>> [ 6267.182588] ll_md_blocking_ast+0x1a3/0x300 [lustre]<br>
>> [ 6267.182606] ldlm_cancel_callback+0x7a/0x290 [ptlrpc]<br>
>> [ 6267.182639] ? srso_alias_return_thunk+0x5/0xfbef5<br>
>> [ 6267.182642] ldlm_cli_cancel_local+0xce/0x440 [ptlrpc]<br>
>> [ 6267.182674] ldlm_cli_cancel+0x271/0x520 [ptlrpc]<br>
>> [ 6267.182705] ll_md_blocking_ast+0x1cd/0x300 [lustre]<br>
>> [ 6267.182722] ldlm_handle_bl_callback+0x105/0x3e0 [ptlrpc]<br>
>> [ 6267.182753] ldlm_bl_thread_blwi.constprop.0+0xa7/0x340 [ptlrpc]<br>
>> [ 6267.182782] ldlm_bl_thread_main+0x533/0x610 [ptlrpc]<br>
>> [ 6267.182811] ? __pfx_autoremove_wake_function+0x10/0x10<br>
>> [ 6267.182817] ? __pfx_ldlm_bl_thread_main+0x10/0x10 [ptlrpc]<br>
>> [ 6267.182846] kthread+0xdd/0x100<br>
>> [ 6267.182851] ? __pfx_kthread+0x10/0x10<br>
>> [ 6267.182853] ret_from_fork+0x29/0x50<br>
>> [ 6267.182859] </TASK><br>
>> [ 6267.182860] Modules linked in: mgc(OE) lustre(OE) lmv(OE) mdc(OE)<br>
>> fid(OE) lov(OE) fld(OE) osc(OE) ptlrpc(OE) ko2iblnd(OE) obdclass(OE)<br>
>> lnet(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver libcfs(OE)<br>
>> nfs lockd grace fscache netfs rdma_ucm(OE) rdma_cm(OE) iw_cm(OE)<br>
>> ib_ipoib(OE) ib_cm(OE) ib_umad(OE) sunrpc binfmt_misc vfat fat<br>
>> amd_atl intel_rapl_msr ipmi_ssif intel_rapl_common amd64_edac<br>
>> dell_wmi edac_mce_amd ledtrig_audio sparse_keymap rfkill kvm_amd<br>
>> mgag200 acpi_ipmi i2c_algo_bit video drm_shmem_helper kvm ipmi_si<br>
>> dell_smbios ipmi_devintf dcdbas drm_kms_helper dell_wmi_descriptor<br>
>> rapl wmi_bmof pcspkr i2c_piix4 ipmi_msghandler k10temp<br>
>> acpi_power_meter fuse drm xfs libcrc32c mlx5_ib(OE) macsec<br>
>> ib_uverbs(OE) ib_core(OE) mlx5_core(OE) mlxfw(OE) sd_mod t10_pi<br>
>> psample ahci mlxdevm(OE) sg libahci mlx_compat(OE) crct10dif_pclmul<br>
>> crc32_pclmul crc32c_intel tls libata ghash_clmulni_intel tg3 ccp<br>
>> megaraid_sas pci_hyperv_intf sp5100_tco wmi dm_mirror dm_region_hash<br>
>> dm_log dm_mod xpmem(OE)<br>
>> [ 6267.182922] CR2: 0000000000000004<br>
>><br>
>><br>
>> Thanks for any help you can provide.<br>
>><br>
>><br>
>> Eric<br>
>><br>
>><br>
>><br>
>><br>
>><br>
>><br>
>><br>
>> --<br>
>> Eric J. Walter<br>
>> Executive Director, Research Computing<br>
>> Information Technology<br>
>><br>
>> William & Mary<br>
>> Office: 757-221-1886 <br>
>> _______________________________________________<br>
>> lustre-discuss mailing list<br>
>> lustre-discuss@lists.lustre.org<br>
>> <a href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a><br>
</div>
</span></font></div>
</body>
</html>