<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div><span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">This is likely unrelated. It looks like
<a href="https://jira.whamcloud.com/browse/LU-16772" id="LPlnk278372">https://jira.whamcloud.com/browse/LU-16772</a>, which is fixed in 2.16.0 and has a patch for 2.15.</span></div>
<div class="elementToProof"><span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof"><br>
</span></div>
<div class="elementToProof"><span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">Don't hesitate to search the JIRA site with your crash info to
 see whether there is a corresponding bug.</span></div>
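As a minimal sketch of such a search: the snippet below builds a JIRA REST search URL for a symbol taken from the crash dump (the `/rest/api/2/search` endpoint and the JQL `text ~` operator are standard JIRA features; `python3` is assumed to be available for URL-encoding).

```shell
# Sketch: build a Whamcloud JIRA search URL for a symbol from the backtrace,
# e.g. the function named in the LBUG assertion.
SYMBOL='cfs_hash_for_each_tight'
JQL="text ~ \"${SYMBOL}\" AND project = LU"

# URL-encode the JQL query string.
ENCODED=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1]))' "$JQL")
URL="https://jira.whamcloud.com/rest/api/2/search?jql=${ENCODED}&fields=key,summary"
echo "$URL"

# The matching tickets can then be fetched with, for example:
#   curl -s "$URL" | python3 -m json.tool
```

Searching on the innermost Lustre-module frames of the trace (rather than generic kernel frames like `kthread`) tends to give the most relevant hits.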
<div class="elementToProof"><span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof"><br>
</span></div>
<div class="elementToProof"><span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">Aurélien<br>
</span></div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> lustre-discuss <lustre-discuss-bounces@lists.lustre.org> on behalf of Lixin Liu via lustre-discuss <lustre-discuss@lists.lustre.org><br>
<b>Sent:</b> Wednesday, November 29, 2023 7:05 PM<br>
<b>To:</b> lustre-discuss <lustre-discuss@lists.lustre.org><br>
<b>Subject:</b> Re: [lustre-discuss] MDS crashes, lustre version 2.15.3</font>
<div> </div>
</div>
<style>
<!--
@font-face
        {font-family:"Cambria Math"}
@font-face
        {font-family:DengXian}
@font-face
        {font-family:Calibri}
@font-face
        {font-family:Aptos}
p.x_MsoNormal, li.x_MsoNormal, div.x_MsoNormal
        {margin:0cm;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif}
a:link, span.x_MsoHyperlink
        {color:blue;
        text-decoration:underline}
span.x_elementtoproof
        {}
span.x_EmailStyle20
        {font-family:"Calibri",sans-serif;
        color:windowtext}
.x_MsoChpDefault
        {font-size:10.0pt}
@page WordSection1
        {margin:72.0pt 72.0pt 72.0pt 72.0pt}
div.x_WordSection1
        {}
-->
</style>
<div lang="EN-CA" link="blue" vlink="purple" style="word-wrap:break-word">
<table bgcolor="#FFEB9C" border="1">
<tbody>
<tr>
<td><font face="verdana" color="black" size="1"><b>External email: Use caution opening links or attachments</b>
</font></td>
</tr>
</tbody>
</table>
<br>
<div>
<div class="x_WordSection1">
<p class="x_MsoNormal">Hi Aurelien,</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">Thanks. I guess we will have to rebuild our own 2.15.x server. Other crashes show a different dump, usually like this:</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">[36664.403408] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000</p>
<p class="x_MsoNormal">[36664.411237] PGD 0 P4D 0</p>
<p class="x_MsoNormal">[36664.413776] Oops: 0000 [#1] SMP PTI</p>
<p class="x_MsoNormal">[36664.417268] CPU: 28 PID: 11101 Comm: qmt_reba_cedar_ Kdump: loaded Tainted: G          IOE    --------- -  - 4.18.0-477.10.1.el8_lustre.x86_64 #1</p>
<p class="x_MsoNormal">[36664.430293] Hardware name: Dell Inc. PowerEdge R640/0CRT1G, BIOS 2.19.1 06/04/2023</p>
<p class="x_MsoNormal">[36664.437860] RIP: 0010:qmt_id_lock_cb+0x69/0x100 [lquota]</p>
<p class="x_MsoNormal">[36664.443199] Code: 48 8b 53 20 8b 4a 0c 85 c9 74 78 89 c1 48 8b 42 18 83 78 10 02 75 0a 83 e1 01 b8 01 00 00 00 74 17 48 63 44 24 04 48 c1 e0 04 <48> 03 45 00 f6 40 08 0c 0f 95 c0 0f b6 c0 48 8b 4c 24 08 65 48 33</p>
<p class="x_MsoNormal">[36664.461942] RSP: 0018:ffffaa2e303f3df0 EFLAGS: 00010246</p>
<p class="x_MsoNormal">[36664.467169] RAX: 0000000000000000 RBX: ffff98722c74b700 RCX: 0000000000000000</p>
<p class="x_MsoNormal">[36664.474301] RDX: ffff9880415ce660 RSI: 0000000000000010 RDI: ffff9881240b5c64</p>
<p class="x_MsoNormal">[36664.481435] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000004</p>
<p class="x_MsoNormal">[36664.488566] R10: 0000000000000010 R11: f000000000000000 R12: ffff98722c74b700</p>
<p class="x_MsoNormal">[36664.495697] R13: ffff9875fc07a320 R14: ffff9878444d3d10 R15: ffff9878444d3cc0</p>
<p class="x_MsoNormal">[36664.502832] FS:  0000000000000000(0000) GS:ffff987f20f80000(0000) knlGS:0000000000000000</p>
<p class="x_MsoNormal">[36664.510917] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033</p>
<p class="x_MsoNormal">[36664.516664] CR2: 0000000000000000 CR3: 0000002065a10004 CR4: 00000000007706e0</p>
<p class="x_MsoNormal">[36664.523794] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000</p>
<p class="x_MsoNormal">[36664.530927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400</p>
<p class="x_MsoNormal">[36664.538058] PKRU: 55555554</p>
<p class="x_MsoNormal">[36664.540772] Call Trace:</p>
<p class="x_MsoNormal">[36664.543231]  ? cfs_cdebug_show.part.3.constprop.23+0x20/0x20 [lquota]</p>
<p class="x_MsoNormal">[36664.549699]  qmt_glimpse_lock.isra.20+0x1e7/0xfa0 [lquota]</p>
<p class="x_MsoNormal">[36664.555204]  qmt_reba_thread+0x5cd/0x9b0 [lquota]</p>
<p class="x_MsoNormal">[36664.559927]  ? qmt_glimpse_lock.isra.20+0xfa0/0xfa0 [lquota]</p>
<p class="x_MsoNormal">[36664.565602]  kthread+0x134/0x150</p>
<p class="x_MsoNormal">[36664.568834]  ? set_kthread_struct+0x50/0x50</p>
<p class="x_MsoNormal">[36664.573021]  ret_from_fork+0x1f/0x40</p>
<p class="x_MsoNormal">[36664.576603] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) mbcache jbd2 lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ko2iblnd(OE) ptlrpc(OE)
 obdclass(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc dell_rbu vfat fat dm_round_robin dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi opa_vnic scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm
 intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dell_smbios iTCO_wdt iTCO_vendor_support wmi_bmof dell_wmi_descriptor dcdbas kvm ipmi_ssif irqbypass crct10dif_pclmul hfi1 mgag200
 crc32_pclmul drm_shmem_helper ghash_clmulni_intel rdmavt qla2xxx drm_kms_helper rapl ib_uverbs nvme_fc intel_cstate syscopyarea nvme_fabrics sysfillrect sysimgblt nvme_core intel_uncore fb_sys_fops pcspkr acpi_ipmi ib_core scsi_transport_fc igb</p>
<p class="x_MsoNormal">[36664.576699]  drm ipmi_si i2c_algo_bit mei_me dca ipmi_devintf mei i2c_i801 lpc_ich wmi ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod t10_pi sg ahci libahci crc32c_intel libata megaraid_sas dm_mirror dm_region_hash dm_log dm_mod</p>
<p class="x_MsoNormal">[36664.684758] CR2: 0000000000000000</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">Is this also related to the same bug?</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">Thanks,</p>
<p class="x_MsoNormal"> </p>
<p class="x_MsoNormal">Lixin.</p>
<p class="x_MsoNormal"> </p>
<div style="border:none; border-top:solid #B5C4DF 1.0pt; padding:3.0pt 0cm 0cm 0cm">
<p class="x_MsoNormal"><b><span style="font-size:12.0pt; color:black">From: </span>
</b><span style="font-size:12.0pt; color:black">Aurelien Degremont <adegremont@nvidia.com><br>
<b>Date: </b>Wednesday, November 29, 2023 at 8:31 AM<br>
<b>To: </b>lustre-discuss <lustre-discuss@lists.lustre.org>, Lixin Liu <liu@sfu.ca><br>
<b>Subject: </b>RE: MDS crashes, lustre version 2.15.3</span></p>
</div>
<div>
<p class="x_MsoNormal"> </p>
</div>
<div>
<p class="x_MsoNormal"><span class="x_elementtoproof"><span style="font-size:12.0pt; font-family:"Aptos",sans-serif; color:black">You are likely hitting this bug: <a href="https://jira.whamcloud.com/browse/LU-15207" originalsrc="https://jira.whamcloud.com/browse/LU-15207" shash="r88GSq8dnXJtlDxyNoY5HEWmnG7S7E/Q5X9Lj09Xx6qNNzpTJD5iBzi7rczRAk5JNeaZxFe0e/wbpiZi9jVgbp8KpRTCL0nUApgB30QgErwMxj8vzDH0iwHyIUrFz2cquwKZHICilPc1Y/s9sk63p6EI1Fe/f2vfObSGMapdEdU=">https://jira.whamcloud.com/browse/LU-15207</a>, which is fixed in the (not yet released) 2.16.0.</span></span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-size:12.0pt; font-family:"Aptos",sans-serif; color:black"> </span></p>
</div>
<div>
<p class="x_MsoNormal"><span style="font-size:12.0pt; font-family:"Aptos",sans-serif; color:black">Aurélien</span></p>
</div>
<div class="x_MsoNormal" align="center" style="text-align:center">
<hr size="0" width="100%" align="center">
</div>
<div id="x_divRplyFwdMsg">
<p class="x_MsoNormal"><b><span style="color:black">From:</span></b><span style="color:black"> lustre-discuss <lustre-discuss-bounces@lists.lustre.org> on behalf of Lixin Liu via lustre-discuss <lustre-discuss@lists.lustre.org><br>
<b>Sent:</b> Wednesday, November 29, 2023 5:18 PM<br>
<b>To:</b> lustre-discuss <lustre-discuss@lists.lustre.org><br>
<b>Subject:</b> [lustre-discuss] MDS crashes, lustre version 2.15.3</span> </p>
<div>
<p class="x_MsoNormal"> </p>
</div>
</div>
<div>
<div>
<p class="x_MsoNormal">
Hi,<br>
<br>
We built our 2.15.3 environment a few months ago. MDT is using ldiskfs and OSTs are using ZFS.<br>
The system seems to perform well at the beginning, but recently, we see frequent MDS crashes.<br>
The vmcore-dmesg.txt shows the following:<br>
<br>
[26056.031259] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) ASSERTION( !cfs_hash_is_rehashing(hs) ) failed:<br>
[26056.043494] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) LBUG<br>
[26056.051460] Pid: 69513, comm: lquota_wb_cedar 4.18.0-477.10.1.el8_lustre.x86_64 #1 SMP Tue Jun 20 00:12:13 UTC 2023<br>
[26056.063099] Call Trace TBD:<br>
[26056.066221] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]<br>
[26056.071970] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]<br>
[26056.077322] [<0>] cfs_hash_for_each_tight+0x301/0x310 [libcfs]<br>
[26056.083839] [<0>] qsd_start_reint_thread+0x561/0xcc0 [lquota]<br>
[26056.090265] [<0>] qsd_upd_thread+0xd43/0x1040 [lquota]<br>
[26056.096008] [<0>] kthread+0x134/0x150<br>
[26056.100098] [<0>] ret_from_fork+0x35/0x40<br>
[26056.104575] Kernel panic - not syncing: LBUG<br>
[26056.109337] CPU: 18 PID: 69513 Comm: lquota_wb_cedar Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-477.10.1.el8_lustre.x86_64 #1<br>
[26056.123892] Hardware name:  /086D43, BIOS 2.17.0 03/15/2023<br>
[26056.130108] Call Trace:<br>
[26056.132833]  dump_stack+0x41/0x60<br>
[26056.136532]  panic+0xe7/0x2ac<br>
[26056.139843]  ? ret_from_fork+0x35/0x40<br>
[26056.144022]  ? qsd_id_lock_cancel+0x2d0/0x2d0 [lquota]<br>
[26056.149762]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]<br>
[26056.155306]  cfs_hash_for_each_tight+0x301/0x310 [libcfs]<br>
[26056.161335]  ? wait_for_completion+0xb8/0x100<br>
[26056.166196]  qsd_start_reint_thread+0x561/0xcc0 [lquota]<br>
[26056.172128]  qsd_upd_thread+0xd43/0x1040 [lquota]<br>
[26056.177381]  ? __schedule+0x2d9/0x870<br>
[26056.181466]  ? qsd_bump_version+0x3b0/0x3b0 [lquota]<br>
[26056.187010]  kthread+0x134/0x150<br>
[26056.190608]  ? set_kthread_struct+0x50/0x50<br>
[26056.195272]  ret_from_fork+0x35/0x40<br>
<br>
We have also experienced unexpected OST drops (OSTs becoming inactive) on login nodes, and the only<br>
way to bring them back is to reboot the client.<br>
<br>
Any suggestions?<br>
<br>
Thanks,<br>
<br>
Lixin Liu<br>
Simon Fraser University<br>
<br>
_______________________________________________<br>
lustre-discuss mailing list<br>
lustre-discuss@lists.lustre.org<br>
<a href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org" originalsrc="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org" shash="aEJHmztz6e1oSSyc4GmiozNTqNgfAVXsa2PCqNxVICo5w2Qh6mawxiiT/Pl287CTtvSMT4GcqL/mYnBfF697iTVyatwq/fyVGS7Pf7jM0kIWCJFRjivva+jHV3wGX39G2AbtYqwWKvRzVzjbO/oI1zmQH63THES79gXC9jkGPM8=">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a></p>
</div>
</div>
</div>
</div>
</div>
</body>
</html>