<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:DengXian;
panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Aptos;
panose-1:2 11 0 4 2 2 2 2 2 4;}
@font-face
{font-family:"\@DengXian";
panose-1:2 1 6 0 3 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
span.elementtoproof
{mso-style-name:elementtoproof;}
span.EmailStyle20
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;
mso-ligatures:none;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-CA" link="blue" vlink="purple" style="word-wrap:break-word">
<div class="WordSection1">
<p class="MsoNormal">Hi Aurelien,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Thanks, I guess we will have to rebuild our own 2.15.x server. I see that other crashes have a different dump, usually like this:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
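<p class="MsoNormal">For reference, a rough sketch of what rebuilding a patched 2.15.x server tree could look like, following the standard Lustre build procedure. This is only an assumption-laden outline: the repository URL, tag name, and configure flags come from the usual Lustre build docs, the <code>git log --grep</code> step merely locates candidate LU-15207 commits, and the fix may not cherry-pick cleanly onto 2.15.3 without backporting.<o:p></o:p></p>
<pre># Clone the Lustre release tree and branch from the 2.15.3 tag
git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout -b 2.15.3-patched 2.15.3

# Locate the commit(s) referencing LU-15207, then cherry-pick one
git log --oneline --grep='LU-15207' master
# git cherry-pick &lt;commit&gt;   # may need manual backporting

# Rebuild server RPMs against the installed kernel
sh autogen.sh
./configure --enable-server
make rpms</pre>
<p class="MsoNormal"><o:p> </o:p></p>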
<p class="MsoNormal">[36664.403408] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000<o:p></o:p></p>
<p class="MsoNormal">[36664.411237] PGD 0 P4D 0<o:p></o:p></p>
<p class="MsoNormal">[36664.413776] Oops: 0000 [#1] SMP PTI<o:p></o:p></p>
<p class="MsoNormal">[36664.417268] CPU: 28 PID: 11101 Comm: qmt_reba_cedar_ Kdump: loaded Tainted: G IOE --------- - - 4.18.0-477.10.1.el8_lustre.x86_64 #1<o:p></o:p></p>
<p class="MsoNormal">[36664.430293] Hardware name: Dell Inc. PowerEdge R640/0CRT1G, BIOS 2.19.1 06/04/2023<o:p></o:p></p>
<p class="MsoNormal">[36664.437860] RIP: 0010:qmt_id_lock_cb+0x69/0x100 [lquota]<o:p></o:p></p>
<p class="MsoNormal">[36664.443199] Code: 48 8b 53 20 8b 4a 0c 85 c9 74 78 89 c1 48 8b 42 18 83 78 10 02 75 0a 83 e1 01 b8 01 00 00 00 74 17 48 63 44 24 04 48 c1 e0 04 <48> 03 45 00 f6 40 08 0c 0f 95 c0 0f b6 c0 48 8b 4c 24 08 65 48 33<o:p></o:p></p>
<p class="MsoNormal">[36664.461942] RSP: 0018:ffffaa2e303f3df0 EFLAGS: 00010246<o:p></o:p></p>
<p class="MsoNormal">[36664.467169] RAX: 0000000000000000 RBX: ffff98722c74b700 RCX: 0000000000000000<o:p></o:p></p>
<p class="MsoNormal">[36664.474301] RDX: ffff9880415ce660 RSI: 0000000000000010 RDI: ffff9881240b5c64<o:p></o:p></p>
<p class="MsoNormal">[36664.481435] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000004<o:p></o:p></p>
<p class="MsoNormal">[36664.488566] R10: 0000000000000010 R11: f000000000000000 R12: ffff98722c74b700<o:p></o:p></p>
<p class="MsoNormal">[36664.495697] R13: ffff9875fc07a320 R14: ffff9878444d3d10 R15: ffff9878444d3cc0<o:p></o:p></p>
<p class="MsoNormal">[36664.502832] FS: 0000000000000000(0000) GS:ffff987f20f80000(0000) knlGS:0000000000000000<o:p></o:p></p>
<p class="MsoNormal">[36664.510917] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033<o:p></o:p></p>
<p class="MsoNormal">[36664.516664] CR2: 0000000000000000 CR3: 0000002065a10004 CR4: 00000000007706e0<o:p></o:p></p>
<p class="MsoNormal">[36664.523794] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000<o:p></o:p></p>
<p class="MsoNormal">[36664.530927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400<o:p></o:p></p>
<p class="MsoNormal">[36664.538058] PKRU: 55555554<o:p></o:p></p>
<p class="MsoNormal">[36664.540772] Call Trace:<o:p></o:p></p>
<p class="MsoNormal">[36664.543231] ? cfs_cdebug_show.part.3.constprop.23+0x20/0x20 [lquota]<o:p></o:p></p>
<p class="MsoNormal">[36664.549699] qmt_glimpse_lock.isra.20+0x1e7/0xfa0 [lquota]<o:p></o:p></p>
<p class="MsoNormal">[36664.555204] qmt_reba_thread+0x5cd/0x9b0 [lquota]<o:p></o:p></p>
<p class="MsoNormal">[36664.559927] ? qmt_glimpse_lock.isra.20+0xfa0/0xfa0 [lquota]<o:p></o:p></p>
<p class="MsoNormal">[36664.565602] kthread+0x134/0x150<o:p></o:p></p>
<p class="MsoNormal">[36664.568834] ? set_kthread_struct+0x50/0x50<o:p></o:p></p>
<p class="MsoNormal">[36664.573021] ret_from_fork+0x1f/0x40<o:p></o:p></p>
<p class="MsoNormal">[36664.576603] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) mbcache jbd2 lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ko2iblnd(OE) ptlrpc(OE)
obdclass(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc dell_rbu vfat fat dm_round_robin dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi opa_vnic scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm
intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dell_smbios iTCO_wdt iTCO_vendor_support wmi_bmof dell_wmi_descriptor dcdbas kvm ipmi_ssif irqbypass crct10dif_pclmul hfi1 mgag200
crc32_pclmul drm_shmem_helper ghash_clmulni_intel rdmavt qla2xxx drm_kms_helper rapl ib_uverbs nvme_fc intel_cstate syscopyarea nvme_fabrics sysfillrect sysimgblt nvme_core intel_uncore fb_sys_fops pcspkr acpi_ipmi ib_core scsi_transport_fc igb<o:p></o:p></p>
<p class="MsoNormal">[36664.576699] drm ipmi_si i2c_algo_bit mei_me dca ipmi_devintf mei i2c_i801 lpc_ich wmi ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod t10_pi sg ahci libahci crc32c_intel libata megaraid_sas dm_mirror dm_region_hash dm_log dm_mod<o:p></o:p></p>
<p class="MsoNormal">[36664.684758] CR2: 0000000000000000<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Is this also related to the same bug?<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Thanks,<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Lixin.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span style="font-size:12.0pt;color:black">From: </span></b><span style="font-size:12.0pt;color:black">Aurelien Degremont <adegremont@nvidia.com><br>
<b>Date: </b>Wednesday, November 29, 2023 at 8:31 AM<br>
<b>To: </b>lustre-discuss <lustre-discuss@lists.lustre.org>, Lixin Liu <liu@sfu.ca><br>
<b>Subject: </b>RE: MDS crashes, lustre version 2.15.3<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><span class="elementtoproof"><span style="font-size:12.0pt;font-family:"Aptos",sans-serif;color:black">You are likely hitting this bug: <a href="https://jira.whamcloud.com/browse/LU-15207">https://jira.whamcloud.com/browse/LU-15207</a>, which is fixed in the not-yet-released 2.16.0.</span></span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"Aptos",sans-serif;color:black"><o:p> </o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:12.0pt;font-family:"Aptos",sans-serif;color:black">Aurélien<o:p></o:p></span></p>
</div>
<div class="MsoNormal" align="center" style="text-align:center">
<hr size="0" width="100%" align="center">
</div>
<div id="divRplyFwdMsg">
<p class="MsoNormal"><b><span style="color:black">From:</span></b><span style="color:black"> lustre-discuss <lustre-discuss-bounces@lists.lustre.org> on behalf of Lixin Liu via lustre-discuss <lustre-discuss@lists.lustre.org><br>
<b>Sent:</b> Wednesday, November 29, 2023 5:18 PM<br>
<b>To:</b> lustre-discuss <lustre-discuss@lists.lustre.org><br>
<b>Subject:</b> [lustre-discuss] MDS crashes, lustre version 2.15.3</span> <o:p></o:p></p>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
</div>
<div>
<div>
<p class="MsoNormal">Hi,<br>
<br>
We built our 2.15.3 environment a few months ago. The MDT is using ldiskfs and the OSTs are using ZFS.<br>
The system performed well at the beginning, but recently we have been seeing frequent MDS crashes.<br>
The vmcore-dmesg.txt shows the following:<br>
<br>
[26056.031259] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) ASSERTION( !cfs_hash_is_rehashing(hs) ) failed:<br>
[26056.043494] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) LBUG<br>
[26056.051460] Pid: 69513, comm: lquota_wb_cedar 4.18.0-477.10.1.el8_lustre.x86_64 #1 SMP Tue Jun 20 00:12:13 UTC 2023<br>
[26056.063099] Call Trace TBD:<br>
[26056.066221] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]<br>
[26056.071970] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]<br>
[26056.077322] [<0>] cfs_hash_for_each_tight+0x301/0x310 [libcfs]<br>
[26056.083839] [<0>] qsd_start_reint_thread+0x561/0xcc0 [lquota]<br>
[26056.090265] [<0>] qsd_upd_thread+0xd43/0x1040 [lquota]<br>
[26056.096008] [<0>] kthread+0x134/0x150<br>
[26056.100098] [<0>] ret_from_fork+0x35/0x40<br>
[26056.104575] Kernel panic - not syncing: LBUG<br>
[26056.109337] CPU: 18 PID: 69513 Comm: lquota_wb_cedar Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.10.1.el8_lustre.x86_64 #1<br>
[26056.123892] Hardware name: /086D43, BIOS 2.17.0 03/15/2023<br>
[26056.130108] Call Trace:<br>
[26056.132833] dump_stack+0x41/0x60<br>
[26056.136532] panic+0xe7/0x2ac<br>
[26056.139843] ? ret_from_fork+0x35/0x40<br>
[26056.144022] ? qsd_id_lock_cancel+0x2d0/0x2d0 [lquota]<br>
[26056.149762] lbug_with_loc.cold.8+0x18/0x18 [libcfs]<br>
[26056.155306] cfs_hash_for_each_tight+0x301/0x310 [libcfs]<br>
[26056.161335] ? wait_for_completion+0xb8/0x100<br>
[26056.166196] qsd_start_reint_thread+0x561/0xcc0 [lquota]<br>
[26056.172128] qsd_upd_thread+0xd43/0x1040 [lquota]<br>
[26056.177381] ? __schedule+0x2d9/0x870<br>
[26056.181466] ? qsd_bump_version+0x3b0/0x3b0 [lquota]<br>
[26056.187010] kthread+0x134/0x150<br>
[26056.190608] ? set_kthread_struct+0x50/0x50<br>
[26056.195272] ret_from_fork+0x35/0x40<br>
<br>
We have also experienced unexpected OST drops (OSTs switching to inactive mode) on login nodes, and the only<br>
way to bring them back is to reboot the client.<br>
<br>
Any suggestions?<br>
<br>
Thanks,<br>
<br>
Lixin Liu<br>
Simon Fraser University<br>
<br>
_______________________________________________<br>
lustre-discuss mailing list<br>
lustre-discuss@lists.lustre.org<br>
<a href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a><o:p></o:p></p>
</div>
</div>
</div>
</body>
</html>