<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div><span style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);" class="elementToProof">You are likely hitting this bug: <a href="https://jira.whamcloud.com/browse/LU-15207" id="LPlnk475999">
https://jira.whamcloud.com/browse/LU-15207</a>, which is fixed in 2.16.0 (not yet released).<br>
</span></div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div class="elementToProof" style="font-family: Aptos, Aptos_EmbeddedFont, Aptos_MSFontService, Calibri, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Aurélien<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> lustre-discuss &lt;lustre-discuss-bounces@lists.lustre.org&gt; on behalf of Lixin Liu via lustre-discuss &lt;lustre-discuss@lists.lustre.org&gt;<br>
<b>Sent:</b> Wednesday, November 29, 2023 17:18<br>
<b>To:</b> lustre-discuss &lt;lustre-discuss@lists.lustre.org&gt;<br>
<b>Subject:</b> [lustre-discuss] MDS crashes, lustre version 2.15.3</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">
Hi,<br>
<br>
We built our Lustre 2.15.3 environment a few months ago. The MDT uses ldiskfs and the OSTs use ZFS.<br>
The system performed well at first, but recently we have been seeing frequent MDS crashes.<br>
The vmcore-dmesg.txt shows the following:<br>
<br>
[26056.031259] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) ASSERTION( !cfs_hash_is_rehashing(hs) ) failed:<br>
[26056.043494] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) LBUG<br>
[26056.051460] Pid: 69513, comm: lquota_wb_cedar 4.18.0-477.10.1.el8_lustre.x86_64 #1 SMP Tue Jun 20 00:12:13 UTC 2023<br>
[26056.063099] Call Trace TBD:<br>
[26056.066221] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]<br>
[26056.071970] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]<br>
[26056.077322] [<0>] cfs_hash_for_each_tight+0x301/0x310 [libcfs]<br>
[26056.083839] [<0>] qsd_start_reint_thread+0x561/0xcc0 [lquota]<br>
[26056.090265] [<0>] qsd_upd_thread+0xd43/0x1040 [lquota]<br>
[26056.096008] [<0>] kthread+0x134/0x150<br>
[26056.100098] [<0>] ret_from_fork+0x35/0x40<br>
[26056.104575] Kernel panic - not syncing: LBUG<br>
[26056.109337] CPU: 18 PID: 69513 Comm: lquota_wb_cedar Kdump: loaded Tainted: G OE --------- - - 4.18.0-477.10.1.el8_lustre.x86_64 #1<br>
[26056.123892] Hardware name: /086D43, BIOS 2.17.0 03/15/2023<br>
[26056.130108] Call Trace:<br>
[26056.132833] dump_stack+0x41/0x60<br>
[26056.136532] panic+0xe7/0x2ac<br>
[26056.139843] ? ret_from_fork+0x35/0x40<br>
[26056.144022] ? qsd_id_lock_cancel+0x2d0/0x2d0 [lquota]<br>
[26056.149762] lbug_with_loc.cold.8+0x18/0x18 [libcfs]<br>
[26056.155306] cfs_hash_for_each_tight+0x301/0x310 [libcfs]<br>
[26056.161335] ? wait_for_completion+0xb8/0x100<br>
[26056.166196] qsd_start_reint_thread+0x561/0xcc0 [lquota]<br>
[26056.172128] qsd_upd_thread+0xd43/0x1040 [lquota]<br>
[26056.177381] ? __schedule+0x2d9/0x870<br>
[26056.181466] ? qsd_bump_version+0x3b0/0x3b0 [lquota]<br>
[26056.187010] kthread+0x134/0x150<br>
[26056.190608] ? set_kthread_struct+0x50/0x50<br>
[26056.195272] ret_from_fork+0x35/0x40<br>
<br>
We have also seen OSTs drop unexpectedly (change to inactive mode) on the login nodes, and the only<br>
way to bring an OST back is to reboot the client.<br>
<br>
Any suggestions?<br>
<br>
Thanks,<br>
<br>
Lixin Liu<br>
Simon Fraser University<br>
<br>
_______________________________________________<br>
lustre-discuss mailing list<br>
lustre-discuss@lists.lustre.org<br>
<a href="http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org">http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</a><br>
</div>
</span></font></div>
</body>
</html>