[lustre-discuss] MDS crashes, lustre version 2.15.3

Aurelien Degremont adegremont at nvidia.com
Wed Nov 29 08:31:07 PST 2023


You are likely hitting this bug: https://jira.whamcloud.com/browse/LU-15207, which is fixed in the not-yet-released 2.16.0.
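
If you want to double-check which code the MDS is actually running, the version can be read directly on the server (a minimal sketch, assuming a standard RPM-based server install; exact package names and proc/sys paths can vary between releases):

    # Report the Lustre version known to lctl on the MDS
    lctl get_param version
    # The same information is exposed via sysfs on 2.x servers
    cat /sys/fs/lustre/version
    # On RPM-based installs, list the installed Lustre server packages
    rpm -qa | grep -i lustre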

Aurélien
________________________________
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Lixin Liu via lustre-discuss <lustre-discuss at lists.lustre.org>
Sent: Wednesday, November 29, 2023 17:18
To: lustre-discuss <lustre-discuss at lists.lustre.org>
Subject: [lustre-discuss] MDS crashes, lustre version 2.15.3

Hi,

We built our 2.15.3 environment a few months ago. The MDT is using ldiskfs and the OSTs are using ZFS.
The system seemed to perform well at the beginning, but recently we have been seeing frequent MDS crashes.
The vmcore-dmesg.txt shows the following:

[26056.031259] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) ASSERTION( !cfs_hash_is_rehashing(hs) ) failed:
[26056.043494] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) LBUG
[26056.051460] Pid: 69513, comm: lquota_wb_cedar 4.18.0-477.10.1.el8_lustre.x86_64 #1 SMP Tue Jun 20 00:12:13 UTC 2023
[26056.063099] Call Trace TBD:
[26056.066221] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[26056.071970] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[26056.077322] [<0>] cfs_hash_for_each_tight+0x301/0x310 [libcfs]
[26056.083839] [<0>] qsd_start_reint_thread+0x561/0xcc0 [lquota]
[26056.090265] [<0>] qsd_upd_thread+0xd43/0x1040 [lquota]
[26056.096008] [<0>] kthread+0x134/0x150
[26056.100098] [<0>] ret_from_fork+0x35/0x40
[26056.104575] Kernel panic - not syncing: LBUG
[26056.109337] CPU: 18 PID: 69513 Comm: lquota_wb_cedar Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-477.10.1.el8_lustre.x86_64 #1
[26056.123892] Hardware name:  /086D43, BIOS 2.17.0 03/15/2023
[26056.130108] Call Trace:
[26056.132833]  dump_stack+0x41/0x60
[26056.136532]  panic+0xe7/0x2ac
[26056.139843]  ? ret_from_fork+0x35/0x40
[26056.144022]  ? qsd_id_lock_cancel+0x2d0/0x2d0 [lquota]
[26056.149762]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[26056.155306]  cfs_hash_for_each_tight+0x301/0x310 [libcfs]
[26056.161335]  ? wait_for_completion+0xb8/0x100
[26056.166196]  qsd_start_reint_thread+0x561/0xcc0 [lquota]
[26056.172128]  qsd_upd_thread+0xd43/0x1040 [lquota]
[26056.177381]  ? __schedule+0x2d9/0x870
[26056.181466]  ? qsd_bump_version+0x3b0/0x3b0 [lquota]
[26056.187010]  kthread+0x134/0x150
[26056.190608]  ? set_kthread_struct+0x50/0x50
[26056.195272]  ret_from_fork+0x35/0x40

We also experienced unexpected OST drops (OSTs switching to inactive) on login nodes, and the only
way we have found to bring them back is to reboot the client.
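
For reference, this is the kind of check and reactivation we would expect to work from the client (a minimal sketch using the standard lctl parameters; the filesystem name and OST index below are placeholders, not our actual target names), but in our case only a reboot restores the OST:

    # List client-side devices and their state
    lctl dl
    # Show which OSCs this client currently treats as active (1) or inactive (0)
    lctl get_param osc.*.active
    # Try to reactivate a single OST for this client only
    # (placeholder target name; substitute the real fsname and OST index)
    lctl set_param osc.lustrefs-OST0000-osc-*.active=1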

Any suggestions?

Thanks,

Lixin Liu
Simon Fraser University

_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org