[lustre-discuss] MDS crashes, lustre version 2.15.3

Lixin Liu liu at sfu.ca
Wed Nov 29 08:18:56 PST 2023


Hi,

We built our 2.15.3 environment a few months ago. MDT is using ldiskfs and OSTs are using ZFS.
The system performed well at first, but recently we have been seeing frequent MDS crashes.
The vmcore-dmesg.txt shows the following:

[26056.031259] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) ASSERTION( !cfs_hash_is_rehashing(hs) ) failed:
[26056.043494] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) LBUG
[26056.051460] Pid: 69513, comm: lquota_wb_cedar 4.18.0-477.10.1.el8_lustre.x86_64 #1 SMP Tue Jun 20 00:12:13 UTC 2023
[26056.063099] Call Trace TBD:
[26056.066221] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[26056.071970] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[26056.077322] [<0>] cfs_hash_for_each_tight+0x301/0x310 [libcfs]
[26056.083839] [<0>] qsd_start_reint_thread+0x561/0xcc0 [lquota]
[26056.090265] [<0>] qsd_upd_thread+0xd43/0x1040 [lquota]
[26056.096008] [<0>] kthread+0x134/0x150
[26056.100098] [<0>] ret_from_fork+0x35/0x40
[26056.104575] Kernel panic - not syncing: LBUG
[26056.109337] CPU: 18 PID: 69513 Comm: lquota_wb_cedar Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-477.10.1.el8_lustre.x86_64 #1
[26056.123892] Hardware name:  /086D43, BIOS 2.17.0 03/15/2023
[26056.130108] Call Trace:
[26056.132833]  dump_stack+0x41/0x60
[26056.136532]  panic+0xe7/0x2ac
[26056.139843]  ? ret_from_fork+0x35/0x40
[26056.144022]  ? qsd_id_lock_cancel+0x2d0/0x2d0 [lquota]
[26056.149762]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[26056.155306]  cfs_hash_for_each_tight+0x301/0x310 [libcfs]
[26056.161335]  ? wait_for_completion+0xb8/0x100
[26056.166196]  qsd_start_reint_thread+0x561/0xcc0 [lquota]
[26056.172128]  qsd_upd_thread+0xd43/0x1040 [lquota]
[26056.177381]  ? __schedule+0x2d9/0x870
[26056.181466]  ? qsd_bump_version+0x3b0/0x3b0 [lquota]
[26056.187010]  kthread+0x134/0x150
[26056.190608]  ? set_kthread_struct+0x50/0x50
[26056.195272]  ret_from_fork+0x35/0x40

We have also experienced unexpected OST drops (the OST changes to inactive mode) on login nodes, and
the only way we have found to bring an OST back is to reboot the client.
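(For what it's worth, a minimal sketch of what we try before rebooting, in case reactivating the OSC on the client is enough; the OST index shown is only an example, not our actual device:)

```shell
#!/bin/sh
# Hypothetical client-side recovery sketch for an OST stuck inactive.
# Assumes a Lustre client with lctl installed; the OST index is an example.
if command -v lctl >/dev/null 2>&1; then
    # List Lustre devices and their state (UP/IN) on this client
    lctl dl
    # Try re-enabling the OSC for the inactive OST (wildcarded example index)
    lctl set_param osc.*OST0003*.active=1
else
    echo "lctl not found; run this on a Lustre client"
fi
```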

Any suggestions?

Thanks,

Lixin Liu
Simon Fraser University
