[lustre-discuss] MDS crashes, lustre version 2.15.3

Aurelien Degremont adegremont at nvidia.com
Thu Nov 30 02:11:46 PST 2023


This is unlikely to be related. It looks like https://jira.whamcloud.com/browse/LU-16772, which is fixed in 2.16.0 and has a patch for 2.15.

Don't hesitate to search the JIRA website with your crash info to see whether you can find a corresponding bug.
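
The function names at the top of the stack (here qmt_id_lock_cb, qmt_glimpse_lock, cfs_hash_for_each_tight, qsd_start_reint_thread) are usually the best search terms. Below is a minimal sketch of how you could pull those out of vmcore-dmesg.txt and turn them into JIRA full-text queries; the log path, the regex and the JQL URL format are only assumptions, so adjust them for your setup:

#!/usr/bin/env python3
# Rough sketch: pull likely crash signatures (function names in the RIP line and
# the call trace) out of vmcore-dmesg.txt and print JIRA full-text search URLs.
# The log path and the JQL URL format are assumptions -- adjust for your setup.
import re
import urllib.parse

LOG = "vmcore-dmesg.txt"                      # hypothetical path
JIRA = "https://jira.whamcloud.com/issues/?jql="

symbols = []
with open(LOG) as f:
    for line in f:
        # Matches "RIP: 0010:qmt_id_lock_cb+0x69/0x100" as well as call-trace
        # frames such as " ? qmt_glimpse_lock.isra.20+0x1e7/0xfa0 [lquota]".
        m = re.search(r'(?:RIP: \S+:|\[<0>\]\s*|\]\s+\??\s*)([A-Za-z_][\w.]*)\+0x', line)
        if m and m.group(1) not in symbols:
            symbols.append(m.group(1))

for sym in symbols[:5]:                       # the top few frames are usually enough
    print(sym, "->", JIRA + urllib.parse.quote('text ~ "%s"' % sym))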

Aurélien

________________________________
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Lixin Liu via lustre-discuss <lustre-discuss at lists.lustre.org>
Sent: Wednesday, November 29, 2023 19:05
To: lustre-discuss <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] MDS crashes, lustre version 2.15.3



Hi Aurelien,



Thanks, I guess we will have to rebuild our own 2.15.x server. I see other crashes show a different dump, usually like this:



[36664.403408] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[36664.411237] PGD 0 P4D 0
[36664.413776] Oops: 0000 [#1] SMP PTI
[36664.417268] CPU: 28 PID: 11101 Comm: qmt_reba_cedar_ Kdump: loaded Tainted: G          IOE    --------- -  - 4.18.0-477.10.1.el8_lustre.x86_64 #1
[36664.430293] Hardware name: Dell Inc. PowerEdge R640/0CRT1G, BIOS 2.19.1 06/04/2023
[36664.437860] RIP: 0010:qmt_id_lock_cb+0x69/0x100 [lquota]
[36664.443199] Code: 48 8b 53 20 8b 4a 0c 85 c9 74 78 89 c1 48 8b 42 18 83 78 10 02 75 0a 83 e1 01 b8 01 00 00 00 74 17 48 63 44 24 04 48 c1 e0 04 <48> 03 45 00 f6 40 08 0c 0f 95 c0 0f b6 c0 48 8b 4c 24 08 65 48 33
[36664.461942] RSP: 0018:ffffaa2e303f3df0 EFLAGS: 00010246
[36664.467169] RAX: 0000000000000000 RBX: ffff98722c74b700 RCX: 0000000000000000
[36664.474301] RDX: ffff9880415ce660 RSI: 0000000000000010 RDI: ffff9881240b5c64
[36664.481435] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000004
[36664.488566] R10: 0000000000000010 R11: f000000000000000 R12: ffff98722c74b700
[36664.495697] R13: ffff9875fc07a320 R14: ffff9878444d3d10 R15: ffff9878444d3cc0
[36664.502832] FS:  0000000000000000(0000) GS:ffff987f20f80000(0000) knlGS:0000000000000000
[36664.510917] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[36664.516664] CR2: 0000000000000000 CR3: 0000002065a10004 CR4: 00000000007706e0
[36664.523794] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[36664.530927] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[36664.538058] PKRU: 55555554
[36664.540772] Call Trace:
[36664.543231]  ? cfs_cdebug_show.part.3.constprop.23+0x20/0x20 [lquota]
[36664.549699]  qmt_glimpse_lock.isra.20+0x1e7/0xfa0 [lquota]
[36664.555204]  qmt_reba_thread+0x5cd/0x9b0 [lquota]
[36664.559927]  ? qmt_glimpse_lock.isra.20+0xfa0/0xfa0 [lquota]
[36664.565602]  kthread+0x134/0x150
[36664.568834]  ? set_kthread_struct+0x50/0x50
[36664.573021]  ret_from_fork+0x1f/0x40
[36664.576603] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) mbcache jbd2 lustre(OE) lmv(OE) mdc(OE) lov(OE) osc(OE) fid(OE) fld(OE) ksocklnd(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE) 8021q garp mrp stp llc dell_rbu vfat fat dm_round_robin dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi opa_vnic scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dell_smbios iTCO_wdt iTCO_vendor_support wmi_bmof dell_wmi_descriptor dcdbas kvm ipmi_ssif irqbypass crct10dif_pclmul hfi1 mgag200 crc32_pclmul drm_shmem_helper ghash_clmulni_intel rdmavt qla2xxx drm_kms_helper rapl ib_uverbs nvme_fc intel_cstate syscopyarea nvme_fabrics sysfillrect sysimgblt nvme_core intel_uncore fb_sys_fops pcspkr acpi_ipmi ib_core scsi_transport_fc igb
[36664.576699]  drm ipmi_si i2c_algo_bit mei_me dca ipmi_devintf mei i2c_i801 lpc_ich wmi ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod t10_pi sg ahci libahci crc32c_intel libata megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[36664.684758] CR2: 0000000000000000



Is this also related to the same bug?



Thanks,



Lixin.



From: Aurelien Degremont <adegremont at nvidia.com>
Date: Wednesday, November 29, 2023 at 8:31 AM
To: lustre-discuss <lustre-discuss at lists.lustre.org>, Lixin Liu <liu at sfu.ca>
Subject: RE: MDS crashes, lustre version 2.15.3



You are likely hitting this bug: https://jira.whamcloud.com/browse/LU-15207, which is fixed in the (not yet released) 2.16.0.



Aurélien

________________________________

From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Lixin Liu via lustre-discuss <lustre-discuss at lists.lustre.org>
Sent: Wednesday, November 29, 2023 17:18
To: lustre-discuss <lustre-discuss at lists.lustre.org>
Subject: [lustre-discuss] MDS crashes, lustre version 2.15.3





Hi,

We built our 2.15.3 environment a few months ago. The MDT is using ldiskfs and the OSTs are using ZFS.
The system seemed to perform well at the beginning, but recently we have been seeing frequent MDS crashes.
The vmcore-dmesg.txt shows the following:

[26056.031259] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) ASSERTION( !cfs_hash_is_rehashing(hs) ) failed:
[26056.043494] LustreError: 69513:0:(hash.c:1469:cfs_hash_for_each_tight()) LBUG
[26056.051460] Pid: 69513, comm: lquota_wb_cedar 4.18.0-477.10.1.el8_lustre.x86_64 #1 SMP Tue Jun 20 00:12:13 UTC 2023
[26056.063099] Call Trace TBD:
[26056.066221] [<0>] libcfs_call_trace+0x6f/0xa0 [libcfs]
[26056.071970] [<0>] lbug_with_loc+0x3f/0x70 [libcfs]
[26056.077322] [<0>] cfs_hash_for_each_tight+0x301/0x310 [libcfs]
[26056.083839] [<0>] qsd_start_reint_thread+0x561/0xcc0 [lquota]
[26056.090265] [<0>] qsd_upd_thread+0xd43/0x1040 [lquota]
[26056.096008] [<0>] kthread+0x134/0x150
[26056.100098] [<0>] ret_from_fork+0x35/0x40
[26056.104575] Kernel panic - not syncing: LBUG
[26056.109337] CPU: 18 PID: 69513 Comm: lquota_wb_cedar Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-477.10.1.el8_lustre.x86_64 #1
[26056.123892] Hardware name:  /086D43, BIOS 2.17.0 03/15/2023
[26056.130108] Call Trace:
[26056.132833]  dump_stack+0x41/0x60
[26056.136532]  panic+0xe7/0x2ac
[26056.139843]  ? ret_from_fork+0x35/0x40
[26056.144022]  ? qsd_id_lock_cancel+0x2d0/0x2d0 [lquota]
[26056.149762]  lbug_with_loc.cold.8+0x18/0x18 [libcfs]
[26056.155306]  cfs_hash_for_each_tight+0x301/0x310 [libcfs]
[26056.161335]  ? wait_for_completion+0xb8/0x100
[26056.166196]  qsd_start_reint_thread+0x561/0xcc0 [lquota]
[26056.172128]  qsd_upd_thread+0xd43/0x1040 [lquota]
[26056.177381]  ? __schedule+0x2d9/0x870
[26056.181466]  ? qsd_bump_version+0x3b0/0x3b0 [lquota]
[26056.187010]  kthread+0x134/0x150
[26056.190608]  ? set_kthread_struct+0x50/0x50
[26056.195272]  ret_from_fork+0x35/0x40

We have also experienced unexpected OST drops (an OST changes to inactive mode) on login nodes, and the
only way to bring it back is to reboot the client.
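
(For completeness, here is a rough, untested sketch of the non-reboot check we would normally attempt first, assuming the usual "lctl dl" output columns and the documented "lctl --device <n> activate" command; so far only a reboot has reliably brought the OST back for us.)

#!/usr/bin/env python3
# Rough sketch (assumption, not verified on our system): find OSC devices that
# "lctl dl" reports in a non-UP state on the client and print the usual
# "lctl --device <n> activate" command for them.
import subprocess

out = subprocess.run(["lctl", "dl"], capture_output=True, text=True, check=True).stdout

for line in out.splitlines():
    fields = line.split()
    # Typical "lctl dl" columns: devno, state, type, name, uuid, refcount.
    if len(fields) >= 4 and fields[2] == "osc" and fields[1] != "UP":
        devno, state, _, name = fields[:4]
        print("device %s (%s) is %s; try: lctl --device %s activate"
              % (devno, name, state, devno))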

Any suggestions?

Thanks,

Lixin Liu
Simon Fraser University

_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

