[lustre-discuss] Repeated ZFS panics on MDT

Mountford, Christopher J. (Dr.) cjm14 at leicester.ac.uk
Wed Mar 15 10:55:50 PDT 2023


I'm hoping someone can offer some suggestions.

We have a problem on our production Lustre/ZFS filesystem (CentOS 7, ZFS 0.7.13, Lustre 2.12.9), and so far I've drawn a blank trying to track down the cause.

We see the following ZFS panic message in the logs (in every case the VERIFY3/PANIC lines are identical):


Mar 15 17:15:39 amds01a kernel: VERIFY3(sa.sa_magic == 0x2F505A) failed (8 == 3100762)
Mar 15 17:15:39 amds01a kernel: PANIC at zfs_vfsops.c:584:zfs_space_delta_cb()
Mar 15 17:15:39 amds01a kernel: Showing stack for process 15381
Mar 15 17:15:39 amds01a kernel: CPU: 31 PID: 15381 Comm: mdt00_020 Tainted: P           OE  ------------   3.10.0-1160.49.1.el7_lustre.x86_64 #1
Mar 15 17:15:39 amds01a kernel: Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 02/09/2023
Mar 15 17:15:39 amds01a kernel: Call Trace:
Mar 15 17:15:39 amds01a kernel: [<ffffffff99d83539>] dump_stack+0x19/0x1b
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0b76f24>] spl_dumpstack+0x44/0x50 [spl]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0b76ff9>] spl_panic+0xc9/0x110 [spl]
Mar 15 17:15:39 amds01a kernel: [<ffffffff996e482c>] ? update_curr+0x14c/0x1e0
Mar 15 17:15:39 amds01a kernel: [<ffffffff99707cf4>] ? getrawmonotonic64+0x34/0xc0
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c87aa3>] ? dmu_zfetch+0x393/0x520 [zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c6a073>] ? dbuf_rele_and_unlock+0x283/0x4c0 [zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0b78ff1>] ? __cv_init+0x41/0x60 [spl]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0d0f53c>] zfs_space_delta_cb+0x9c/0x200 [zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c7a944>] dmu_objset_userquota_get_ids+0x154/0x440 [zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c89e98>] dnode_setdirty+0x38/0xf0 [zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c8a21c>] dnode_allocate+0x18c/0x230 [zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc0c76d2b>] dmu_object_alloc_dnsize+0x34b/0x3e0 [zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d73052>] __osd_object_create+0x82/0x170 [osd_zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d7ce23>] ? osd_declare_xattr_set+0xb3/0x190 [osd_zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d733bd>] osd_mkreg+0x7d/0x210 [osd_zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffff99828f01>] ? __kmalloc_node+0x1d1/0x2b0
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d6f8f6>] osd_create+0x336/0xb10 [osd_zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc2016fb5>] lod_sub_create+0x1f5/0x480 [lod]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc2007729>] lod_create+0x69/0x340 [lod]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1d65690>] ? osd_trans_create+0x410/0x410 [osd_zfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc2081993>] mdd_create_object_internal+0xc3/0x300 [mdd]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc206aa4b>] mdd_create_object+0x7b/0x820 [mdd]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc2074fd8>] mdd_create+0xdd8/0x14a0 [mdd]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1f0e118>] mdt_reint_open+0x2588/0x3970 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc16f82b9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1eede52>] ? ucred_set_audit_enabled.isra.15+0x22/0x60 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1f00f23>] mdt_reint_rec+0x83/0x210 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1edc413>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee8ec6>] ? mdt_intent_fixup_resent+0x36/0x220 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee9132>] mdt_intent_open+0x82/0x3a0 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1edf74a>] mdt_intent_opc+0x1ba/0xb50 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a0d6c0>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee90b0>] ? mdt_intent_fixup_resent+0x220/0x220 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1ee79e4>] mdt_intent_policy+0x1a4/0x360 [mdt]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc19bc4e6>] ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc10a22b7>] ? cfs_hash_bd_add_locked+0x67/0x90 [libcfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc10a5a4e>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc19e3aa6>] ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a0d740>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a6d092>] tgt_enqueue+0x62/0x210 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a73eea>] tgt_request_handle+0xada/0x1570 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a4d601>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1096bde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a18bcb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a156e5>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffff99d7dcf3>] ? queued_spin_lock_slowpath+0xb/0xf
Mar 15 17:15:39 amds01a kernel: [<ffffffff99d8baa0>] ? _raw_spin_lock+0x20/0x30
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a1c534>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffffc1a1ba00>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
Mar 15 17:15:39 amds01a kernel: [<ffffffff996c5e61>] kthread+0xd1/0xe0
Mar 15 17:15:39 amds01a kernel: [<ffffffff996c5d90>] ? insert_kthread_work+0x40/0x40
Mar 15 17:15:39 amds01a kernel: [<ffffffff99d95ddd>] ret_from_fork_nospec_begin+0x7/0x21
Mar 15 17:15:39 amds01a kernel: [<ffffffff996c5d90>] ? insert_kthread_work+0x40/0x40
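
For context, as far as I can tell the failed assertion is ZFS checking that the system attribute (SA) header in the object's bonus buffer carries the expected magic number before it updates user/group space accounting. 0x2F505A is SA_MAGIC (3100762 decimal), so the "(8 == 3100762)" means the bonus buffer of the object being dirtied doesn't look like a valid SA header at all (its magic field reads 8). My rough paraphrase of the check from the 0.7.x source (zfs_space_delta_cb() in zfs_vfsops.c - simplified for illustration, not the exact upstream code):

    sa_hdr_phys_t sa = *(sa_hdr_phys_t *)data;  /* SA header read from the bonus buffer */

    if (sa.sa_magic == 0)                       /* newly created file, ids not filled in yet */
            return (0);
    if (sa.sa_magic == BSWAP_32(SA_MAGIC)) {    /* byte-swapped SA header */
            sa.sa_magic = SA_MAGIC;
            /* ... byteswap the layout info ... */
    } else {
            /* SA_MAGIC is 0x2F505A (3100762); in our case sa.sa_magic was 8 */
            VERIFY3U(sa.sa_magic, ==, SA_MAGIC);
    }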

At this point all ZFS I/O freezes completely and the MDS has to be fenced. This has happened ~4 times in the last hour. 
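
In case it's relevant context for why everything hangs rather than the node just crashing: my reading of the SPL source is that spl_panic() only halts the asserting thread unless the spl_panic_halt module parameter is set, so the mdt00_020 thread above sits there forever still holding its locks, and all subsequent ZFS I/O backs up behind it until we fence the node. Paraphrased from spl_panic() in the 0.7.x SPL (simplified, not the exact upstream code):

    printk(KERN_EMERG "PANIC at %s:%d:%s()\n", file, line, func);
    if (spl_panic_halt)                 /* module parameter, default 0 */
            panic("%s", msg);           /* would take the whole node down instead */

    spl_dumpstack();                    /* the "Showing stack for process" output above */

    /* Halt just this thread (still holding its locks) to aid debugging */
    set_current_state(TASK_UNINTERRUPTIBLE);
    while (1)
            schedule();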

I'm at a loss as to how to correct this - I'm currently thinking that we may have to rebuild and recover our entire filesystem from backups (thankfully this is our home filesystem, which is small and entirely SSD-based, so it should not take too long to recover).

This may be related to the following bug seen on FreeBSD (with a much more recent ZFS version): https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=216586

The problem was first seen 3 weeks ago, but went away after a couple of reboots. This time it seems to be more serious.

Kind Regards,
Christopher.

------------------------------------
Dr. Christopher Mountford,
System Specialist,
RCS,
Digital Services,
University Of Leicester.

