[lustre-discuss] MDT hanging

Simon Guilbault simon.guilbault at calculquebec.ca
Tue Mar 9 10:54:41 PST 2021


Hi,
One failure that the ZFS pacemaker resource does not seem to pick up is
when MMP writes fail due to a problem with the SAS bus. We added the short
script below, running as a systemd daemon, to do a failover when this
happens. The other check in the script uses NHC, mostly to verify that
the IB port is up.
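
For the IB port check, NHC ships a built-in check_hw_ib test that verifies
the IB link is up at the expected rate. A minimal /etc/nhc/nhc.conf entry
would look something like the sketch below; the 56 Gb/s rate is only an
example (FDR), adjust it for your fabric:

  # all hosts: fail the check if the IB link is not ACTIVE at 56 Gb/s
  * || check_hw_ib 56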

If either of those two checks fails, the node is put in standby so the
resource is umounted cleanly instead of being power cycled right away; if
the umount fails, pacemaker will issue a STONITH after the umount timeout.

#!/bin/bash
while true; do
  if /usr/sbin/crm_mon -1 | grep Online | grep $HOSTNAME > /dev/null ; then
    # node is online, do the checks
    # check the status of NHC
    if ! /usr/bin/nice -n -5 /usr/sbin/nhc -t 600; then
      echo "NHC failed, failover to the other node"
      /usr/sbin/pcs node standby $HOSTNAME
    fi
    # check if the MMP of ZFS is not stalled
    if /usr/sbin/zpool status | grep "The pool is suspended because multihost writes failed or were delayed" ; then
      echo "ZFS could not write the MMP, failover to the other node"
      /usr/sbin/pcs node standby $HOSTNAME
    fi
  fi
  sleep 60
done
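
To run the loop as a systemd daemon, a small unit file is enough. A sketch,
assuming the script above is installed as /usr/local/sbin/ha-extra-checks.sh
(the path and unit name are just placeholders):

  [Unit]
  # start the extra checks only once pacemaker is up
  Description=Extra HA checks (NHC and ZFS MMP) for Lustre failover
  After=pacemaker.service

  [Service]
  Type=simple
  # placeholder path for the check loop above
  ExecStart=/usr/local/sbin/ha-extra-checks.sh
  Restart=always
  RestartSec=10

  [Install]
  WantedBy=multi-user.target

Enable it with "systemctl enable --now ha-extra-checks.service" on both
nodes of the HA pair.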


On Tue, Mar 9, 2021 at 1:15 PM Christopher Mountford via lustre-discuss <
lustre-discuss at lists.lustre.org> wrote:

> Hi,
>
> We've had a couple of MDT hangs on 2 of our lustre filesystems after
> updating to 2.12.6 (though I'm sure I've seen this exact behaviour on
> previous versions).
>
> The symptoms are a gradually increasing load on the affected MDS and
> processes doing I/O on the filesystem blocking indefinitely, with messages
> on the client similar to:
>
> Mar  9 15:37:22 spectre09 kernel: Lustre: 25309:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1615303641/real 1615303641]  req@ffff972dbe51bf00 x1692620480891456/t0(0) o44->ahome3-MDT0001-mdc-ffff9718e3be0000@10.143.254.212@o2ib:12/10 lens 448/440 e 2 to 1 dl 1615304242 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
> Mar  9 15:37:22 spectre09 kernel: Lustre: ahome3-MDT0001-mdc-ffff9718e3be0000: Connection to ahome3-MDT0001 (at 10.143.254.212@o2ib) was lost; in progress operations using this service will wait for recovery to complete
> Mar  9 15:37:22 spectre09 kernel: Lustre: ahome3-MDT0001-mdc-ffff9718e3be0000: Connection restored to 10.143.254.212@o2ib (at 10.143.254.212@o2ib)
>
> There were also warnings of hung mdt_io tasks on the MDS, and Lustre debug
> logs were dumped to /tmp.
>
> Rebooting the affected MDS cleared the problem and everything recovered.
>
>
>
> Looking at the MDS system logs, the first sign of trouble appears to be:
>
> Mar  9 15:24:11 amds01b kernel: VERIFY3(dr->dr_dbuf->db_level == level)
> failed (0 == 18446744073709551615)
> Mar  9 15:24:11 amds01b kernel: PANIC at dbuf.c:3391:dbuf_sync_list()
> Mar  9 15:24:11 amds01b kernel: Showing stack for process 18137
> Mar  9 15:24:11 amds01b kernel: CPU: 3 PID: 18137 Comm: dp_sync_taskq
> Tainted: P           OE  ------------   3.10.0-1160.2.1.el7_lustre.x86_64 #1
> Mar  9 15:24:11 amds01b kernel: Hardware name: HPE ProLiant DL360
> Gen10/ProLiant DL360 Gen10, BIOS U32 07/16/2020
> Mar  9 15:24:11 amds01b kernel: Call Trace:
> Mar  9 15:24:11 amds01b kernel: [<ffffffff9af813c0>] dump_stack+0x19/0x1b
> Mar  9 15:24:11 amds01b kernel: [<ffffffffc0979f24>]
> spl_dumpstack+0x44/0x50 [spl]
> Mar  9 15:24:11 amds01b kernel: [<ffffffffc0979ff9>] spl_panic+0xc9/0x110
> [spl]
> Mar  9 15:24:11 amds01b kernel: [<ffffffff9a96b075>] ?
> tracing_is_on+0x15/0x30
> Mar  9 15:24:11 amds01b kernel: [<ffffffff9a96ed4d>] ?
> tracing_record_cmdline+0x1d/0x120
> Mar  9 15:24:11 amds01b kernel: [<ffffffffc0974fc5>] ?
> spl_kmem_free+0x35/0x40 [spl]
> Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8e43cc>] ?
> update_curr+0x14c/0x1e0
> Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8e111e>] ?
> account_entity_dequeue+0xae/0xd0
> Mar  9 15:24:11 amds01b kernel: [<ffffffffc0a7014b>]
> dbuf_sync_list+0x7b/0xd0 [zfs]
> Mar  9 15:24:11 amds01b kernel: [<ffffffffc0a8f4f0>]
> dnode_sync+0x370/0x890 [zfs]
> Mar  9 15:24:11 amds01b kernel: [<ffffffffc0a7b1d1>]
> sync_dnodes_task+0x61/0x150 [zfs]
> Mar  9 15:24:11 amds01b kernel: [<ffffffffc0977d7c>]
> taskq_thread+0x2ac/0x4f0 [spl]
> Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8daaf0>] ?
> wake_up_state+0x20/0x20
> Mar  9 15:24:11 amds01b kernel: [<ffffffffc0977ad0>] ?
> taskq_thread_spawn+0x60/0x60 [spl]
> Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8c5c21>] kthread+0xd1/0xe0
> Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8c5b50>] ?
> insert_kthread_work+0x40/0x40
> Mar  9 15:24:11 amds01b kernel: [<ffffffff9af93ddd>]
> ret_from_fork_nospec_begin+0x7/0x21
> Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8c5b50>] ?
> insert_kthread_work+0x40/0x40
>
>
>
>
> My read of this is that ZFS failed whilst syncing cached data out to disk
> and panicked (I guess this panic is internal to ZFS as the system remained
> up and otherwise responsive - no kernel panic triggered). Does this seem
> correct?
>
> The pacemaker ZFS resource did not pick up the failure; it relies on
> 'zpool list -H -o health'. Can anyone think of a way to detect this sort
> of problem and trigger an automated reset of the affected server?
> Unfortunately I'd rebooted the server before I spotted the log entry. Next
> time I'll run some zfs commands to see what they return before rebooting.
>
> Any advice on what additional steps to take? I guess this is probably more
> a ZFS rather than Lustre issue.
>
> The MDSes are based on HPE DL360s connected to D3700 JBODs, with the MDTs
> on ZFS. CentOS 7.9, ZFS 0.7.13, Lustre 2.12.6, kernel
> 3.10.0-1160.2.1.el7_lustre.x86_64.
>
> Kind Regards,
> Christopher.
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>

