[lustre-discuss] MDT hanging

Christopher Mountford cjm14 at leicester.ac.uk
Tue Mar 9 10:15:39 PST 2021


Hi,

We've had a couple of MDT hangs on two of our Lustre filesystems after updating to 2.12.6 (though I'm sure I've seen this exact behaviour on previous versions).

The symptoms are a gradually increasing load on the affected MDS and processes doing I/O on the filesystem blocking indefinitely, with messages on the clients similar to:

Mar  9 15:37:22 spectre09 kernel: Lustre: 25309:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1615303641/real 1615303641]  req@ffff972dbe51bf00 x1692620480891456/t0(0) o44->ahome3-MDT0001-mdc-ffff9718e3be0000@10.143.254.212@o2ib:12/10 lens 448/440 e 2 to 1 dl 1615304242 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
Mar  9 15:37:22 spectre09 kernel: Lustre: ahome3-MDT0001-mdc-ffff9718e3be0000: Connection to ahome3-MDT0001 (at 10.143.254.212@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Mar  9 15:37:22 spectre09 kernel: Lustre: ahome3-MDT0001-mdc-ffff9718e3be0000: Connection restored to 10.143.254.212@o2ib (at 10.143.254.212@o2ib)

We also see warnings of hung mdt_io tasks on the MDS, and Lustre debug logs are dumped to /tmp.

Rebooting the affected MDS cleared the problem and everything recovered.



Looking at the MDS system logs, the first sign of trouble appears to be:

Mar  9 15:24:11 amds01b kernel: VERIFY3(dr->dr_dbuf->db_level == level) failed (0 == 18446744073709551615)
Mar  9 15:24:11 amds01b kernel: PANIC at dbuf.c:3391:dbuf_sync_list()
Mar  9 15:24:11 amds01b kernel: Showing stack for process 18137
Mar  9 15:24:11 amds01b kernel: CPU: 3 PID: 18137 Comm: dp_sync_taskq Tainted: P           OE  ------------   3.10.0-1160.2.1.el7_lustre.x86_64 #1
Mar  9 15:24:11 amds01b kernel: Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 07/16/2020
Mar  9 15:24:11 amds01b kernel: Call Trace:
Mar  9 15:24:11 amds01b kernel: [<ffffffff9af813c0>] dump_stack+0x19/0x1b
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0979f24>] spl_dumpstack+0x44/0x50 [spl]
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0979ff9>] spl_panic+0xc9/0x110 [spl]
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a96b075>] ? tracing_is_on+0x15/0x30
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a96ed4d>] ? tracing_record_cmdline+0x1d/0x120
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0974fc5>] ? spl_kmem_free+0x35/0x40 [spl]
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8e43cc>] ? update_curr+0x14c/0x1e0
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8e111e>] ? account_entity_dequeue+0xae/0xd0
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0a7014b>] dbuf_sync_list+0x7b/0xd0 [zfs]
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0a8f4f0>] dnode_sync+0x370/0x890 [zfs]
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0a7b1d1>] sync_dnodes_task+0x61/0x150 [zfs]
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0977d7c>] taskq_thread+0x2ac/0x4f0 [spl]
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8daaf0>] ? wake_up_state+0x20/0x20
Mar  9 15:24:11 amds01b kernel: [<ffffffffc0977ad0>] ? taskq_thread_spawn+0x60/0x60 [spl]
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8c5c21>] kthread+0xd1/0xe0
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8c5b50>] ? insert_kthread_work+0x40/0x40
Mar  9 15:24:11 amds01b kernel: [<ffffffff9af93ddd>] ret_from_fork_nospec_begin+0x7/0x21
Mar  9 15:24:11 amds01b kernel: [<ffffffff9a8c5b50>] ? insert_kthread_work+0x40/0x40




My read of this is that ZFS hit an internal assertion (VERIFY3) while syncing cached data out to disk and panicked; note that 18446744073709551615 is (uint64_t)-1, so dr_dbuf->db_level was read as -1 rather than the expected 0. I guess this panic is internal to ZFS/SPL, as the system remained up and otherwise responsive (no kernel panic was triggered). Does this seem correct?

The pacemaker ZFS resource agent did not pick up the failure; it relies on 'zpool list -H -o health', which apparently still reports the pool as healthy after this kind of panic. Can anyone think of a way to detect this sort of problem so we can trigger an automated reset of the affected server? Unfortunately I'd rebooted the server before I spotted the log entry; next time I'll run some zfs commands to see what they return before rebooting.
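For what it's worth, one stopgap I've been considering is simply grepping the kernel log for the SPL PANIC/VERIFY markers shown above, since those appear even when 'zpool list' still reports ONLINE. A minimal sketch (the regex, log path, and the fencing action are all assumptions to adapt, not a tested agent):

```shell
# Sketch: detect an SPL/ZFS internal panic that the pacemaker ZFS agent
# misses. Scans a kernel log file for the PANIC/VERIFY lines seen above.
check_zfs_panic() {
    # $1: log file to scan, e.g. /var/log/messages or saved dmesg output
    if grep -Eq 'PANIC at [a-z_]+\.c:[0-9]+|VERIFY3?\(.+\) failed' "$1"; then
        echo "zfs-panic-detected"
        # Here a real agent could fence the node, e.g. via crm_resource
        # or stonith_admin (hypothetical action, site-specific).
        return 1
    fi
    echo "ok"
    return 0
}
```

This could be wired into a cron job or a custom pacemaker monitor action; the main caveat is that it only fires after the panic has already been logged, so it detects rather than prevents the hang.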

Any advice on what additional steps to take? I guess this is probably more a ZFS issue than a Lustre one.

The MDS nodes are HPE DL360s connected to D3700 JBODs, with the MDTs on ZFS. Software stack: CentOS 7.9, ZFS 0.7.13, Lustre 2.12.6, kernel 3.10.0-1160.2.1.el7_lustre.x86_64.

Kind Regards,
Christopher.

