[lustre-discuss] MDT crashes from assert when an OST mounts
Makia Minich
makia@systemfabricworks.com
Tue Aug 20 12:47:55 PDT 2024
Just to follow up on this thread: upgrading to Lustre 2.12.9 appears to resolve this issue.
> On Aug 20, 2024, at 7:37 AM, Makia Minich <makia@systemfabricworks.com> wrote:
>
> Wondering if others may have seen something similar or know of a remedy.
>
> Late last week we had a room lose power, which meant the filesystem took a hard crash. When power was restored, it looked like the JBODs made it through and all of the LUNs appeared healthy (after a bit of rebuilding). The servers could also see the LUNs, so things looked like they were going better than anticipated.
>
> The system (both servers and clients) is CentOS 7.9 with Lustre 2.12.7.
>
> Bringing up the filesystem is when things went sideways. The MGT mounted with no issue (standard recovery messages), and the MDT also mounted. As we proceeded to mount the OSTs, the MDS suddenly rebooted with a kernel panic. Looking at dmesg (after it was brought back up), we found the following messages:
>
> [ 6867.143694] Lustre: 79890:0:(llog.c:615:llog_process_thread()) lustre01-OST0032-osc-MDT0000: invalid length 0 in llog [0x52ab:0x1:0x0]record for index 0/2
> [ 6867.143705] Lustre: 79890:0:(llog.c:615:llog_process_thread()) Skipped 1 previous similar message
> [ 6867.143720] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) lustre01-OST0032-osc-MDT0000: llog process with osp_sync_process_queues failed: -22
> [ 6867.148800] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) Skipped 1 previous similar message
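>
> (The FID in brackets names the OSP sync llog that osp_sync_thread is choking on, and -22 is EINVAL from the zero-length record. As a rough sketch -- lctl llog syntax varies a bit by version, so check lctl --help -- the suspect llog can be inspected from the MDS:)
>
> # hedged sketch; the FID [0x52ab:0x1:0x0] is copied from the error above
> lctl llog_info '[0x52ab:0x1:0x0]'   # header and record count for the suspect llog
> lctl llog_print '[0x52ab:0x1:0x0]'  # dump records; the invalid one should stand out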
>
> After a few attempts (hoping it was a fluke), the same message would lead to an assert, and we noticed this occurred with two specific OSTs. Leaving those two OSTs down, we were able to bring up the rest of the filesystem successfully, but when either of them is mounted, something is triggered and the MDT crashes. On the OSS side, there are no messages other than losing the connection to the MGS (due to the crash).
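>
> As a sketch (an assumption on our side, not something we verified end-to-end), the MDT can be kept from touching the two bad OSTs by deactivating their OSP devices on the MDS, using the device names from the errors above:
>
> # hedged sketch; run on the MDS, device names come from the log messages above
> lctl --device lustre01-OST0032-osc-MDT0000 deactivate
> lctl --device lustre01-OST0033-osc-MDT0000 deactivate
> lctl dl | grep -E 'OST003[23]'      # ST column should now read IN (inactive)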
>
> We've tried clearing the updatelog and changelog with no change in behavior, so any other ideas would be appreciated.
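>
> (For reference, the changelog side of that cleanup was along these lines; a sketch, with cl1 standing in for whatever user id get_param actually reports:)
>
> # hedged sketch of the changelog cleanup; "cl1" is a placeholder user id
> lctl get_param mdd.lustre01-MDT0000.changelog_users      # list registered changelog users
> lfs changelog_clear lustre01-MDT0000 cl1 0               # endrec 0 = clear all records for that user
> lctl --device lustre01-MDT0000 changelog_deregister cl1  # then drop the user entirely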
>
> Below is the full dmesg from the start of mounting the MGT:
>
> [ 4881.624345] LDISKFS-fs (scinia): mounted filesystem with ordered data mode. Opts: (null)
> [ 6844.490777] LDISKFS-fs (scinib): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
> [ 6845.003014] Lustre: MGS: Connection restored to MGC192.168.240.7@tcp1_0 (at 0@lo)
> [ 6845.003021] Lustre: Skipped 1 previous similar message
> [ 6853.385804] Lustre: MGS: Connection restored to b22a0a27-e8e5-a57b-534e-d5f9571b6e9f (at 192.168.9.30@tcp4)
> [ 6865.882492] LDISKFS-fs (scinia): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc
> [ 6867.143694] Lustre: 79890:0:(llog.c:615:llog_process_thread()) lustre01-OST0032-osc-MDT0000: invalid length 0 in llog [0x52ab:0x1:0x0]record for index 0/2
> [ 6867.143705] Lustre: 79890:0:(llog.c:615:llog_process_thread()) Skipped 1 previous similar message
> [ 6867.143720] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) lustre01-OST0032-osc-MDT0000: llog process with osp_sync_process_queues failed: -22
> [ 6867.148800] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) Skipped 1 previous similar message
> [ 6867.221923] Lustre: lustre01-MDT0000: Imperative Recovery not enabled, recovery window 300-900
> [ 6867.234362] Lustre: lustre01-MDT0000: in recovery but waiting for the first client to connect
> [ 6872.207528] Lustre: lustre01-MDT0000: Connection restored to MGC192.168.240.7@tcp1_0 (at 0@lo)
> [ 6872.207536] Lustre: Skipped 1 previous similar message
> [ 6902.340582] Lustre: lustre01-MDT0000: Will be in recovery for at least 5:00, or until 7 clients reconnect
> [ 6908.270425] Lustre: lustre01-MDT0000: Connection restored to b22a0a27-e8e5-a57b-534e-d5f9571b6e9f (at 192.168.249.30@tcp2)
> [ 6908.270429] Lustre: Skipped 4 previous similar messages
> [ 6908.446460] Lustre: lustre01-MDT0000: Recovery over after 0:06, of 7 clients 7 recovered and 0 were evicted.
> [ 6977.979707] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
> [ 6984.953509] Lustre: MGS: Connection restored to 83b8e6bc-7407-a532-4b8e-0ae1a4982885 (at 192.168.240.8@tcp1)
> [ 6984.953517] Lustre: Skipped 2 previous similar messages
> [ 7115.328345] Lustre: MGS: Connection restored to 1ad84e77-29b8-8d86-73e4-7dcd263c303b (at 192.168.240.9@tcp1)
> [ 7115.328352] Lustre: Skipped 16 previous similar messages
> [ 7201.690060] Lustre: 79892:0:(llog.c:615:llog_process_thread()) lustre01-OST0033-osc-MDT0000: invalid length 0 in llog [0x52ad:0x1:0x0]record for index 0/1
> [ 7201.690069] Lustre: 79892:0:(llog.c:615:llog_process_thread()) Skipped 1 previous similar message
> [ 7201.690086] LustreError: 79892:0:(osp_sync.c:1272:osp_sync_thread()) lustre01-OST0033-osc-MDT0000: llog process with osp_sync_process_queues failed: -22
> [ 7201.695902] LustreError: 79892:0:(osp_sync.c:1317:osp_sync_thread()) ASSERTION( atomic_read(&d->opd_sync_rpcs_in_progress) == 0 ) failed: lustre01-OST0033-osc-MDT0000: 1 0 !empty
> [ 7201.701242] LustreError: 79892:0:(osp_sync.c:1317:osp_sync_thread()) LBUG
> [ 7201.703862] Pid: 79892, comm: osp-syn-51-0 3.10.0-1160.21.1.el7.x86_64 #1 SMP Tue Mar 16 18:28:22 UTC 2021
> [ 7201.703865] Call Trace:
> [ 7201.703877] [<ffffffffc0f007cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
> [ 7201.703896] [<ffffffffc0f0087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
> [ 7201.703909] [<ffffffffc1c8e5c8>] osp_sync_thread+0xb78/0xb80 [osp]
> [ 7201.703926] [<ffffffffbd2c5da1>] kthread+0xd1/0xe0
> [ 7201.703937] [<ffffffffbd995df7>] ret_from_fork_nospec_end+0x0/0x39
> [ 7201.703945] [<ffffffffffffffff>] 0xffffffffffffffff
> [ 7201.703984] Kernel panic - not syncing: LBUG
> [ 7201.706561] CPU: 37 PID: 79892 Comm: osp-syn-51-0 Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.21.1.el7.x86_64 #1
> [ 7201.711716] Hardware name: Dell Inc. VxFlex integrated rack R640 S/0H28RR, BIOS 2.9.4 11/06/2020
> [ 7201.714311] Call Trace:
> [ 7201.716865] [<ffffffffbd98305a>] dump_stack+0x19/0x1b
> [ 7201.719418] [<ffffffffbd97c5b2>] panic+0xe8/0x21f
> [ 7201.721938] [<ffffffffc0f008cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
> [ 7201.724425] [<ffffffffc1c8e5c8>] osp_sync_thread+0xb78/0xb80 [osp]
> [ 7201.726873] [<ffffffffbd98899f>] ? __schedule+0x3af/0x860
> [ 7201.729286] [<ffffffffc1c8da50>] ? osp_sync_process_committed+0x700/0x700 [osp]
> [ 7201.731672] [<ffffffffbd2c5da1>] kthread+0xd1/0xe0
> [ 7201.734016] [<ffffffffbd2c5cd0>] ? insert_kthread_work+0x40/0x40
> [ 7201.736329] [<ffffffffbd995df7>] ret_from_fork_nospec_begin+0x21/0x21
> [ 7201.738617] [<ffffffffbd2c5cd0>] ? insert_kthread_work+0x40/0x40
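>
> (Since kdump is loaded, each of these panics should leave a vmcore that can be walked with the crash utility; a sketch assuming CentOS default paths and a matching kernel-debuginfo package -- the dump directory name is a placeholder:)
>
> # hedged sketch; needs kernel-debuginfo for 3.10.0-1160.21.1.el7.x86_64
> crash /usr/lib/debug/lib/modules/3.10.0-1160.21.1.el7.x86_64/vmlinux \
>       /var/crash/<dump-dir>/vmcore
> # inside crash: 'bt' prints the LBUG backtrace; 'ps | grep osp-syn' lists the osp sync threads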
>
>