<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">Just to follow up on this thread: upgrading to Lustre 2.12.9 seems to help resolve this issue.<br id="lineBreakAtBeginningOfMessage">
<div><br><blockquote type="cite"><div>On Aug 20, 2024, at 7:37 AM, Makia Minich <makia@systemfabricworks.com> wrote:</div><br class="Apple-interchange-newline"><div><meta http-equiv="content-type" content="text/html; charset=us-ascii"><div style="overflow-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">Wondering if others have seen something similar or know of a remedy.<div><br></div><div>Late last week a room lost power, which meant the filesystem took a hard crash. When power was restored, it looked like the JBODs made it through, and all of the LUNs appear to be healthy (after a little bit of rebuilding). The servers were also able to see the LUNs, so everything looked like it was going better than anticipated.</div><div><br></div><div>The system (both servers and clients) is CentOS 7.9 with Lustre 2.12.7.</div><div><br></div><div>Bringing up the filesystem is when things went sideways. The MGT mounted with no issue (standard recovery messages), and the MDT also mounted. As we proceeded to mount the OSTs, the MDS suddenly rebooted with a kernel panic. 
Looking at dmesg (after it was brought back up), we found the following messages:</div><div><br></div><div><div><font face="Courier New">[ 6867.143694] Lustre: 79890:0:(llog.c:615:llog_process_thread()) lustre01-OST0032-osc-MDT0000: invalid length 0 in llog [0x52ab:0x1:0x0]record for index 0/2</font></div><div><font face="Courier New">[ 6867.143705] Lustre: 79890:0:(llog.c:615:llog_process_thread()) Skipped 1 previous similar message</font></div><div><font face="Courier New">[ 6867.143720] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) lustre01-OST0032-osc-MDT0000: llog process with osp_sync_process_queues failed: -22</font></div><div><font face="Courier New">[ 6867.148800] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) Skipped 1 previous similar message</font></div></div><div><br></div><div>After a few attempts (hoping it was a fluke), the same message kept leading to an assertion failure, and we noticed that this occurred with two specific OSTs. Leaving those two OSTs down, we were able to bring up the rest of the filesystem successfully, but when either of them is mounted, it appears that something is triggered and the MDT crashes. Looking at the OSS, there are no messages other than losing the connection to the MGS (due to the crash).</div><div><br></div><div>We've tried clearing the updatelog and changelog, with no change in behavior, so any other ideas would be appreciated.</div><div><br></div><div>Below is the full dmesg from the start of mounting the MGT:</div><div><br></div><div><div><font face="Courier New">[ 4881.624345] LDISKFS-fs (scinia): mounted filesystem with ordered data mode. Opts: (null)</font></div><div><font face="Courier New">[ 6844.490777] LDISKFS-fs (scinib): mounted filesystem with ordered data mode. 
Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc</font></div><div><font face="Courier New">[ 6845.003014] Lustre: MGS: Connection restored to MGC192.168.240.7@tcp1_0 (at 0@lo)</font></div><div><font face="Courier New">[ 6845.003021] Lustre: Skipped 1 previous similar message</font></div><div><font face="Courier New">[ 6853.385804] Lustre: MGS: Connection restored to b22a0a27-e8e5-a57b-534e-d5f9571b6e9f (at 192.168.9.30@tcp4)</font></div><div><font face="Courier New">[ 6865.882492] LDISKFS-fs (scinia): mounted filesystem with ordered data mode. Opts: user_xattr,errors=remount-ro,no_mbcache,nodelalloc</font></div><div><font face="Courier New">[ 6867.143694] Lustre: 79890:0:(llog.c:615:llog_process_thread()) lustre01-OST0032-osc-MDT0000: invalid length 0 in llog [0x52ab:0x1:0x0]record for index 0/2</font></div><div><font face="Courier New">[ 6867.143705] Lustre: 79890:0:(llog.c:615:llog_process_thread()) Skipped 1 previous similar message</font></div><div><font face="Courier New">[ 6867.143720] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) lustre01-OST0032-osc-MDT0000: llog process with osp_sync_process_queues failed: -22</font></div><div><font face="Courier New">[ 6867.148800] LustreError: 79890:0:(osp_sync.c:1272:osp_sync_thread()) Skipped 1 previous similar message</font></div><div><font face="Courier New">[ 6867.221923] Lustre: lustre01-MDT0000: Imperative Recovery not enabled, recovery window 300-900</font></div><div><font face="Courier New">[ 6867.234362] Lustre: lustre01-MDT0000: in recovery but waiting for the first client to connect</font></div><div><font face="Courier New">[ 6872.207528] Lustre: lustre01-MDT0000: Connection restored to MGC192.168.240.7@tcp1_0 (at 0@lo)</font></div><div><font face="Courier New">[ 6872.207536] Lustre: Skipped 1 previous similar message</font></div><div><font face="Courier New">[ 6902.340582] Lustre: lustre01-MDT0000: Will be in recovery for at least 5:00, or until 7 clients reconnect</font></div><div><font 
face="Courier New">[ 6908.270425] Lustre: lustre01-MDT0000: Connection restored to b22a0a27-e8e5-a57b-534e-d5f9571b6e9f (at 192.168.249.30@tcp2)</font></div><div><font face="Courier New">[ 6908.270429] Lustre: Skipped 4 previous similar messages</font></div><div><font face="Courier New">[ 6908.446460] Lustre: lustre01-MDT0000: Recovery over after 0:06, of 7 clients 7 recovered and 0 were evicted.</font></div><div><font face="Courier New">[ 6977.979707] perf: interrupt took too long (2501 > 2500), lowering kernel.perf_event_max_sample_rate to 79000</font></div><div><font face="Courier New">[ 6984.953509] Lustre: MGS: Connection restored to 83b8e6bc-7407-a532-4b8e-0ae1a4982885 (at 192.168.240.8@tcp1)</font></div><div><font face="Courier New">[ 6984.953517] Lustre: Skipped 2 previous similar messages</font></div><div><font face="Courier New">[ 7115.328345] Lustre: MGS: Connection restored to 1ad84e77-29b8-8d86-73e4-7dcd263c303b (at 192.168.240.9@tcp1)</font></div><div><font face="Courier New">[ 7115.328352] Lustre: Skipped 16 previous similar messages</font></div><div><font face="Courier New">[ 7201.690060] Lustre: 79892:0:(llog.c:615:llog_process_thread()) lustre01-OST0033-osc-MDT0000: invalid length 0 in llog [0x52ad:0x1:0x0]record for index 0/1</font></div><div><font face="Courier New">[ 7201.690069] Lustre: 79892:0:(llog.c:615:llog_process_thread()) Skipped 1 previous similar message</font></div><div><font face="Courier New">[ 7201.690086] LustreError: 79892:0:(osp_sync.c:1272:osp_sync_thread()) lustre01-OST0033-osc-MDT0000: llog process with osp_sync_process_queues failed: -22</font></div><div><font face="Courier New">[ 7201.695902] LustreError: 79892:0:(osp_sync.c:1317:osp_sync_thread()) ASSERTION( atomic_read(&d->opd_sync_rpcs_in_progress) == 0 ) failed: lustre01-OST0033-osc-MDT0000: 1 0 !empty</font></div><div><font face="Courier New">[ 7201.701242] LustreError: 79892:0:(osp_sync.c:1317:osp_sync_thread()) LBUG</font></div><div><font face="Courier New">[ 
7201.703862] Pid: 79892, comm: osp-syn-51-0 3.10.0-1160.21.1.el7.x86_64 #1 SMP Tue Mar 16 18:28:22 UTC 2021</font></div><div><font face="Courier New">[ 7201.703865] Call Trace:</font></div><div><font face="Courier New">[ 7201.703877] [<ffffffffc0f007cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]</font></div><div><font face="Courier New">[ 7201.703896] [<ffffffffc0f0087c>] lbug_with_loc+0x4c/0xa0 [libcfs]</font></div><div><font face="Courier New">[ 7201.703909] [<ffffffffc1c8e5c8>] osp_sync_thread+0xb78/0xb80 [osp]</font></div><div><font face="Courier New">[ 7201.703926] [<ffffffffbd2c5da1>] kthread+0xd1/0xe0</font></div><div><font face="Courier New">[ 7201.703937] [<ffffffffbd995df7>] ret_from_fork_nospec_end+0x0/0x39</font></div><div><font face="Courier New">[ 7201.703945] [<ffffffffffffffff>] 0xffffffffffffffff</font></div><div><font face="Courier New">[ 7201.703984] Kernel panic - not syncing: LBUG</font></div><div><font face="Courier New">[ 7201.706561] CPU: 37 PID: 79892 Comm: osp-syn-51-0 Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.21.1.el7.x86_64 #1</font></div><div><font face="Courier New">[ 7201.711716] Hardware name: Dell Inc. VxFlex integrated rack R640 S/0H28RR, BIOS 2.9.4 11/06/2020</font></div><div><font face="Courier New">[ 7201.714311] Call Trace:</font></div><div><font face="Courier New">[ 7201.716865] [<ffffffffbd98305a>] dump_stack+0x19/0x1b</font></div><div><font face="Courier New">[ 7201.719418] [<ffffffffbd97c5b2>] panic+0xe8/0x21f</font></div><div><font face="Courier New">[ 7201.721938] [<ffffffffc0f008cb>] lbug_with_loc+0x9b/0xa0 [libcfs]</font></div><div><font face="Courier New">[ 7201.724425] [<ffffffffc1c8e5c8>] osp_sync_thread+0xb78/0xb80 [osp]</font></div><div><font face="Courier New">[ 7201.726873] [<ffffffffbd98899f>] ? __schedule+0x3af/0x860</font></div><div><font face="Courier New">[ 7201.729286] [<ffffffffc1c8da50>] ? 
osp_sync_process_committed+0x700/0x700 [osp]</font></div><div><font face="Courier New">[ 7201.731672] [<ffffffffbd2c5da1>] kthread+0xd1/0xe0</font></div><div><font face="Courier New">[ 7201.734016] [<ffffffffbd2c5cd0>] ? insert_kthread_work+0x40/0x40</font></div><div><font face="Courier New">[ 7201.736329] [<ffffffffbd995df7>] ret_from_fork_nospec_begin+0x21/0x21</font></div><div><font face="Courier New">[ 7201.738617] [<ffffffffbd2c5cd0>] ? insert_kthread_work+0x40/0x40</font></div></div><div><br>
<br></div></div></div></blockquote></div><br></body></html>