[lustre-discuss] Resolving a stuck OI scrub thread

William D. Colburn wcolburn at nrao.edu
Wed Apr 20 11:36:38 PDT 2022


Back in March I wrote here about our corrupt file system
(http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2022-March/018007.html).
We are still trying to fix it.  Since then we have found (we think) all
of the files that hang when they are accessed, and unlinked them from
the filesystem.  We ran an lfsck, which seemed to do a lot for half a
day but has since gone quiet; it hasn't stopped, though.

[root@aocmds ~]# lctl lfsck_query | grep -v ': 0$'
layout_mdts_scanning-phase1: 1
layout_osts_scanning-phase2: 48
layout_repaired: 609777
namespace_mdts_scanning-phase1: 1
namespace_repaired: 9
[root@aocmds ~]#
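
I believe the per-MDT lfsck status files give more detail than
lfsck_query (current checkpoint position, time since the last
checkpoint), so that is probably the best place to confirm whether the
scan has really stalled; from memory the parameters are something like:

[root@aocmds ~]# lctl get_param -n mdd.aoclst03-MDT0000.lfsck_layout
[root@aocmds ~]# lctl get_param -n mdd.aoclst03-MDT0000.lfsck_namespace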

We are still getting a lot of errors on the OSS that hosts the corrupt
OST.  Most of the OSTs have 20 or fewer destroys_in_flight, but the
corrupt one has 1899729.  It also produces a lot of syslog messages,
most of which look like this:

Apr 20 12:20:16 aocoss04 kernel: Lustre: 64377:0:(osd_scrub.c:186:osd_scrub_refresh_mapping()) aoclst03-OST000e: fail to refresh OI map for scrub op 2 [0x100000000:0x1233bbf:0x0] => 1750088/1273942946: rc = -17
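
(In case the exact numbers matter: I believe the per-OST destroy
counters can be read on the MDS with something like the following; the
parameter path is from memory.)

[root@aocmds ~]# lctl get_param osp.*.destroys_in_flight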

Those -17 errors look like a symptom to me, though, not the cause.  I
think the cause is buried in these messages:

Apr 20 12:20:56 aocoss04 kernel: Lustre: 64701:0:(osd_scrub.c:767:osd_scrub_post()) sdd: OI scrub post, result = 1
Apr 20 12:20:56 aocoss04 kernel: Lustre: 64701:0:(osd_scrub.c:1551:osd_scrub_main()) sdd: OI scrub: stop, pos = 45780993: rc = 1
Apr 20 12:20:56 aocoss04 kernel: Lustre: 64908:0:(osd_scrub.c:669:osd_scrub_prep()) sdd: OI scrub prep, flags = 0x4e
Apr 20 12:20:56 aocoss04 kernel: Lustre: 64908:0:(osd_scrub.c:279:osd_scrub_file_reset()) sdd: reset OI scrub file, old flags = 0x0, add flags = 0x0
Apr 20 12:20:56 aocoss04 kernel: Lustre: 64908:0:(osd_scrub.c:1541:osd_scrub_main()) sdd: OI scrub start, flags = 0x4e, pos = 12
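
If it helps with diagnosis, I think the OI scrub state on the OSS can
be dumped directly; from memory the parameter looks something like this
(OST000e being the corrupt one), and the output should show the scrub
status, flags, and current position:

[root@aocoss04 ~]# lctl get_param -n osd-ldiskfs.aoclst03-OST000e.oi_scrub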

Digging into the source code, it looks like osd_scrub_post() is
discovering that a thread already exists for doing the scrub, and so it
aborts.  Each time a file on that OST is removed, destroys_in_flight
appears to be incremented.  My best guess is that the thread is hung.
I want to try stopping the lfsck (which doesn't seem to have done
anything in a little over twelve hours), then rebooting the OSS to clear
that kernel thread, then restarting the lfsck to try again (a rough
sketch of the commands is below the code excerpt).

lustre-2.10.8/lustre/osd-ldiskfs/osd_scrub.c
        if (!scrub->os_full_speed && !scrub->os_partial_scan) {
                struct l_wait_info lwi = { 0 };
                struct osd_otable_it *it = dev->od_otable_it;
                struct osd_otable_cache *ooc = &it->ooi_cache;

                l_wait_event(thread->t_ctl_waitq,
                             it->ooi_user_ready || !thread_is_running(thread),
                             &lwi);
                if (unlikely(!thread_is_running(thread)))
                        GOTO(post, rc = 0);
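
Concretely, the sequence I have in mind is something like the following
(device names per our setup, flags from memory, so corrections are
welcome):

[root@aocmds ~]# lctl lfsck_stop -M aoclst03-MDT0000
(reboot aocoss04 to clear the stuck scrub thread)
[root@aocmds ~]# lctl lfsck_start -M aoclst03-MDT0000 -t all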


One problem we did have, and don't want to repeat, is that the layout
part of the lfsck was chowning files yesterday, and a lot of cluster
jobs failed because they started getting permission denied on their
files.  The logs say that files were chowned from root to the user,
which sounds like user jobs should have been failing before the lfsck
and working after it, but the errors happened during the part of the
run when these log messages were being generated.

Apr 19 14:35:07 aocmds kernel: Lustre: 126605:0:(lfsck_layout.c:3906:lfsck_layout_repair_owner()) aoclst03-MDT0000-osd: layout LFSCK assistant repaired inconsistent file owner for: parent [0x20000b669:0xc03f:0x0], child [0x100040000:0x3220d02:0x0], OST-index 4, stripe-index 0, old owner 0/0, new owner 5916/335: rc = 1
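
As an aside, I believe the parent FIDs in these messages can be mapped
back to paths from a client with lfs fid2path; the mount point below is
just a placeholder for wherever the filesystem is mounted:

lfs fid2path /lustre/aoclst03 '[0x20000b669:0xc03f:0x0]'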

Does anyone have any advice about a) stopping the lfsck, rebooting the
OSS, and restarting the lfsck to try to clear the hung thread and start
processing the destroys_in_flight, and b) whether
lfsck_layout_repair_owner() is likely to run again, or have we probably
resolved those issues?

--Schlake
  Sysadmin IV, NRAO
  Work: 575-835-7281 (BACK IN THE OFFICE!)
  Cell: 575-517-5668 (out of work hours)

