[lustre-discuss] lfsck oi_scrub failed counts

Josh Samuelson jsld at 1up.unl.edu
Thu Dec 20 14:50:26 PST 2018


Greetings,

We're looking for suggestions on how to interpret the status output from
the various stages of the 'lctl lfsck_start' command, in particular the
oi_scrub 'failed' counts.

The manual states the following in the 'LFSCK status of OI Scrub' section:
'Failed - total number of objects that failed to be repaired.'
 
A recent 'lctl lfsck_start -M Name-MDT0000' to verify OI, layout and
namespace reported high failed counts in oi_scrub for all of the OSTs
in the FS.  This was unexpected.  We were running an online lfsck
because a single OST had gone read-only while its underlying RAID6
hardware was rebuilding a disk and went through a long period of not
responding to I/O (I'll spare you that tale of woe).  After e2fsck, the
OST had 6 zero-sized inodes, each still carrying a trusted.lma extended
attribute, that e2fsck flagged as "Unattached inode" and that we
manually reconnected to /lost+found.  These 6 inodes showed up in the
/proc/fs/lustre/osd-ldiskfs/<OST>/oi_scrub file:

lf_scanned: 6
lf_repaired: 6
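
For reference, we pulled these counters on each OSS with something like
the following (the wildcard path matches our 2.8.0 osd-ldiskfs layout):

# dump the OI scrub status for every OST on this OSS
lctl get_param -n osd-ldiskfs.*.oi_scrub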

However, this OST and the 49 others also showed 'failed:' counts in
oi_scrub, ranging from roughly 45000 at the low end to roughly 50200 at
the high end.  Here is a snippet from the OST with the lf_* counts above:

first_failure_position: 87
checked: 1784231
updated: 327
failed: 47725
prior_updated: 0
noscrub: 225
igif: 1
success_count: 1

All of the OST oi_scrub status files had the following:
first_failure_position: 87
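
For anyone wanting to reproduce that check, a fan-out over the OSSs
works; the pdsh host list below is just an illustration:

# pull first_failure_position from every OST on every OSS
pdsh -w oss[01-25] "lctl get_param -n osd-ldiskfs.*.oi_scrub | grep first_failure_position"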

All the OSSs have the following default debug settings:

lctl get_param debug
debug=ioctl neterror warning error emerg ha config console lfsck

Running 'lctl debug_kernel dk.txt' on the OSSs and searching for LFSCK
subsystem/debug-mask lines related to scrub activity turned up far
fewer lines than the failed counts.  The scrub LFSCK debug lines looked
similar to the following:

00100000:10000000:17.0:1545248778.311791:0:13132:0:(osd_scrub.c:454:osd_scrub_convert_ff()) Name-OST002a-osd: fail to convert ff [0x100000000:0xb0:0x0]: rc = -17

and I assume -17 is -EEXIST.
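
To double-check the errno and get a rough count of these lines:

# confirm errno 17 is EEXIST
grep -w EEXIST /usr/include/asm-generic/errno-base.h
# rough count of osd_scrub_convert_ff failures in the dump
grep -c osd_scrub_convert_ff dk.txt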

Should we be concerned about these failed counts?  If so, how do we match
failed counts in LFSCK status output to Lustre debug lines so we can find
the cause and try to resolve the problem?
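
In the meantime, on the chance that the debug ring buffer is simply
wrapping and dropping older lines, we may enlarge it and re-run; the
512 MB below is an arbitrary pick:

# grow the kernel debug buffer so scrub lines aren't overwritten
lctl set_param debug_mb=512
# clear the old buffer contents before the re-run
lctl clear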

We're running Lustre 2.8.0 on the servers and clients, in case that
matters at all.

Thank you in advance for any wisdom you can share,
-Josh

