[lustre-discuss] lfsck oi_scrub failed counts
jsld at 1up.unl.edu
Thu Dec 20 14:50:26 PST 2018
We're looking for suggestions on how to interpret the status output of
the various stages of the 'lctl lfsck_start' command, in particular
the oi_scrub failed counts.
The manual states the following in the 'LFSCK status of OI Scrub' section:
'Failed - total number of objects that failed to be repaired.'
A recent 'lctl lfsck_start -M Name-MDT0000' to verify OI, layout and
namespace reported high failed counts on the oi_scrub for all of our
OSTs in the FS. This was unexpected. We were running an online lfsck
because we had a single OST go read-only whilst the underlying RAID6
hardware was rebuilding a disk and stopped responding to I/O for a
long period (I'll spare you that tale of woe). The subsequent e2fsck
of that OST found 6 zero-sized inodes carrying trusted.lma extended
attributes, reported as "Unattached inode", which were manually routed
to /lost+found. These 6 inodes showed up in the
/proc/fs/lustre/osd-ldiskfs/<OST>/oi_scrub file:
However, this OST and the 49 other OSTs in the FS also showed 'failed:'
counts in oi_scrub, ranging from ~45000 at the low end to ~50200 at the
high end. A snippet from the OST with the above lf_* counts:
All of the OST oi_scrub status files had the following:
All the OSSs have the following default debug settings:
lctl get_param debug
debug=ioctl neterror warning error emerg ha config console lfsck
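One possibility worth ruling out (an assumption on my part, not something the
status output confirms) is that the gap between failed counts and debug lines
is partly ring-buffer overrun: the kernel debug log is a fixed-size ring, so
older LFSCK lines can be overwritten before 'lctl debug_kernel' dumps them. A
sketch of widening the buffer before the next scrub pass (the 1024 MB size is
arbitrary; the guard makes this a no-op on hosts without Lustre utilities):

```shell
# Sketch, assuming part of the mismatch is debug ring-buffer overrun.
# Guarded so it does nothing on machines without the Lustre tools.
if command -v lctl >/dev/null 2>&1; then
    lctl set_param debug_mb=1024   # enlarge the debug buffer (size is arbitrary)
    lctl set_param debug=+lfsck    # make sure the lfsck mask stays enabled
    lctl clear                     # empty the ring just before the scrub run
    # ... run 'lctl lfsck_start -M Name-MDT0000' and wait for completion ...
    lctl debug_kernel dk.txt       # dump while the scrub lines are still fresh
fi
```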
Performing an 'lctl debug_kernel dk.txt' on the OSSs and searching for
LFSCK subsystem/debug-mask lines that appear to be involved in scrub
activity turned up far fewer lines than the failed counts. The scrub
LFSCK debug lines looked similar to the following:
00100000:10000000:17.0:1545248778.311791:0:13132:0:(osd_scrub.c:454:osd_scrub_convert_ff()) Name-OST002a-osd: fail to convert ff [0x100000000:0xb0:0x0]: rc = -17
and I assume -17 is -EEXIST.
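One way we've started the matching (a rough sketch: the dk.txt name follows
the command above, and the parsing assumes the field layout of the single
line quoted, so adjust for your real dump) is to tally the scrub failure
lines per OST device and errno, then compare those totals with each OST's
'failed:' count in oi_scrub:

```shell
# Sketch: tally osd_scrub failure lines per OST and errno from an
# 'lctl debug_kernel' dump. The sample line below is the one quoted
# above; point the awk at your real dk.txt instead.
cat > dk.txt <<'EOF'
00100000:10000000:17.0:1545248778.311791:0:13132:0:(osd_scrub.c:454:osd_scrub_convert_ff()) Name-OST002a-osd: fail to convert ff [0x100000000:0xb0:0x0]: rc = -17
EOF
awk '/osd_scrub/ && /rc = -/ {
    # pull the "<fsname>-OSTxxxx-osd" device token and the errno
    for (i = 1; i <= NF; i++)
        if ($i ~ /-OST[0-9a-f]+-osd:$/) { ost = $i; sub(/:$/, "", ost) }
    match($0, /rc = -[0-9]+/)
    rc = substr($0, RSTART + 5, RLENGTH - 5)
    count[ost " rc=" rc]++
}
END { for (k in count) print k, count[k] }' dk.txt > scrub_fails.txt
cat scrub_fails.txt
```

For the quoted line this prints "Name-OST002a-osd rc=-17 1"; if the per-OST
totals stay far below the oi_scrub failed counts, the debug log is probably
not capturing every failure.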
Should we be concerned about these failed counts? If so, how do we match
failed counts in LFSCK status output to Lustre debug lines so we can find
the cause and try to resolve the problem?
We're running Lustre 2.8.0 on the servers and clients, in case that matters.
Thank you in advance for any wisdom you can share,