[lustre-discuss] [EXTERNAL] Accessing files with bad PFL causing MDS kernel panics

Nathan Crawford nrcrawfo at uci.edu
Tue Oct 25 15:56:04 PDT 2022


Hi Rick,

  I did attempt that, and while subsequent access didn't cause an MDS
panic, the client threw errors like "cannot get group lock: Invalid
argument (22)".

  I'm going to attempt the patch and workaround from LU-16194 suggested by
Andreas a couple of hours ago on the LU-16152 bug report.

  My guess is that normal people set the PFL components directly as
arguments to lfs setstripe, or reference an existing file's PFL with
--copy. Both of those methods work fine, but I took the fancy yaml route.
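
  For anyone checking the arithmetic behind the "16 EiB - 2 GiB" value
mentioned in the quoted message below, here is a minimal sketch. It assumes
(this is my guess, not something confirmed from the lfs source) that the yaml
value passes through a signed 32-bit integer and is then widened to an
unsigned 64-bit extent. The helper name sign_extend_32_to_u64 is made up for
illustration.

```python
GIB = 1 << 30  # bytes in one GiB
EIB = 1 << 60  # bytes in one EiB


def sign_extend_32_to_u64(value: int) -> int:
    """Model a value squeezed through a signed 32-bit int, then widened to u64.

    Assumption for illustration only: this is one plausible overflow
    mechanism, not a statement about what lfs actually does internally.
    """
    low = value & 0xFFFFFFFF           # keep only the low 32 bits
    if low & 0x80000000:               # sign bit set: value is negative as int32
        low -= 1 << 32
    return low & 0xFFFFFFFFFFFFFFFF    # reinterpret as unsigned 64-bit


if __name__ == "__main__":
    bad_end = sign_extend_32_to_u64(2 * GIB)
    print(bad_end == 16 * EIB - 2 * GIB)               # 2 GiB wraps to 16 EiB - 2 GiB
    print(sign_extend_32_to_u64(1 * GIB) == 1 * GIB)   # below 2 GiB is unchanged
```

  Under that assumption an extent of exactly 2 GiB lands on 16 EiB - 2 GiB,
which matches what I see. The sketch says nothing about how lfs treats values
above 2 GiB.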

Thanks,
Nate



On Tue, Oct 25, 2022 at 2:51 PM Mohr, Rick <mohrrf at ornl.gov> wrote:

> Nate,
>
> For the example layout you attached, it looks like the file does not have
> any data in the components with the messed up extent_end value.  Have you
> tried using "lfs setstripe --component-del" to delete just those messed up
> components and see if you can then access the data?
>
> --Rick
>
>
> On 10/25/22, 4:43 PM, "lustre-discuss on behalf of Nathan Crawford" <
> lustre-discuss-bounces at lists.lustre.org on behalf of nrcrawfo at uci.edu>
> wrote:
>
>     Hi All,
>       I'm looking for possible workarounds to recover data from some
> mis-migrated files (as seen in LU-16152). Basically, there's a bug in "lfs
> setstripe --yaml" where extent start/end values in the yaml file >= 2 GiB
> overflow to 16 EiB - 2 GiB.
>
>       Using lfs_migrate, I re-striped many files in directories with a
> default striping pattern containing these values.  I'm pretty sure that the
> data exists (was trying to purge an older OST, and disk usage on the other
> OSTs increased as the purged OST decreased), and an lfsck procedure happily
> returns after a day or so. Unfortunately, attempts to access or re-migrate
> the files trigger a kernel panic on the MDS with:
>
>     LustreError: 12576:0:(osd_io.c:311:kmem_to_page()) ASSERTION( !((unsigned long)addr & ~(~(((1UL) << 12)-1))) ) failed:
>     LustreError: 12576:0:(osd_io.c:311:kmem_to_page()) LBUG
>
>     Kernel panic - not syncing: LBUG
>
>
>       The servers are Lustre 2.12.8 on OpenZFS 0.8.5 on CentOS 7.9. The
> output from "lfs getstripe -v badfile" is attached.
>
>       I can use lfs find to search for files with these bad extent
> endpoint values, then move them to a quarantine area on the same FS. That
> should let the rest of the system stay up (hopefully), but I still need a
> way to recover the data.
>
>     Thanks!
>     Nate
>
>     --
>     Dr. Nathan Crawford              nathan.crawford at uci.edu
>     Director of Scientific Computing
>     School of Physical Sciences
>     164 Rowland Hall                 Office: 152 Rowland Hall
>     University of California, Irvine  Phone: 949-824-1380
>     Irvine, CA 92697-2025, USA
>
>
