[lustre-discuss] [EXTERNAL] Accessing files with bad PFL causing MDS kernel panics
Nathan Crawford
nrcrawfo at uci.edu
Tue Oct 25 15:56:04 PDT 2022
Hi Rick,
I did attempt that, and while subsequent access didn't cause an MDS
panic, the client threw errors like "cannot get group lock: Invalid
argument (22)".
I'm going to attempt the patch and workaround from LU-16194 suggested by
Andreas a couple hours ago on the LU-16152 bug report.
My guess is that normal people set the PFL components directly as
arguments to lfs setstripe, or reference an existing file's PFL with
--copy. Both of those methods work fine, but I took the fancy yaml route.
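For what it's worth, the "16 EiB - 2 GiB" figure is exactly what you get if a 2 GiB value passes through a signed 32-bit integer and is then sign-extended into an unsigned 64-bit field. That mechanism is only an assumption on my part (I have not confirmed it against the lfs source), and the helper name below is made up for illustration, but the arithmetic itself can be sketched as:

```python
GIB = 1 << 30
EIB = 1 << 60


def sign_extend_32_to_u64(value: int) -> int:
    """Hypothetical helper: treat the low 32 bits of `value` as a signed
    int32, then reinterpret the result as an unsigned 64-bit integer."""
    low32 = value & 0xFFFFFFFF
    if low32 & 0x80000000:      # sign bit set, so the int32 is negative
        low32 -= 1 << 32
    return low32 & 0xFFFFFFFFFFFFFFFF


if __name__ == "__main__":
    bad_end = sign_extend_32_to_u64(2 * GIB)
    print(bad_end)                          # 18446744071562067968
    print(bad_end == 16 * EIB - 2 * GIB)    # True
    # Values below 2 GiB pass through unchanged.
    print(sign_extend_32_to_u64(1 * GIB) == 1 * GIB)  # True
```

Under that assumption, an extent boundary of exactly 2 GiB lands on 16 EiB - 2 GiB, which matches the value seen in the bad layouts.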
Thanks,
Nate
On Tue, Oct 25, 2022 at 2:51 PM Mohr, Rick <mohrrf at ornl.gov> wrote:
> Nate,
>
> For the example layout you attached, it looks like the file does not have
> any data in the components with the messed up extent_end value. Have you
> tried using "lfs setstripe --component-del" to delete just those messed up
> components and see if you can then access the data?
>
> --Rick
>
>
> On 10/25/22, 4:43 PM, "lustre-discuss on behalf of Nathan Crawford" <
> lustre-discuss-bounces at lists.lustre.org on behalf of nrcrawfo at uci.edu>
> wrote:
>
> Hi All,
> I'm looking for possible work-arounds to recover data from some
> mis-migrated files (as seen in LU-16152). Basically, there's a bug in "lfs
> setstripe --yaml" where extent start/end values in the yaml file >= 2GiB
> overflow to 16 EiB - 2 GiB.
>
> Using lfs_migrate, I re-striped many files in directories with a
> default striping pattern containing these values. I'm pretty sure that the
> data exists (was trying to purge an older OST, and disk usage on the other
> OSTs increased as the purged OST decreased), and an lfsck procedure happily
> returns after a day or so. Unfortunately, attempts to access or re-migrate
> the files trigger a kernel panic on the MDS with:
>
> LustreError: 12576:0:(osd_io.c:311:kmem_to_page()) ASSERTION(
> !((unsigned long)addr & ~(~(((1UL) << 12)-1))) ) failed:
> LustreError: 12576:0:(osd_io.c:311:kmem_to_page()) LBUG
>
> Kernel panic - not syncing: LBUG
>
>
> The servers are lustre 2.12.8 on OpenZFS 0.8.5 on CentOS 7.9. The
> output from "lfs getstripe -v badfile" is attached.
>
> I can use lfs find to search for files with these bad extent
> endpoint values, then move them to a quarantine area on the same FS. This
> will allow the rest of the system to stay up (hopefully), but I still need
> to recover the data.
>
> Thanks!
> Nate
>
> --
> Dr. Nathan Crawford nathan.crawford at uci.edu
> Director of Scientific Computing
> School of Physical Sciences
> 164 Rowland Hall Office: 152 Rowland Hall
> University of California, Irvine Phone: 949-824-1380
> Irvine, CA 92697-2025, USA
>
>