[lustre-discuss] Accessing files with bad PFL causing MDS kernel panics
Nathan Crawford
nrcrawfo at uci.edu
Tue Oct 25 13:41:08 PDT 2022
Hi All,
I'm looking for possible workarounds to recover data from some
mis-migrated files (as seen in LU-16152). Basically, there is a bug in "lfs
setstripe --yaml" where extent start/end values >= 2 GiB in the YAML file
overflow to 16 EiB - 2 GiB.
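For reference, the corrupted value that shows up in the attached layout
(18446744071562067968) is exactly what you get when 2 GiB is pushed through
a signed 32-bit field and then sign-extended into an unsigned 64-bit extent.
A quick bash sanity check (just illustrating the arithmetic, not the Lustre
code path):

```shell
# 2 GiB does not fit in a signed 32-bit field, so it goes negative;
# reinterpreted as an unsigned 64-bit extent it becomes
# 2^64 - 2 GiB = 16 EiB - 2 GiB.
two_gib=$((1 << 31))                    # 2147483648
neg=$(( (two_gib << 32) >> 32 ))        # sign-extend: -2147483648
printf 'bad extent value: %u\n' "$neg"  # 18446744071562067968
```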
Using lfs_migrate, I re-striped many files in directories whose default
striping pattern contained these values. I'm pretty sure the data still
exists (I was trying to purge an older OST, and disk usage on the other OSTs
increased as the purged OST's decreased), and an lfsck run happily
completes after a day or so. Unfortunately, any attempt to access or re-migrate
the files triggers a kernel panic on the MDS with:
LustreError: 12576:0:(osd_io.c:311:kmem_to_page()) ASSERTION( !((unsigned
long)addr & ~(~(((1UL) << 12)-1))) ) failed:
LustreError: 12576:0:(osd_io.c:311:kmem_to_page()) LBUG
Kernel panic - not syncing: LBUG
The servers are Lustre 2.12.8 on OpenZFS 0.8.5 on CentOS 7.9. The output
from "lfs getstripe -v badfile" is attached.
I can use lfs find to search for files with these bad extent endpoint
values, then move them to a quarantine area on the same file system. That
should (hopefully) let the rest of the system stay up, but the data still
needs to be recovered.
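For anyone hitting the same thing, the quarantine step might look roughly
like the sketch below. The paths are placeholders, and the exact option
spelling and size threshold for --component-end should be checked against
the local lfs version before running anything:

```shell
# Sketch: find files whose PFL component end overflowed past a sane size
# (anything over 1 EiB is bogus here) and move them aside so normal access
# patterns can't trip the MDS LBUG. /lustre and /lustre/.quarantine are
# example paths only.
FS=/lustre
QUAR=$FS/.quarantine
mkdir -p "$QUAR"
lfs find "$FS" --component-end +1E -print0 |
  while IFS= read -r -d '' f; do
    mv "$f" "$QUAR/"
  done
```

Note that this only isolates the files; it does not repair the layouts.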
Thanks!
Nate
--
Dr. Nathan Crawford nathan.crawford at uci.edu
Director of Scientific Computing
School of Physical Sciences
164 Rowland Hall Office: 152 Rowland Hall
University of California, Irvine Phone: 949-824-1380
Irvine, CA 92697-2025, USA
-------------- next part --------------
composite_header:
  lcm_magic:        0x0BD60BD0
  lcm_size:         552
  lcm_flags:        0
  lcm_layout_gen:   2
  lcm_mirror_count: 1
  lcm_entry_count:  5
components:
  - lcme_id:             131073
    lcme_mirror_id:      2
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   131072
    lcme_offset:         272
    lcme_size:           32
    sub_layout:
      lmm_magic:         0x0BD10BD0
      lmm_seq:           0x20000b836
      lmm_object_id:     0x19a
      lmm_fid:           [0x20000b836:0x19a:0x0]
      lmm_stripe_count:  0
      lmm_stripe_size:   131072
      lmm_pattern:       mdt
      lmm_layout_gen:    0
      lmm_stripe_offset: 0
  - lcme_id:             131074
    lcme_mirror_id:      2
    lcme_flags:          init
    lcme_extent.e_start: 131072
    lcme_extent.e_end:   16777216
    lcme_offset:         304
    lcme_size:           56
    sub_layout:
      lmm_magic:         0x0BD10BD0
      lmm_seq:           0x20000b836
      lmm_object_id:     0x19a
      lmm_fid:           [0x20000b836:0x19a:0x0]
      lmm_stripe_count:  1
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 6
      lmm_objects:
        - 0: { l_ost_idx: 6, l_fid: [0x100060000:0x2de85:0x0] }
  - lcme_id:             131075
    lcme_mirror_id:      2
    lcme_flags:          init
    lcme_extent.e_start: 16777216
    lcme_extent.e_end:   1073741824
    lcme_offset:         360
    lcme_size:           80
    sub_layout:
      lmm_magic:         0x0BD10BD0
      lmm_seq:           0x20000b836
      lmm_object_id:     0x19a
      lmm_fid:           [0x20000b836:0x19a:0x0]
      lmm_stripe_count:  2
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 2
      lmm_objects:
        - 0: { l_ost_idx: 2, l_fid: [0x100020000:0x861394f:0x0] }
        - 1: { l_ost_idx: 5, l_fid: [0x100050000:0x12e4858e:0x0] }
  - lcme_id:             131076
    lcme_mirror_id:      2
    lcme_flags:          0
    lcme_extent.e_start: 1073741824
    lcme_extent.e_end:   18446744071562067968
    lcme_offset:         440
    lcme_size:           56
    sub_layout:
      lmm_magic:         0x0BD10BD0
      lmm_seq:           0x20000b836
      lmm_object_id:     0x19a
      lmm_fid:           [0x20000b836:0x19a:0x0]
      lmm_stripe_count:  4
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1
  - lcme_id:             131077
    lcme_mirror_id:      2
    lcme_flags:          0
    lcme_extent.e_start: 18446744071562067968
    lcme_extent.e_end:   EOF
    lcme_offset:         496
    lcme_size:           56
    sub_layout:
      lmm_magic:         0x0BD10BD0
      lmm_seq:           0x20000b836
      lmm_object_id:     0x19a
      lmm_fid:           [0x20000b836:0x19a:0x0]
      lmm_stripe_count:  8
      lmm_stripe_size:   1048576
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1