[lustre-discuss] Wrong --index set for OST

rodger rodger at csag.uct.ac.za
Wed Sep 27 03:32:44 PDT 2017


Dear All,

Many thanks to those who offered help.

I completed rebuilding the index values for all the OSTs this morning. 
The procedure I followed was:

0. I had the MDS mounted, with the OSTs all mounted as ldiskfs on each of
their OSSs. I also had a client mounted on a server separate from the
MDS/OSTs.
1. On the client machine:
lfs find --ost ## --stripe-count=1 /exports/terra | grep <something useful>
This allowed me to find a file specific to a given OST [index value] whose
identity I had some chance of confirming by looking at its content. I used
readme-type files, source code, job submission scripts, etc.
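For example, to look for single-striped readme files thought to live on
OST index 37 (0x25) - the index and the grep pattern here are purely
illustrative:

lfs find --ost 37 --stripe-count=1 /exports/terra | grep -i readme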
2. On the client machine, take the known file and run:
lfs getstripe <some file>
This returned the obdidx plus the object id (objid).
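The output has roughly this shape (path and values invented, just to show
where obdidx and objid appear):

lfs getstripe /exports/terra/someproject/README
/exports/terra/someproject/README
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_stripe_offset:  37
        obdidx           objid           objid           group
            37         8815943        0x868547               0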
3. Take the objid for the file and, on each OSS [I have only three at
present], run:
locate -b '\<objid>'
I had previously made an mlocate db on each OSS covering all of its OSTs.
This gives a list of object files for the candidate file.
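Since the OSTs were already mounted as ldiskfs, an alternative to the
mlocate db is to look for the object directly: on an ldiskfs OST the
objects live under O/0/d<objid mod 32>/<objid>, so something like the
following (mount point and objid are illustrative):

OBJID=8815943
ls -l /mnt/ost_ldiskfs/O/0/d$((OBJID % 32))/$OBJID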

Because the OSTs are mounted by their LUN, the above steps trace a path
from the index the MDS thinks a file is on to the candidate LUN that
actually holds the file. This can be used to build a tunefs script with
something like:

tunefs.lustre --erase-params --mgsnode=10.0.70.1@tcp1 --writeconf
--fsname=terra --index=##  /dev/terra/lunlocation

for each LUN.
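To keep the index-to-LUN mapping in one place, a small wrapper along
these lines can generate the calls (the map file and its format are my
own invention - one "<index> <lun>" pair per line):

# ost_index_map.txt contains lines like: 29 /dev/terra/lun29
while read idx lun; do
    tunefs.lustre --erase-params --mgsnode=10.0.70.1@tcp1 --writeconf \
        --fsname=terra --index=$idx $lun
done < ost_index_map.txt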

4. I ran tunefs.lustre with --erase-params/--writeconf on the MDS and then
on each OSS.

5. Mounted the MDS. Waited a few minutes until the logs had settled, then
mounted each of the OSTs on its respective OSS.

Here I encountered an error where the mount process reports a file not
found. I think the origin of this is that the last_rcvd value and the new
index value clash in some way that I do not fully understand. An
additional problem is that files in the quota_slave directory can also
cause the mount to fail. I had to remount these OSTs as ldiskfs, move the
offending files to scratch, and remount.
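For the affected OSTs that step was roughly the following (device, mount
point and scratch paths are illustrative; the offenders in my case were
last_rcvd and files under quota_slave):

mount -t ldiskfs /dev/terra/lunlocation /mnt/ost_ldiskfs
mv /mnt/ost_ldiskfs/last_rcvd /scratch/ost_backup/
mv /mnt/ost_ldiskfs/quota_slave/* /scratch/ost_backup/
umount /mnt/ost_ldiskfs
mount -t lustre /dev/terra/lunlocation /mnt/ost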

One more issue surfaced on some OSTs. During the mount an 'invalid
parameter' message was returned on a few LUNs. This is caused - I think -
by erasing the quota_slave files under ldiskfs. I ran e2fsck -v -f -y on
the LUN, which recovered 2 to 4 files into lost+found. After erasing these
the LUN would mount as type lustre.
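Per affected LUN that amounted to something like this (device and mount
point are illustrative):

e2fsck -v -f -y /dev/terra/lunlocation
mount -t ldiskfs /dev/terra/lunlocation /mnt/ost_ldiskfs
ls /mnt/ost_ldiskfs/lost+found        # the 2 to 4 recovered files
rm -rf /mnt/ost_ldiskfs/lost+found/*
umount /mnt/ost_ldiskfs
mount -t lustre /dev/terra/lunlocation /mnt/ost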

6. For our setup - iSCSI based - it is essential to specify -o
max_sectors_kb=2048 in the mount. If this is not done then multipath
keeps failing paths with various somewhat misleading errors, the most
common being reason code 0x5, and when this happens the system is
extremely slow. I think the underlying cause is that our array - a Dell
MD3200i - returns 0 when queried for this parameter, the kernel defaults
max_hw_sectors_kb to 32767, Lustre then sets max_sectors_kb to 16k, and
things go horribly wrong. I believe the ability to control this came
along in 2.10, which is likely why we have never succeeded in previous
upgrade attempts.
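Concretely, our OST mount line now looks like this (the multipath device
name and mount point are illustrative):

mount -t lustre -o max_sectors_kb=2048 /dev/mapper/mpath_ost25 /mnt/ost25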

At present I have the system up. However, all is not over yet! I think
that because I had started an lfsck there are still issues. I can now see
files that were previously listed as ??????, but ls in some portions of
the tree takes forever to come back - if it ever does - with messages
like the following being logged:

Sep 27 12:02:24 oss2l210 kernel: Lustre: terra-OST0025: trigger OI scrub 
by RPC for the [0x100250000:0x0:0x0] with flags 0x4a, rc = 0
Sep 27 12:02:24 oss2l210 kernel: Lustre: Skipped 43 previous similar 
messages
Sep 27 12:09:03 oss2l210 kernel: LustreError: 
3053:0:(ofd_dev.c:1592:ofd_create_hdl()) terra-OST001d: Can't find FID 
Sequence 0x0: rc = -115
Sep 27 12:09:03 oss2l210 kernel: LustreError: 
3053:0:(ofd_dev.c:1592:ofd_create_hdl()) Skipped 8965 previous similar 
messages
Sep 27 12:12:35 oss2l210 kernel: Lustre: terra-OST0025: trigger OI scrub 
by RPC for the [0x100250000:0x0:0x0] with flags 0x4a, rc = 0
Sep 27 12:12:35 oss2l210 kernel: Lustre: Skipped 36 previous similar 
messages
Sep 27 12:19:03 oss2l210 kernel: LustreError: 
3051:0:(ofd_dev.c:1592:ofd_create_hdl()) terra-OST000a: Can't find FID 
Sequence 0x0: rc = -115
Sep 27 12:19:03 oss2l210 kernel: LustreError: 
3051:0:(ofd_dev.c:1592:ofd_create_hdl()) Skipped 8967 previous similar 
messages
Sep 27 12:22:48 oss2l210 kernel: Lustre: terra-OST0025: trigger OI scrub 
by RPC for the [0x100250000:0x0:0x0] with flags 0x4a, rc = 0
Sep 27 12:22:48 oss2l210 kernel: Lustre: Skipped 38 previous similar 
messages

Am I right in thinking that starting a fresh lfsck:

lctl lfsck_start --reset --all

will help with this?
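(I assume progress of that run can then be followed on the MDS with
something along the lines of:

lctl get_param -n mdd.terra-MDT0000.lfsck_layout
lctl get_param -n mdd.terra-MDT0000.lfsck_namespace

in addition to the aggregate status output shown further down in this
thread.)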

Sorry for all the detail! Hope it may be of help to someone else.

Regards,
Rodger


On 26/09/2017 19:38, Dilger, Andreas wrote:
> On Sep 26, 2017, at 07:35, Ben Evans <bevans at cray.com> wrote:
>>
>> I'm guessing on the osts, but what you'd want to do is to find files that
>> are striped to a single OST using "lfs getstripe".  You'll need one file
>> per OST.
>>
>> After that, you'll have to do something like iterate through the OSTs to
>> find the right combo where an ls -l works for that file.  Keep track of
>> what OST indexes map to what devices, because you'll be destroying them
>> pretty constantly until you resolve all of them.
> 
> I don't think you need to iterate through the configuration each time,
> which would take ages to do.  Rather, just do the "lfs getstripe" on a
> few files, and then find which OSTs have object IDs (under the O/0/d*
> directories) that match the required index.
> 
> Essentially, just make a NxN grid of "current index" vs "actual index"
> and then start crossing out boxes when the "lfs getstripe" returns an
> OST object that doesn't actually exist on the OST (assuming the LFSCK
> run didn't mess that up too badly).
> 
>> Each time you change an OST index, you'll need to do tunefs.lustre
>> --writeconf on *all* devices to make them register with the MGS again.
>>
>> -Ben Evans
>>
>> On 9/26/17, 1:08 AM, "lustre-discuss on behalf of rodger"
>> <lustre-discuss-bounces at lists.lustre.org on behalf of
>> rodger at csag.uct.ac.za> wrote:
>>
>>> Dear All,
>>>
>>> Apologies for nagging on this!
>>>
>>> Does anyone have any insight on assessing progress of the lfsck?
>>>
>>> Does anyone have experience of fixing incorrect index values on OST?
>>>
>>> Regards,
>>> Rodger
>>>
>>> On 25/09/2017 11:21, rodger wrote:
>>>> Dear All,
>>>>
>>>> I'm still struggling with this. I am running an lfsck -A at present.
>>>> The
>>>> status update is reporting:
>>>>
>>>> layout_mdts_init: 0
>>>> layout_mdts_scanning-phase1: 1
>>>> layout_mdts_scanning-phase2: 0
>>>> layout_mdts_completed: 0
>>>> layout_mdts_failed: 0
>>>> layout_mdts_stopped: 0
>>>> layout_mdts_paused: 0
>>>> layout_mdts_crashed: 0
>>>> layout_mdts_partial: 0
>>>> layout_mdts_co-failed: 0
>>>> layout_mdts_co-stopped: 0
>>>> layout_mdts_co-paused: 0
>>>> layout_mdts_unknown: 0
>>>> layout_osts_init: 0
>>>> layout_osts_scanning-phase1: 0
>>>> layout_osts_scanning-phase2: 12
>>>> layout_osts_completed: 0
>>>> layout_osts_failed: 30
>>>> layout_osts_stopped: 0
>>>> layout_osts_paused: 0
>>>> layout_osts_crashed: 0
>>>> layout_osts_partial: 0
>>>> layout_osts_co-failed: 0
>>>> layout_osts_co-stopped: 0
>>>> layout_osts_co-paused: 0
>>>> layout_osts_unknown: 0
>>>> layout_repaired: 82358851
>>>> namespace_mdts_init: 0
>>>> namespace_mdts_scanning-phase1: 1
>>>> namespace_mdts_scanning-phase2: 0
>>>> namespace_mdts_completed: 0
>>>> namespace_mdts_failed: 0
>>>> namespace_mdts_stopped: 0
>>>> namespace_mdts_paused: 0
>>>> namespace_mdts_crashed: 0
>>>> namespace_mdts_partial: 0
>>>> namespace_mdts_co-failed: 0
>>>> namespace_mdts_co-stopped: 0
>>>> namespace_mdts_co-paused: 0
>>>> namespace_mdts_unknown: 0
>>>> namespace_osts_init: 0
>>>> namespace_osts_scanning-phase1: 0
>>>> namespace_osts_scanning-phase2: 0
>>>> namespace_osts_completed: 0
>>>> namespace_osts_failed: 0
>>>> namespace_osts_stopped: 0
>>>> namespace_osts_paused: 0
>>>> namespace_osts_crashed: 0
>>>> namespace_osts_partial: 0
>>>> namespace_osts_co-failed: 0
>>>> namespace_osts_co-stopped: 0
>>>> namespace_osts_co-paused: 0
>>>> namespace_osts_unknown: 0
>>>> namespace_repaired: 68265278
>>>>
>>>> with the layout_repaired and namespace_repaired values ticking up at
>>>> about 10000 per second.
>>>>
>>>> Is the layout_osts_failed value of 30 a concern?
>>>>
>>>> Is there any way to know how far along it is?
>>>>
>>>> I am also seeing many messages similar to the following in
>>>> /var/log/messages on the mdt and oss with OST0000:
>>>>
>>>> Sep 25 10:48:00 mds0l210 kernel: LustreError:
>>>> 5934:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans())
>>>> terra-OST0000-osc-MDT0000: cannot cleanup orphans: rc = -22
>>>> Sep 25 10:48:00 mds0l210 kernel: LustreError:
>>>> 5934:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans()) Skipped
>>>> 599
>>>> previous similar messages
>>>> Sep 25 10:48:30 mds0l210 kernel: LustreError:
>>>> 6137:0:(fld_handler.c:256:fld_server_lookup()) srv-terra-MDT0000:
>>>> Cannot
>>>> find sequence 0x8: rc = -2
>>>> Sep 25 10:48:30 mds0l210 kernel: LustreError:
>>>> 6137:0:(fld_handler.c:256:fld_server_lookup()) Skipped 16593 previous
>>>> similar messages
>>>> Sep 25 10:58:01 mds0l210 kernel: LustreError:
>>>> 5934:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans())
>>>> terra-OST0000-osc-MDT0000: cannot cleanup orphans: rc = -22
>>>> Sep 25 10:58:01 mds0l210 kernel: LustreError:
>>>> 5934:0:(osp_precreate.c:903:osp_precreate_cleanup_orphans()) Skipped
>>>> 599
>>>> previous similar messages
>>>> Sep 25 10:58:57 mds0l210 kernel: LustreError:
>>>> 6137:0:(fld_handler.c:256:fld_server_lookup()) srv-terra-MDT0000:
>>>> Cannot
>>>> find sequence 0x8: rc = -2
>>>> Sep 25 10:58:57 mds0l210 kernel: LustreError:
>>>> 6137:0:(fld_handler.c:256:fld_server_lookup()) Skipped 40309 previous
>>>> similar messages
>>>>
>>>> Do these indicate that the process is not working?
>>>>
>>>> Regards,
>>>> Rodger
>>>>
>>>>
>>>>
>>>> On 23/09/2017 15:07, rodger wrote:
>>>>> Dear All,
>>>>>
>>>>> In the process of upgrading 1.8.x to 2.x I've messed up a number of
>>>>> the index values for OSTs by running tunefs.lustre with the --index value
>>>>> set. To compound matters while trying to get the OSTs to mount I
>>>>> erased the last_rcvd files on the OSTs. I'm looking for a way to
>>>>> confirm what the index should be for each device. Part of the reason
>>>>> for my difficulty is that in the evolution of the filesystem some OSTs
>>>>> were decommissioned and so the full set no longer has a sequential set
>>>>> of index values. In practicing for the upgrade the trial sets that I
>>>>> created did have nice neat sequential indexes and the process I
>>>>> developed broke when I used the real data. :-(
>>>>>
>>>>> The result is that although the lustre filesystem mounts and all
>>>>> directories appear to be listed, files in directories mostly have
>>>>> question marks for attributes and are not available for access. I'm
>>>>> assuming this is because the index for the OST holding the file is
>>>>> wrong.
>>>>>
>>>>> Any pointers to recovery would be much appreciated!
>>>>>
>>>>> Regards,
>>>>> Rodger
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
> 

