[lustre-discuss] 2.1 MDT recovery on test hardware

Crowe, Tom thcrowe at iu.edu
Thu Jun 25 15:48:47 PDT 2015


Hi Chris,

Thanks for the response. Hopefully I can explain the scenario in greater detail and shed some light on this. 

The MDT is scheduled for replacement with newer, faster hardware. Our dd-based backups have historically taken over 30 hours. We use LVM snapshots to avoid a terribly long outage, so the backup is actually of the LVM snapshot, and the filesystem stays up during the “point in time” dd backup.

My thought was to restore the dd backup to test gear, wire up a new OST, mount a client, and put a small load on the test gear. I would then use LVM’s pvmove to migrate the block devices of the MDT while the filesystem is up and the client continues to churn data. Basically, simulate the MDT migration via pvmove on test gear.
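
A minimal sketch of the pvmove sequence described above, for reference. The volume group and device names are hypothetical placeholders, and the `run` helper only prints each command unless DRY_RUN is set to 0, so nothing here touches real storage by default.

```shell
#!/bin/sh
# Sketch only: VG and device names below are hypothetical placeholders.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# migrate_mdt_pv VG OLD_PV NEW_PV
migrate_mdt_pv() {
    vg="$1"; old_pv="$2"; new_pv="$3"
    # Label the new device as an LVM physical volume.
    run pvcreate "$new_pv" || return 1
    # Add it to the volume group that holds the MDT logical volume.
    run vgextend "$vg" "$new_pv" || return 1
    # Move all allocated extents off the old PV while the LV stays in use.
    run pvmove "$old_pv" "$new_pv" || return 1
    # Drop the now-empty old PV from the volume group.
    run vgreduce "$vg" "$old_pv" || return 1
    # Remove the LVM label from the old device.
    run pvremove "$old_pv"
}

migrate_mdt_pv vg_mdt /dev/old_pv /dev/new_pv
```

Running the script as-is prints the five commands in order; setting DRY_RUN=0 would execute them, which should only be done on the test gear.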

The goal is to accomplish the MDT migration while avoiding the estimated 40-hour outage of a dd backup/restore.

I have reviewed and followed the Lustre 2.x manual, specifically section 14.5, “Changing a Server NID”, and all seems well except the client mount. I receive the following error when attempting to mount a client:

mount.lustre: mount 10.10.0.173@o2ib:/lustre at /lustre/client1 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)
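
For context, the server-side part of a NID change on a pre-2.4 release might be sketched as below. This is an assumption-laden outline, not the exact commands that were run: the device path and mount point are hypothetical, the target must be unmounted (with all clients and other servers stopped) before tunefs.lustre is used, and whether --mgsnode is needed at all depends on whether the MGS is a separate target, which is not stated here. Note that --erase-params clears all previously stored parameters, so check the flags against tunefs.lustre(8) for your release. The `run` helper only prints commands unless DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: device path and mount point are hypothetical placeholders.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# regen_mdt_config MGS_NID MDT_DEV MDT_MNT
regen_mdt_config() {
    mgs_nid="$1"; mdt_dev="$2"; mdt_mnt="$3"
    # Show the stored configuration (fsname, parameters) without writing anything.
    run tunefs.lustre --dryrun "$mdt_dev" || return 1
    # Record the new MGS NID and flag the configuration logs for regeneration.
    # Assumption: erasing all old parameters is acceptable on the test gear.
    run tunefs.lustre --erase-params --mgsnode="$mgs_nid" --writeconf "$mdt_dev" || return 1
    # Remount the target so it registers with the MGS again.
    run mount -t lustre "$mdt_dev" "$mdt_mnt"
}

regen_mdt_config 10.10.0.173@o2ib /dev/vg_mdt/mdt /mnt/mdt
```

The --dryrun output is the quickest way to confirm that the filesystem name stored on the restored MDT matches the name used in the client mount command.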

The client can lctl ping the MGS/MDT (same node) without any issue, and the MGS/MDT can lctl ping the client as well.
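
The client-side checks above can be sketched as follows, using the NID, filesystem name, and mount point from the error message. The function name and the `run` helper are illustrative only; by default the commands are printed rather than executed. Keep in mind that a successful lctl ping shows LNET reachability only; it does not by itself confirm that the MGS holds a client configuration log for that filesystem name.

```shell
#!/bin/sh
# Sketch only: DRY_RUN defaults to 1, so commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# check_client_mount MGS_NID FSNAME MOUNTPOINT
check_client_mount() {
    mgs_nid="$1"; fsname="$2"; mnt="$3"
    # Show the NIDs this client has configured (should be on the same LNET network).
    run lctl list_nids || return 1
    # Verify LNET connectivity to the MGS.
    run lctl ping "$mgs_nid" || return 1
    # Attempt the client mount: <MGS NID>:/<fsname> <mount point>.
    run mount -t lustre "${mgs_nid}:/${fsname}" "$mnt"
}

check_client_mount 10.10.0.173@o2ib lustre /lustre/client1
```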

The overall purpose of the test is to validate the LVM pvmove process. I have migrated many other block-based storage devices with a similar LVM procedure, but never a Lustre MDT.

Ultimately, I may need to lobby for an extended outage and simply incur the 40-hour downtime for the dd backup/restore. But I would really like to understand why the client won’t mount. The test has turned into a mini recovery exercise, which I feel is not complete unless a client can access the filesystem.

Thank you for your comments. I am open to suggestions.

-Tom

  
> On Jun 25, 2015, at 5:13 PM, Christopher J. Morrone <morrone2 at llnl.gov> wrote:
> 
> I think the major problem is going to be that your MDT image is not terribly useful without the OSTs that belong to the MDT.  The new OSTs don't contain any of the objects that the MDT references.
> 
> Back in the old Lustre 2.1 code you won't have any of the lfsck code that can deal with MDT-to-OST inconsistency problems.  And even if you did, every file in your filesystem would be removed or moved to lost+found.
> 
> It is not immediately clear to me what amount of useful testing could be done under that situation.  Maybe there is something.
> 
> Chris
> 
> On 06/25/2015 01:06 PM, Crowe, Tom wrote:
>> Greetings,
>> 
>> I am investigating the possibility of restoring a DD backup of our MDT,
>> onto test hardware. Our filesystem is 2.1 based.
>> 
>> The general idea would be to get the MDT/MGS restored in their entirety,
>> change the MGSNODE parameter on the MDT to reflect the test hardware
>> LNET setup, add some new OSTs, have clients mount the new setup, and
>> proceed with our testing.
>> 
>> Is there a procedure that outlines this process? I suspect the exercise
>> could be considered a disaster recovery test, but we do not have any
>> intention at this time to relocate and/or recover any of the original OSTs.
>> 
>> Thank You.
>> 
>> -Tom Crowe
>> 
>> 
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> 