[Lustre-discuss] Recovery from Hardware Failure

Joe Digilio jgd-lustre at metajoe.com
Mon Feb 7 13:16:06 PST 2011


Last week we experienced a major hardware failure (disk controller)
that brought down our system hard.  Now that I have the replacement
controller, I want to make sure I recover correctly.  Below is the
procedure I plan to follow based on what I've gathered from the
Operations Manual.

Any comments?
Do I need to create the mds/ost DBs AFTER ll_recover_lost_found_objs?

Thanks!
-Joe


###MDT Recovery
# Capture fs state before doing anything
e2fsck -vfn /dev/$MDTDEV
# "safe" repair
e2fsck -vfp /dev/$MDTDEV
# Verify no more problems and generate mdsdb
e2fsck -vfn --mdsdb /tmp/mdsdb /dev/$MDTDEV

###OST Recovery
foreach OST
    # Capture fs state before doing anything
    e2fsck -vfn /dev/$OSTDEV
    # "safe" repair
    e2fsck -vfp /dev/$OSTDEV
    # Verify no more problems and generate ostdb
    e2fsck -vfn --mdsdb /tmp/mdsdb --ostdb /tmp/ostXdb /dev/$OSTDEV
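
The per-OST loop above could be scripted roughly as follows. This is
only a sketch: the OST_DEVS list, the run helper, and the numbering of
the /tmp/ostNdb files are assumptions made for illustration and are not
taken from the manual. It defaults to a dry run that prints each
command rather than executing it.

```shell
#!/bin/sh
# Sketch only. OST_DEVS and the /tmp/ost${i}db naming are assumptions.
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}
OST_DEVS=${OST_DEVS:-"sdb sdc"}   # hypothetical device names

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

ost_fsck_plan() {
    i=1
    for dev in $OST_DEVS; do
        # Capture fs state before doing anything
        run e2fsck -vfn "/dev/$dev"
        # "safe" repair
        run e2fsck -vfp "/dev/$dev"
        # Verify and write this OST's database
        run e2fsck -vfn --mdsdb /tmp/mdsdb --ostdb "/tmp/ost${i}db" "/dev/$dev"
        i=$((i + 1))
    done
}

ost_fsck_plan
```

Reviewing the dry-run output before setting DRY_RUN=0 gives a chance to
confirm that each device is paired with the intended ostdb file.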

### Recover lost+found Objects
foreach OST
    mount -t ldiskfs /dev/$OSTDEV /mnt/ost
    ll_recover_lost_found_objs -v -d /mnt/ost/lost+found
    # Unmount before moving on to the next OST
    umount /mnt/ost
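
As a rough sketch, the lost+found pass could be looped like this. The
OST_DEVS list, the MNT variable, and the run helper are assumptions
for illustration, and the umount is added so the same mount point can
be reused for each OST. It defaults to a dry run that only prints the
commands.

```shell
#!/bin/sh
# Sketch only. OST_DEVS and MNT are assumptions.
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}
OST_DEVS=${OST_DEVS:-"sdb sdc"}   # hypothetical device names
MNT=${MNT:-/mnt/ost}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

recover_lost_found() {
    for dev in $OST_DEVS; do
        # Stop if the mount fails so nothing runs against an empty directory
        run mount -t ldiskfs "/dev/$dev" "$MNT" || return 1
        run ll_recover_lost_found_objs -v -d "$MNT/lost+found"
        # Free the mount point for the next OST
        run umount "$MNT" || return 1
    done
}

recover_lost_found
```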

### Coherency Check
lfsck -n -v --mdsdb /tmp/mdsdb \
    --ostdb /tmp/ost1db /tmp/ost2db ... /tmp/ostNdb /lustre


