[Lustre-discuss] recovery from multiple disks failure on the same md

Mon May 7 12:17:38 PDT 2012

Hi,

> An OST (raid 6: 8+2, spare 1) had 2 disk failures almost at the same time. While recovering it, another disk failed, so the recovery procedure seems to have halted,

So did the md array stop itself on the 3rd disk failure (or at least turn read-only)?

If it did, you might be able to get it running again without catastrophic corruption.
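
One way to check this, as a minimal sketch: the kernel exposes the array state in sysfs. The helper name md_state and the md0 path are assumptions for illustration only (use your own device); 'cat /proc/mdstat' and 'mdadm --detail /dev/mdX' show similar information.

```shell
#!/bin/sh
# md_state is a hypothetical helper: it prints the state of an md array
# given its sysfs directory (on a real system something like
# /sys/block/md0/md; md0 is an assumed device name).
md_state() {
    md_dir=$1
    if [ -r "$md_dir/array_state" ]; then
        # Typical values include: inactive, readonly, clean, active
        cat "$md_dir/array_state"
    else
        echo "unknown"
    fi
}

# Example (only reads, does not change anything):
#   md_state /sys/block/md0/md
```

A value of "inactive" or "readonly" would suggest that md stopped accepting writes after the 3rd failure.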

This is what I would try (without any warranty!):

 -> Forget about the 2 syncing spares

 -> Take the 3rd failed disk and attach it to some PC

 -> Copy as much data as possible to a new spare using dd_rescue
    (-r might help)

 -> Put the drive with the fresh copy (= the good, new drive) into the array and assemble + start it.
    Use --force if mdadm complains about outdated metadata.
    (and starting it as 'readonly' for now would also be a good idea)

 -> Add a new spare to the array and sync it as fast as possible to get at least 1 parity disk.

 -> Run 'fsck -n /dev/mdX' to see how badly damaged your filesystem is.
    If you think that fsck can fix the errors (and will not cause more damage), run it without '-n'.

 -> Add the 2nd parity disk, sync it, mount the filesystem and pray.
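
The steps above could look roughly like this. It is a dry-run sketch that only prints the commands, so nothing is touched until you have reviewed them. All device names (md0, sdX, sdY, ...) and the run helper are placeholders I made up, not taken from the failed system.

```shell
#!/bin/sh
# Dry-run sketch of the recovery steps: it only PRINTS the commands.
# Every device name below is a placeholder (assumption).
MD=/dev/md0                 # the failed array
OLD=/dev/sdX                # 3rd failed disk, attached to a rescue machine
NEW=/dev/sdY                # fresh disk that receives the copy
GOOD="/dev/sdb /dev/sdc"    # surviving members (list all of yours here)
SPARE1=/dev/sdZ             # first new spare
SPARE2=/dev/sdW             # second new spare

run() { echo "+ $*"; }      # hypothetical helper; use "$@" as the body to execute

recovery_plan() {
    # Copy as much as possible off the failing disk
    # (adding -r makes dd_rescue copy in the reverse direction)
    run dd_rescue "$OLD" "$NEW"
    # Assemble read-only with the copy; --force accepts outdated metadata
    run mdadm --assemble --force --readonly "$MD" $GOOD "$NEW"
    run mdadm --detail "$MD"
    # A read-only array does not start a resync, so switch to read-write
    run mdadm --readwrite "$MD"
    run mdadm "$MD" --add "$SPARE1"
    # Check the filesystem without modifying it
    run fsck -n "$MD"
    run mdadm "$MD" --add "$SPARE2"
}

recovery_plan
```

Watch /proc/mdstat between the steps and wait for each sync to finish before moving on.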

The amount of data corruption will depend on how successful dd_rescue is: you are probably lucky if it only fails to read a few sectors.

And I agree with Kevin:

If you have a support contract: ask them to fix it.
(...and if you have enough hardware + time: create a backup of ALL drives in the failed raid via 'dd' before touching anything!)
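
A minimal sketch of such a backup, again as a dry run that only prints the dd commands. The helper name backup_cmds, the /backup target and the disk names are assumptions; the target needs enough free space for one image per disk.

```shell
#!/bin/sh
# backup_cmds is a hypothetical helper: it prints one dd command per
# member disk. First argument is the target directory, the rest are disks.
backup_cmds() {
    dest=$1; shift
    for disk in "$@"; do
        name=$(basename "$disk")
        # conv=noerror,sync continues after read errors and pads the
        # unreadable blocks with zeros so that offsets stay aligned
        echo "dd if=$disk of=$dest/$name.img bs=64k conv=noerror,sync"
    done
}

# Placeholder target and disks; list every member of the failed raid
backup_cmds /backup /dev/sda /dev/sdb
```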

I'd also recommend starting periodic scrubbing: we do this once per month at low priority (~5 MB/s), with little impact on the users.
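
As a rough sketch (not our exact setup): a scrub can be requested through the md sysfs interface. The helper name start_scrub and the md0 path are assumptions; sync_speed_max is given in KiB/s, so 5000 is roughly 5 MB/s.

```shell
#!/bin/sh
# start_scrub is a hypothetical helper. It takes the array's sysfs
# directory (e.g. /sys/block/md0/md, an assumed device name) and a
# speed cap in KiB/s, then requests a consistency check of the array.
start_scrub() {
    md_dir=$1
    limit_kib=$2
    echo "$limit_kib" > "$md_dir/sync_speed_max"   # cap the check rate
    echo check > "$md_dir/sync_action"             # start the scrub
}

# Example (needs root):
#   start_scrub /sys/block/md0/md 5000
# After the check, mismatch_cnt in the same directory reports how many
# inconsistent sectors were found.
```

Calling something like this from a monthly cron job would give the periodic scrub.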

Regards and good luck,
 Adrian