[Lustre-discuss] recovery from multiple disks failure on the same md
Adrian Ulrich
adrian at blinkenlights.ch
Mon May 7 12:17:38 PDT 2012
Hi,
> A OST (raid 6: 8+2, spare 1) had 2 disk failures almost at the same time. While recovering it, another disk failed. so recovering procedure seems to be halt,
So did the md array stop itself on the third disk failure (or at least turn read-only)?
If it did, you might be able to get it running again without catastrophic corruption.
This is what I would try (without any warranty!):
-> Forget about the 2 syncing spares
-> Take the third failed disk and attach it to some PC
-> Copy as much data as possible to a new spare using dd_rescue
(-r might help)
-> Put the drive with the fresh copy (= the good, new drive) into the array and assemble + start it.
Use --force if mdadm complains about outdated metadata.
(and starting it as 'readonly' for now would also be a good idea)
-> Add a new spare to the array and sync it as fast as possible to get at least 1 parity disk.
-> Run 'fsck -n /dev/mdX' to see how badly damaged your filesystem is.
If you think that fsck can fix the errors (and will not cause more damage), run it without '-n'.
-> Add the 2nd parity disk, sync it, mount the filesystem and pray.
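The steps above can be written down as a dry-run plan before anything is touched. The following is only a sketch: it builds the command lines as strings and executes nothing. All device names are hypothetical, and the 'mdadm --readwrite' step is an assumption of this sketch (an array started read-only has to be switched back before a spare can sync), not something stated in the steps themselves.

```python
import shlex


def recovery_plan(md, survivors, failing, fresh, spares, force=False, reverse=False):
    """Return the commands for the steps above as strings. Nothing is executed."""
    first_spare, second_spare = spares

    # Copy as much as possible from the failing disk onto the fresh drive.
    # '-r' is the dd_rescue option mentioned above (reverse-direction copy).
    copy = ["dd_rescue"] + (["-r"] if reverse else []) + [failing, fresh]

    # Assemble read-only from the surviving members plus the fresh copy.
    assemble = ["mdadm", "--assemble", "--readonly"]
    if force:
        # Only if mdadm complains about outdated metadata.
        assemble.append("--force")
    assemble += [md] + list(survivors) + [fresh]

    steps = [
        copy,
        assemble,
        ["mdadm", "--readwrite", md],         # assumption: needed before syncing
        ["mdadm", md, "--add", first_spare],  # first parity disk
        ["fsck", "-n", md],                   # look only, change nothing
        ["mdadm", md, "--add", second_spare], # second parity disk
    ]
    return [shlex.join(step) for step in steps]


# Hypothetical layout: 7 surviving members of the 8+2 array, one failing disk
# (/dev/sdz) copied to a fresh drive (/dev/sdi), and two new spares.
survivors = [f"/dev/sd{c}" for c in "bcdefgh"]
plan = recovery_plan("/dev/md0", survivors, "/dev/sdz", "/dev/sdi",
                     ["/dev/sdj", "/dev/sdk"], force=True)
for command in plan:
    print(command)
```

Reading the printed list next to the output of 'mdadm --examine' for each member is a cheap sanity check before running any of it by hand.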
The amount of data corruption will depend on how much dd_rescue managed to read: you are probably lucky if it failed to read only a few sectors.
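To put a rough number on that, here is a back-of-envelope sketch. It assumes the array is assembled with exactly 8 of its 10 members (so no redundancy is left), and that every sector dd_rescue could not read also spoils the same offset on the two members that have to be reconstructed. The sector size and the function name are assumptions of this sketch; the result is an upper bound, since some of the affected chunks hold parity rather than data.

```python
def worst_case_lost_bytes(unread_sectors, sector_bytes=512, missing_members=2):
    """Upper bound on array bytes made unreliable by unread sectors on the copy.

    Each unread sector is bad on the copied member itself and makes the same
    offset on every reconstructed (missing) member unreliable as well.
    """
    return unread_sectors * sector_bytes * (1 + missing_members)


# Example: dd_rescue could not read 10 sectors of 512 bytes.
print(worst_case_lost_bytes(10))
```

A few kilobytes of damage in the wrong place can still hurt, which is why the 'fsck -n' pass comes before any repair attempt.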
And I agree with Kevin:
If you have a support contract: ask them to fix it.
(..and if you have enough hardware + time: create a backup of ALL drives in the failed raid via 'dd' before touching anything!)
I'd also recommend starting periodic scrubbing: we do this once per month at low priority (~5 MB/s), with little impact on the users.
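How long such a throttled scrub takes is easy to estimate. The sketch below assumes the speed limit applies per member device and that all members are read in parallel, so the duration depends on the size of one member, not of the whole array. The 2 TB member size is a made-up example. The second function shows one way to start a check through the md sysfs files; it is only defined here, not called, and it is an assumption of this sketch rather than a description of our setup.

```python
def scrub_days(member_bytes, rate_bytes_per_s=5_000_000):
    """Days needed to read one member end to end at the given rate."""
    return member_bytes / rate_bytes_per_s / 86400


def start_check(md_name, max_kib_per_s):
    """Cap the sync speed of one array and start a scrub (needs root).

    md_name is the bare device name, e.g. 'md0'. The limit is in KiB/s.
    """
    base = f"/sys/block/{md_name}/md"
    with open(f"{base}/sync_speed_max", "w") as f:
        f.write(str(max_kib_per_s))
    with open(f"{base}/sync_action", "w") as f:
        f.write("check")


# Hypothetical 2 TB member scrubbed at ~5 MB/s.
print(round(scrub_days(2_000_000_000_000), 1), "days")
```

If the estimate gets close to the scrub interval itself, the rate or the interval has to change.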
Regards and good luck,
Adrian