[Lustre-discuss] recovery from multiple disks failure on the same md
Tae Young Hong
catchrye at gmail.com
Thu May 10 02:24:50 PDT 2012
Thank you all for your valuable information.
We survived, and about 1 million files survived. At first I wanted to get a recovery professional under our support contract, but it was not possible to get the right person at the right time.
So we had to do it on our own, roughly following the procedure Adrian mentioned. It still felt risky and we needed good luck; I do not want to do this ever again.
For your information,
dd_rescue showed that about 4 MB near the end of the disk had bad sectors. It took about 20 hours to run on the 1 TB SATA disk; we ran it on an OSS whose load was relatively small.
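For anyone repeating the imaging step, here is a minimal sketch of how it could look. The device names and log path are hypothetical placeholders, not the ones we used, and the script only prints the dd_rescue command instead of running it. It also works out the average copy rate implied by the figures above (1 TB in about 20 hours).

```shell
#!/bin/sh
# Average copy rate in MB/s for a given size in bytes and duration in hours.
avg_rate_mb_s() {
    bytes=$1
    hours=$2
    echo $(( bytes / (hours * 3600) / 1000000 ))
}

# Print (do not run) a dd_rescue command that copies a failing disk to a
# fresh one and logs errors to a file. Arguments: source, target, log file.
# All three are placeholders supplied by the caller.
rescue_cmd() {
    echo "dd_rescue -l $3 $1 $2"
}

# 1 TB (decimal) in 20 hours
avg_rate_mb_s 1000000000000 20

# /dev/sdX, /dev/sdY and the log path are hypothetical
rescue_cmd /dev/sdX /dev/sdY /root/sdX-rescue.log
```

At roughly 13 MB/s the copy is far below the sequential speed of a healthy SATA disk, so a run of this length should be planned for when a disk is already marginal.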
After inserting the fresh disk into the original OSS (oss07) in question, we found that mdadm with "-A --force" could assemble the array with some errors. Its state was "active, degraded, Not Started", and we had to use the following to start and resync it.
echo "clean" > /sys/block/md12/md/array_state
I didn't know any other method to start it.
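The sequence above can be sketched roughly as follows. This is only an outline under assumptions: the member device names are hypothetical placeholders, the script defaults to a dry run that prints each command, and the final "mdadm --run" step is a commonly documented alternative for starting a partially assembled array that we did not try, so I cannot say whether it would have worked in our case.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}
MD=md12   # array name from the post

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

start_degraded_array() {
    # Force-assemble from the given member devices (placeholders).
    run mdadm --assemble --force "/dev/$MD" "$@"
    # Inspect the resulting state before going further.
    run cat /proc/mdstat
    # What we actually did: mark the array clean so it starts and resyncs.
    run sh -c "echo clean > /sys/block/$MD/md/array_state"
}

start_with_mdadm_run() {
    # Untested alternative (assumption): ask mdadm to start the
    # partially assembled array in degraded mode.
    run mdadm --run "/dev/$MD"
}

# Hypothetical member devices, dry run only
start_degraded_array /dev/sdX1 /dev/sdY1 /dev/sdZ1
```

Keeping the dry run as the default makes it possible to review the exact commands before touching an array that is one failure away from data loss.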
The first try failed and two disks went faulty, maybe because at that time (we had periodic maintenance) we rebooted the partner OSS node (oss08) to install the Lustre kernel (1.8.5) patched with the raid5 one-line fix that Kevin mentioned before.
For the next try, I installed the raid5-patched Lustre kernel on oss07 and power-cycled the JBOD (J4400) and oss07. This time the resync completed without any error, and e2fsck found only 2 stale inodes.
Thank you also for the detailed explanation of why we need periodic scrubbing.
Supercomputing Center, KISTI
On May 8, 2012, at 4:24 AM, Mark Hahn wrote:
>> I'd also recommend to start periodic scrubbing: We do this once per month
>> with low priority (~5MBPS) with little impact to the users.
> yes. and if you think a rebuild might overstress marginal disks,
> throttling via the dev.raid.speed_limit_max sysctl can help.
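A throttled scrub along the lines suggested above could be sketched like this. The array name md12 and the 5 MB/s limit are taken from this thread; the helper names are made up for illustration, and the script defaults to a dry run that only prints the commands. It assumes the standard Linux md sysfs interface (sync_action) and that dev.raid.speed_limit_max is expressed in KB/s.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Convert MB/s to the KB/s unit used by dev.raid.speed_limit_max.
mb_to_kb() {
    echo $(( $1 * 1024 ))
}

# Cap the md sync speed, then request a read-only consistency check.
# Arguments: md device name (e.g. md12), speed limit in MB/s.
scrub() {
    md=$1
    limit_mb=$2
    run sysctl -w "dev.raid.speed_limit_max=$(mb_to_kb "$limit_mb")"
    run sh -c "echo check > /sys/block/$md/md/sync_action"
    # Once the check has finished, /sys/block/$md/md/mismatch_cnt
    # reports how many inconsistencies were found.
}

scrub md12 5
```

Run from cron once a month, this would match the schedule mentioned above; note that the sysctl applies to all md arrays on the node, so the limit should be restored afterwards if rebuilds normally need more bandwidth.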