<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div>Thank you all for your valuable information.</div><div>We survived, and about 1 million files survived. At first I wanted to get a recovery professional under our support contract, but it was not possible to get the right person at the right time.</div><div>So we had to do it on our own, roughly following the procedure Adrian mentioned. It still felt risky and we needed some good luck; I do not want to do this ever again.</div><div><br></div><div>For your information,</div><div>dd_rescue showed that about 4 MB near the end of the disk had bad sectors. It took about 20 hours to run on the 1 TB SATA disk; we ran it on an OSS whose load was relatively light.</div><div><br></div><div>After inserting the fresh disk into the original OSS in question (oss07), we found that mdadm with "-A --force" could assemble the array with some errors. Its state was "active, degraded, Not Started", and we had to use the following to start and resync it.</div><div>echo "clean" > /sys/block/md12/md/array_state   </div><div>I didn't know of any other method to start it.</div><div><br></div><div>The first try failed and two disks went faulty, maybe because at that time (we had a periodic maintenance) we rebooted the partner OSS node (oss08) to install the Lustre kernel (1.8.5) with the one-line raid5 fix that Kevin mentioned before.</div><div>For the next try, I installed the raid5-patched Lustre kernel on oss07 and power-cycled the JBOD (J4400) and oss07. This time the resync completed without any error, and running e2fsck found only 2 stale inodes.</div><div><br></div><div>Thank you also for the detailed explanation of why we need periodic scrubbing.</div><div><br></div><div><div><font class="Apple-style-span">Taeyoung Hong</font></div><div>Senior Researcher</div><div><font class="Apple-style-span">Supercomputing Center, KISTI
</font></div><div><div><br></div><div>On May 8, 2012, at 4:24 AM, Mark Hahn wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div><blockquote type="cite">I'd also recommend to start periodic scrubbing: We do this once per month<br></blockquote><blockquote type="cite">with low priority (~5MBPS) with little impact to the users.<br></blockquote><br>yes.  and if you think a rebuild might overstress marginal disks,<br>throttling via the dev.raid.speed_limit_max sysctl can help.<br>_______________________________________________<br>Lustre-discuss mailing list<br><a href="mailto:Lustre-discuss@lists.lustre.org">Lustre-discuss@lists.lustre.org</a><br>http://lists.lustre.org/mailman/listinfo/lustre-discuss<br></div></blockquote></div><br></div></body></html>