[Lustre-discuss] recovery from multiple disks failure on the same md

Tae Young Hong catchrye at gmail.com
Sun May 6 21:13:22 PDT 2012


Hi,

I've just found a terrible situation on our Lustre system.
An OST (RAID 6: 8+2, 1 spare) had 2 disk failures at almost the same time. While recovering from those, another disk failed, so the recovery procedure seems to have halted, and the spare disk that was resyncing fell back into "spare" status. (I guess the resync procedure was more than 95% finished.)
Right now we have just 7 active disks in this md. Is there any possibility of recovering from this situation?
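
To judge how stale each member is, I am going to compare the per-disk superblock event counters with read-only mdadm --examine calls, roughly like this (the device list just mirrors our members from the table below):

  # read-only: compare event counters and update times of all members
  for d in /dev/sd{l,m,n,o,p,q,r,s,t,u,w}; do
      echo "== $d"
      mdadm --examine $d | grep -E 'Events|Update Time'
  done

My assumption is that if the event count of the last disk to fail is close to that of the active disks, its data is only slightly stale.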


The following is the detailed log.
#1 the original configuration before any failure

     Number   Major   Minor   RaidDevice State
       0       8      176        0      active sync   /dev/sdl
       1       8      192        1      active sync   /dev/sdm
       2       8      208        2      active sync   /dev/sdn
       3       8      224        3      active sync   /dev/sdo
       4       8      240        4      active sync   /dev/sdp
       5      65        0        5      active sync   /dev/sdq
       6      65       16        6      active sync   /dev/sdr
       7      65       32        7      active sync   /dev/sds
       8      65       48        8      active sync   /dev/sdt
       9      65       96        9      active sync   /dev/sdw

      10      65       64        -      spare   /dev/sdu
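
(For reference, this is roughly how the array was created; I am reconstructing the command from the mdadm --detail output further below, so treat the exact flags as approximate:)

  # reconstruction, not the exact original command
  mdadm --create /dev/md12 --level=6 --metadata=0.90 \
        --raid-devices=10 --spare-devices=1 --chunk=128 \
        --bitmap=/mnt/scratch/bitmaps/ost02/bitmap \
        /dev/sd{l,m,n,o,p,q,r,s,t,w} /dev/sdu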

#2 a disk (sdl) failed, and a resync onto the spare disk (sdu) started
May  7 04:53:33 oss07 kernel: sd 1:0:10:0: SCSI error: return code = 0x08000002
May  7 04:53:33 oss07 kernel: sdl: Current: sense key: Medium Error
May  7 04:53:33 oss07 kernel:     Add. Sense: Unrecovered read error
May  7 04:53:33 oss07 kernel:
May  7 04:53:33 oss07 kernel: Info fld=0x74241ace
May  7 04:53:33 oss07 kernel: end_request: I/O error, dev sdl, sector 1948523214
... ...
May  7 04:54:15 oss07 kernel: RAID5 conf printout:
May  7 04:54:16 oss07 kernel:  --- rd:10 wd:9 fd:1
May  7 04:54:16 oss07 kernel:  disk 1, o:1, dev:sdm
May  7 04:54:16 oss07 kernel:  disk 2, o:1, dev:sdn
May  7 04:54:16 oss07 kernel:  disk 3, o:1, dev:sdo
May  7 04:54:16 oss07 kernel:  disk 4, o:1, dev:sdp
May  7 04:54:16 oss07 kernel:  disk 5, o:1, dev:sdq
May  7 04:54:16 oss07 kernel:  disk 6, o:1, dev:sdr
May  7 04:54:16 oss07 kernel:  disk 7, o:1, dev:sds
May  7 04:54:16 oss07 kernel:  disk 8, o:1, dev:sdt
May  7 04:54:16 oss07 kernel:  disk 9, o:1, dev:sdw
May  7 04:54:16 oss07 kernel: RAID5 conf printout:
May  7 04:54:16 oss07 kernel:  --- rd:10 wd:9 fd:1
May  7 04:54:16 oss07 kernel:  disk 0, o:1, dev:sdu
May  7 04:54:16 oss07 kernel:  disk 1, o:1, dev:sdm
May  7 04:54:16 oss07 kernel:  disk 2, o:1, dev:sdn
May  7 04:54:16 oss07 kernel:  disk 3, o:1, dev:sdo
May  7 04:54:16 oss07 kernel:  disk 4, o:1, dev:sdp
May  7 04:54:16 oss07 kernel:  disk 5, o:1, dev:sdq
May  7 04:54:16 oss07 kernel:  disk 6, o:1, dev:sdr
May  7 04:54:16 oss07 kernel:  disk 7, o:1, dev:sds
May  7 04:54:16 oss07 kernel:  disk 8, o:1, dev:sdt
May  7 04:54:16 oss07 kernel:  disk 9, o:1, dev:sdw
May  7 04:54:16 oss07 kernel: md: syncing RAID array md12
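
(For anyone following along: resync progress and speed can be watched in /proc/mdstat:)

  cat /proc/mdstat
  # or, refreshed periodically:
  watch -n 30 cat /proc/mdstat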


#3 another disk (sdp) failed
May  7 04:54:42 oss07 kernel: end_request: I/O error, dev sdp, sector 1949298688
May  7 04:54:42 oss07 kernel: mptbase: ioc1: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
May  7 04:54:42 oss07 last message repeated 3 times
May  7 04:54:42 oss07 kernel: raid5:md12: read error not correctable (sector 1949298688 on sdp).
May  7 04:54:42 oss07 kernel: raid5: Disk failure on sdp, disabling device. Operation continuing on 8 devices
May  7 04:54:43 oss07 kernel: end_request: I/O error, dev sdp, sector 1948532499
... ...
May  7 04:54:44 oss07 kernel: raid5:md12: read error not correctable (sector 1948532728 on sdp).
May  7 04:54:44 oss07 kernel: md: md12: sync done.
May  7 04:54:53 oss07 kernel: RAID5 conf printout:
May  7 04:54:53 oss07 kernel:  --- rd:10 wd:8 fd:2
May  7 04:54:53 oss07 kernel:  disk 0, o:1, dev:sdu
May  7 04:54:53 oss07 kernel:  disk 1, o:1, dev:sdm
May  7 04:54:53 oss07 kernel:  disk 2, o:1, dev:sdn
May  7 04:54:53 oss07 kernel:  disk 3, o:1, dev:sdo
May  7 04:54:53 oss07 kernel:  disk 4, o:0, dev:sdp
May  7 04:54:53 oss07 kernel:  disk 5, o:1, dev:sdq
May  7 04:54:53 oss07 kernel:  disk 6, o:1, dev:sdr
May  7 04:54:53 oss07 kernel:  disk 7, o:1, dev:sds
May  7 04:54:53 oss07 kernel:  disk 8, o:1, dev:sdt
May  7 04:54:53 oss07 kernel:  disk 9, o:1, dev:sdw
... ...
May  7 04:54:54 oss07 kernel: RAID5 conf printout:
May  7 04:54:54 oss07 kernel:  --- rd:10 wd:8 fd:2
May  7 04:54:54 oss07 kernel:  disk 1, o:1, dev:sdm
May  7 04:54:54 oss07 kernel:  disk 2, o:1, dev:sdn
May  7 04:54:54 oss07 kernel:  disk 3, o:1, dev:sdo
May  7 04:54:54 oss07 kernel:  disk 5, o:1, dev:sdq
May  7 04:54:54 oss07 kernel:  disk 6, o:1, dev:sdr
May  7 04:54:54 oss07 kernel:  disk 7, o:1, dev:sds
May  7 04:54:54 oss07 kernel:  disk 8, o:1, dev:sdt
May  7 04:54:54 oss07 kernel:  disk 9, o:1, dev:sdw
May  7 04:54:54 oss07 kernel: RAID5 conf printout:
May  7 04:54:54 oss07 kernel:  --- rd:10 wd:8 fd:2
May  7 04:54:54 oss07 kernel:  disk 0, o:1, dev:sdu
May  7 04:54:54 oss07 kernel:  disk 1, o:1, dev:sdm
May  7 04:54:54 oss07 kernel:  disk 2, o:1, dev:sdn
May  7 04:54:54 oss07 kernel:  disk 3, o:1, dev:sdo
May  7 04:54:54 oss07 kernel:  disk 5, o:1, dev:sdq
May  7 04:54:54 oss07 kernel:  disk 6, o:1, dev:sdr
May  7 04:54:54 oss07 kernel:  disk 7, o:1, dev:sds
May  7 04:54:55 oss07 kernel:  disk 8, o:1, dev:sdt
May  7 04:54:55 oss07 kernel:  disk 9, o:1, dev:sdw
May  7 04:54:55 oss07 kernel: md: syncing RAID array md12
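
(A new sync pass onto sdu started here; besides /proc/mdstat, progress can also be read from mdadm itself:)

  mdadm --detail /dev/md12 | grep -E 'State :|Rebuild Status'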

#4 the third disk (sdm) failed while resyncing
May  7 09:41:53 oss07 kernel: mptbase: ioc1: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
May  7 09:41:57 oss07 kernel: mptbase: ioc1: LogInfo(0x31110e00): Originator={PL}, Code={Reset}, SubCode(0x0e00)
May  7 09:41:59 oss07 last message repeated 24 times
May  7 09:42:04 oss07 kernel: mptbase: ioc1: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
May  7 09:42:34 oss07 last message repeated 43 times
May  7 09:42:34 oss07 kernel: sd 1:0:11:0: SCSI error: return code = 0x000b0000
May  7 09:42:34 oss07 kernel: end_request: I/O error, dev sdm, sector 1948444160
May  7 09:42:34 oss07 kernel: mptbase: ioc1: LogInfo(0x31080000): Originator={PL}, Code={SATA NCQ Fail All Commands After Error}, SubCode(0x0000)
May  7 09:42:34 oss07 last message repeated 3 times
May  7 09:42:34 oss07 kernel: raid5:md12: read error not correctable (sector 1948444160 on sdm).
May  7 09:42:34 oss07 kernel: raid5: Disk failure on sdm, disabling device. Operation continuing on 7 devices
May  7 09:42:34 oss07 kernel: raid5:md12: read error not correctable (sector 1948444168 on sdm).
May  7 09:42:34 oss07 kernel: raid5:md12: read error not correctable (sector 1948444176 on sdm).
... ...
May  7 09:42:49 oss07 kernel:  --- rd:10 wd:7 fd:3
May  7 09:42:49 oss07 kernel:  disk 0, o:1, dev:sdu
May  7 09:42:49 oss07 kernel:  disk 1, o:0, dev:sdm
May  7 09:42:49 oss07 kernel:  disk 2, o:1, dev:sdn
May  7 09:42:49 oss07 kernel:  disk 3, o:1, dev:sdo
May  7 09:42:49 oss07 kernel:  disk 5, o:1, dev:sdq
May  7 09:42:49 oss07 kernel:  disk 6, o:1, dev:sdr
May  7 09:42:49 oss07 kernel:  disk 7, o:1, dev:sds
May  7 09:42:49 oss07 kernel:  disk 8, o:1, dev:sdt
May  7 09:42:49 oss07 kernel:  disk 9, o:1, dev:sdw
... ...
May  7 09:42:58 oss07 kernel: RAID5 conf printout:
May  7 09:42:58 oss07 kernel:  --- rd:10 wd:7 fd:3
May  7 09:42:58 oss07 kernel:  disk 1, o:0, dev:sdm
May  7 09:42:58 oss07 kernel:  disk 2, o:1, dev:sdn
May  7 09:42:58 oss07 kernel:  disk 3, o:1, dev:sdo
May  7 09:42:58 oss07 kernel:  disk 5, o:1, dev:sdq
May  7 09:42:58 oss07 kernel:  disk 6, o:1, dev:sdr
May  7 09:42:58 oss07 kernel:  disk 7, o:1, dev:sds
May  7 09:42:58 oss07 kernel:  disk 8, o:1, dev:sdt
May  7 09:42:58 oss07 kernel:  disk 9, o:1, dev:sdw
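
What I am considering, unless someone warns me off, is stopping the array and force-assembling it from the 8 most up-to-date members, leaving out sdl and sdp (the first two failures) and including sdm, whose data should be only slightly stale. A sketch of what I mean (NOT executed yet; the bitmap path is our external bitmap file):

  mdadm --stop /dev/md12
  # RAID6 can start degraded with 8 of 10 members
  mdadm --assemble --force --run \
        --bitmap=/mnt/scratch/bitmaps/ost02/bitmap \
        /dev/md12 \
        /dev/sdm /dev/sdn /dev/sdo /dev/sdq \
        /dev/sdr /dev/sds /dev/sdt /dev/sdw

Is --force likely to work when sdm itself has unrecovered read errors?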


#5 current md status
[root at oss07 ~]# mdadm --detail /dev/md12
/dev/md12:
        Version : 00.90.03
  Creation Time : Mon Oct  4 15:30:53 2010
     Raid Level : raid6
     Array Size : 7814099968 (7452.11 GiB 8001.64 GB)
  Used Dev Size : 976762496 (931.51 GiB 1000.20 GB)
   Raid Devices : 10
  Total Devices : 11
Preferred Minor : 12
    Persistence : Superblock is persistent

  Intent Bitmap : /mnt/scratch/bitmaps/ost02/bitmap

    Update Time : Mon May  7 11:38:51 2012
          State : clean, degraded
 Active Devices : 7
Working Devices : 8
 Failed Devices : 3
  Spare Devices : 1

     Chunk Size : 128K

           UUID : 63eb5b15:294c1354:f0c167bd:f8e81f47
         Events : 0.7382

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       0        0        1      removed
       2       8      208        2      active sync   /dev/sdn
       3       8      224        3      active sync   /dev/sdo
       4       0        0        4      removed
       5      65        0        5      active sync   /dev/sdq
       6      65       16        6      active sync   /dev/sdr
       7      65       32        7      active sync   /dev/sds
       8      65       48        8      active sync   /dev/sdt
       9      65       96        9      active sync   /dev/sdw

      10       8      176        -      faulty spare   /dev/sdl
      11      65       64        -      spare   /dev/sdu
      12       8      240        -      faulty spare   /dev/sdp
      13       8      192        -      faulty spare   /dev/sdm
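
If a forced assembly does bring md12 back, my plan (again, not executed yet) would be to re-add sdu as a spare so one of the missing slots rebuilds, and to run a read-only e2fsck on the ldiskfs OST before remounting it:

  # only after a successful forced assembly
  mdadm /dev/md12 --add /dev/sdu    # rebuild onto the spare restarts
  e2fsck -fn /dev/md12              # read-only check of the ldiskfs OST

Any advice would be much appreciated.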


Best regards,

Taeyoung Hong
Senior Researcher
Supercomputing Center of KISTI 