[lustre-discuss] Error on a zpool underlying an OST

Fri Mar 11 17:19:10 PST 2016

Hi, we have Lustre 2.7.58 in place on our OST and MDT/MGS (combined).  
Underlying the Lustre file system is a raid-z2 ZFS pool.

A few days ago, we lost 2 disks at once from the raid-z2.  I replaced 
one and a resilver started, but it seemed to choke.  So I swapped in 
replacements for both disks, and the new resilver now shows the following.

[root@umdist03 ~]# zpool status -v ost-007
   pool: ost-007
  state: DEGRADED
status: One or more devices has experienced an error resulting in data
         corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
         entire pool from backup.
    see: http://zfsonlinux.org/msg/ZFS-8000-8A
   scan: resilvered 972G in 9h25m with 1 errors on Fri Mar 11 19:12:37 2016
config:

         NAME                                  STATE     READ WRITE CKSUM
         ost-007                               DEGRADED     0 0     1
           raidz2-0                            DEGRADED     0 0     4
             replacing-0                       DEGRADED     0 0     0
               18280868502819750645            UNAVAIL      0 0     0  was /dev/disk/by-path/pci-0000:0c:00.0-scsi-0:2:20:0-part1/old
               pci-0000:0c:00.0-scsi-0:2:20:0  ONLINE       0 0     0
             pci-0000:0c:00.0-scsi-0:2:21:0    ONLINE       0 0     0
             pci-0000:0c:00.0-scsi-0:2:22:0    ONLINE       0 0     0
             pci-0000:0c:00.0-scsi-0:2:23:0    ONLINE       0 0     0
             pci-0000:0c:00.0-scsi-0:2:24:0    ONLINE       0 0     0
             pci-0000:0c:00.0-scsi-0:2:35:0    ONLINE       0 0     0
             pci-0000:0c:00.0-scsi-0:2:36:0    ONLINE       1 0     0
             pci-0000:0c:00.0-scsi-0:2:37:0    ONLINE       0 0     0
             pci-0000:0c:00.0-scsi-0:2:38:0    ONLINE       0 0     0
             replacing-9                       UNAVAIL      0 0     0
               14369532488179106769            UNAVAIL      0 0     0  was /dev/disk/by-path/pci-0000:0c:00.0-scsi-0:2:39:0-part1/old
               pci-0000:0c:00.0-scsi-0:2:39:0  ONLINE       0 0     0

errors: Permanent errors have been detected in the following files:

         ost-007/ost0030:<0x2c90f>

What are my options here?  If I don't care about the file, can I 
identify it and then just delete it?  Or is my only real option to drain 
the pool and rebuild it cleanly?
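
To make the "identify it" part concrete: the error entry above appears to be 
a dataset name followed by an object number in hex. The following is only a 
minimal sketch of splitting that entry apart, assuming the `dataset:<0xHEX>` 
form shown in the zpool output. The function name is made up for illustration, 
and the zdb command in the comment is an assumption about a possible next 
step, not something verified on this system.

```python
import re


def parse_zfs_error_entry(entry):
    """Split a 'dataset:<0xHEX>' error entry into (dataset, object_number).

    Hypothetical helper: it only handles the object-number form seen in
    the zpool status output above, not entries that list a file path.
    """
    match = re.fullmatch(r"\s*(\S+):<0x([0-9a-fA-F]+)>\s*", entry)
    if match is None:
        raise ValueError(f"not an object-style error entry: {entry!r}")
    # The object number is printed in hex; convert it to an integer.
    return match.group(1), int(match.group(2), 16)


dataset, objnum = parse_zfs_error_entry("ost-007/ost0030:<0x2c90f>")
print(dataset, objnum)

# Possible next step (assumption, untested here): dump that object's
# metadata to see which Lustre object it backs, e.g.
#   zdb -dddd ost-007/ost0030 182543
```

For the entry above this gives dataset ost-007/ost0030 and object number 
182543 in decimal.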

Thanks for any help/advice.

bob