[lustre-discuss] LustreError on ZFS volumes

Jesse Stroik jesse.stroik at ssec.wisc.edu
Tue Dec 13 11:15:28 PST 2016


We discussed a course of action this morning and decided that we'd start 
by migrating the files off the OST. Testing suggests that files which 
cannot be completely read will be left behind on OST0002.
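
Roughly what we have in mind for the drain (the filesystem name "lustre", 
the client mount point /mnt/lustre, and the OST UUID below are placeholders 
for our actual values):

On the MDS, to stop new object allocation on the damaged OST:

lctl set_param osp.lustre-OST0002-osc-MDT0000.max_create_count=0

Then, from a client, to push every file with an object on OST0002 onto the 
remaining OSTs:

lfs find /mnt/lustre --obd lustre-OST0002_UUID -type f | lfs_migrate -y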

Due to the nature of the corruption (a faulty hardware RAID controller), 
it seems unlikely we'll be able to meaningfully save any files that were 
corrupted. This is something we may evaluate more closely once the 
lfs_migrate is complete and we have our file list.

We'll then share the list of corrupted files with our users and find out 
the cost of the lost data. If it's reasonably reproducible, we'll 
reinitialize the RAID array and reformat the vdev.

Thanks for your help, Tom!

Best,
Jesse Stroik



On 12/12/2016 03:51 PM, Crowe, Tom wrote:
> Hi Jesse,
>
> Regarding your seeing 370 objects with errors from ‘zpool status’, but having over 400 files with “access issues”: I would suggest running a ‘zpool scrub’ to identify all of the ZFS objects in the pool that are reporting permanent errors.
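>
> For example (pool_name being a placeholder):
>
> zpool scrub pool_name
> zpool status -v pool_name
>
> Once the scrub finishes, ‘zpool status -v’ should list the complete set of permanent errors at the bottom of its output.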
>
> It would be very important to have a complete list of the files with issues before replicating the VDEV(s) in question.
>
> You may also want to dump the zdb information for the source VDEV(s) with the following:
>
> zdb -dddddd source_pool/source_vdev > /some/where/with/room
>
> For example, if the zpool was named pool-01, and the VDEV was named lustre-0001 and you had free space in a filesystem named /home:
>
> zdb -dddddd pool-01/lustre-0001 > /home/zdb_pool-01_0001_20161212.out
>
> There is a great wealth of data zdb can share about your files. Having the output may prove helpful down the road.
>
> Thanks,
> Tom
>
>> On Dec 12, 2016, at 4:39 PM, Jesse Stroik <jesse.stroik at ssec.wisc.edu> wrote:
>>
>> Thanks for taking the time to respond, Tom,
>>
>>
>>> For clarification, it sounds like you are using hardware-based RAID-6, and not ZFS RAID? Is this correct? Or was the faulty card simply an HBA?
>>
>>
>> You are correct. This particular file system is still using hardware RAID6.
>>
>>
>>> At the bottom of the ‘zpool status -v pool_name’ output, you may see paths and/or zfs object ID’s of the damaged/impacted files. This would be good to take note of.
>>
>>
>> Yes, I output this to files at a few different times, and we've seen no change since replacing the RAID controller, which makes me feel reasonably comfortable leaving the file system in production.
>>
>> There are 370 objects listed by zpool status -v but I am unable to access at least 400 files. Almost all of our files are single stripe.
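>>
>> One way we could cross-check (the mount point and OST UUID here are placeholders for ours):
>>
>> lfs find /mnt/lustre --obd lustre-OST0002_UUID -type f | while read f; do
>>     dd if="$f" of=/dev/null bs=1M 2>/dev/null || echo "unreadable: $f"
>> done > unreadable_on_ost0002.txt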
>>
>>
>>> Running a ‘zpool scrub’ is a good idea. If the zpool is protected with "ZFS raid", the scrub may be able to repair some of the damage. If the zpool is not protected with "ZFS raid", the scrub will identify any other errors, but likely NOT repair any of the damage.
>>
>>
>> We're not protected with ZFS RAID, just hardware RAID6. I could run a patrol on the hardware controller and then a ZFS scrub if that makes the most sense at this point. This file system is scheduled to run a scrub the third week of every month, so it would run one this weekend otherwise.
>>
>>
>>
>>> If you have enough disk space on hardware that is behaving properly (and free space in the source zpool), you may want to replicate the VDEV’s (OST) that are reporting errors. Having a replicated VDEV can afford you the ability to examine the data without fear of further damage. You may also want to extract certain files from the replicated VDEV(s) which are producing IO errors on the source VDEV.
>>>
>>> Something like this for replication should work:
>>>
>>> zfs snap source_pool/source_ost@timestamp_label
>>> zfs send -Rv source_pool/source_ost@timestamp_label | zfs receive destination_pool/source_ost_replicated
>>>
>>> You will need to set zfs_send_corrupt_data to 1 in /sys/module/zfs/parameters or the ‘zfs send’ will error and fail when sending a VDEV with read and/or checksum errors.
>>> Enabling zfs_send_corrupt_data allows the zfs send operation to complete. Any blocks that are damaged on the source side will be filled with the pattern “0x2f5baddb10c” on the destination side. This can be helpful for troubleshooting whether an entire file is corrupt or only parts of it.
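>>>
>>> For example, turning it on before the send and back off once the send has finished:
>>>
>>> echo 1 > /sys/module/zfs/parameters/zfs_send_corrupt_data
>>> # ... run the zfs send | zfs receive from above ...
>>> echo 0 > /sys/module/zfs/parameters/zfs_send_corrupt_data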
>>>
>>> After the replication, you should set the replicated VDEV to read only with ‘zfs set readonly=on destination_pool/source_ost_replicated’
>>>
>>
>> Thank you for this suggestion. We'll most likely do that.
>>
>> Best,
>> Jesse Stroik
>>
>
