[lustre-discuss] LustreError on ZFS volumes

Crowe, Tom thcrowe at iu.edu
Mon Dec 12 13:28:59 PST 2016


Hi Jesse,

For clarification, it sounds like you are using hardware-based RAID-6 rather than ZFS RAID (raidz). Is that correct, or was the faulty card simply an HBA?

At the bottom of the ‘zpool status -v pool_name’ output, you may see the paths and/or ZFS object IDs of the damaged/impacted files. It is worth taking note of these.
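
For illustration only (the pool name, path, and object ID below are made up), the tail of that output generally looks something like this:

errors: Permanent errors have been detected in the following files:

        /source_pool/source_ost/some/damaged/file
        source_pool/source_ost:<0x2a>

Entries of the form dataset:<0xNN> indicate damaged objects whose file paths could not be resolved.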

Running a ‘zpool scrub’ is a good idea. If the zpool is protected with ZFS RAID (raidz or mirrors), the scrub may be able to repair some of the damage. If the zpool is not protected with ZFS RAID, the scrub will identify any other errors, but will likely NOT repair any of the damage.
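
Roughly, assuming the pool is named source_pool:

zpool scrub source_pool
zpool status -v source_pool    # check scrub progress and any newly reported errors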

If you have enough disk space on hardware that is behaving properly (and free space in the source zpool), you may want to replicate the OST dataset(s) that are reporting errors. Having a replicated dataset affords you the ability to examine the data without fear of further damage, and you can extract from the replica the files that produce IO errors on the source.

Something like this for replication should work:

zfs snap source_pool/source_ost@timestamp_label
zfs send -Rv source_pool/source_ost@timestamp_label | zfs receive destination_pool/source_ost_replicated

You will need to set zfs_send_corrupt_data to 1 in /sys/module/zfs/parameters, or the ‘zfs send’ will error out and fail when sending a dataset with read and/or checksum errors.
Enabling zfs_send_corrupt_data allows the zfs send operation to complete. Any blocks that are damaged on the source side will be filled with the pattern 0x2f5baddb10c on the destination side. This can help you determine whether an entire file is corrupt or only parts of it.
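
For example, assuming the standard sysfs location for the ZFS module parameters:

echo 1 > /sys/module/zfs/parameters/zfs_send_corrupt_data
cat /sys/module/zfs/parameters/zfs_send_corrupt_data    # should now report 1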

After the replication, you should set the replicated dataset to read-only with ‘zfs set readonly=on destination_pool/source_ost_replicated’.

Hopefully others can chime in about the Lustre errors you have noted. 

Thanks,
Tom
 

> On Dec 12, 2016, at 3:33 PM, Jesse Stroik <jesse.stroik at ssec.wisc.edu> wrote:
> 
> One of our lustre file systems still running lustre 2.5.3 and zfs 0.6.3 experienced corruption due to a bad RAID controller. The OST in question was a RAID6 volume which we've marked inactive. Most of our lustre clients are 2.8.0.
> 
> zpool status reports corruption and checksum errors. I have not run a scrub since the corruption was detected, but we did replace the bad RAID controller, and subsequent write tests to that OST have been fine. We haven't seen a change in the error count with the new RAID controller.
> 
> We're observing two types of errors. The first is that when we attempt to perform a long listing of a file to get its metadata, we get "cannot allocate memory" on the client. On the OSS in question, it's logged as:
> 
> ============
> LustreError: 10394:0:(ldlm_resource.c:1188:ldlm_resource_get()) odyssey-OST0002: lvbo_init failed for resource 0x8ccfa8:0x0: rc = -5
> LustreError: 8855:0:(osd_object.c:409:osd_object_init()) odyssey-OST0002: lookup [0x100000000:0x8ccf64:0x0]/0x78ed06 failed: rc = -5
> ============
> 
> As far as we can tell, this primarily affects recently written files and we're presently using robinhood to generate a file listing from OST2 to try to verify all files for this particular error.
> 
> We do have another error: attempts to read a few of our larger files on that OST result in I/O errors after a partial read. I'm not sure why this would have happened with the bad RAID controller as the two files we're aware of weren't being written to.
> 
> I'm interested to learn a bit more about these particular Lustre errors and return code and what our most likely recovery options are.
> 
> Best,
> Jesse
> 
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
