[lustre-discuss] LustreError on ZFS volumes
jesse.stroik at ssec.wisc.edu
Mon Dec 12 12:33:54 PST 2016
One of our Lustre file systems, still running Lustre 2.5.3 and ZFS
0.6.3, experienced corruption due to a bad RAID controller. The OST in
question is a RAID6 volume, which we've marked inactive. Most of our
Lustre clients are running 2.8.0.
zpool status reports corruption and checksum errors. I have not run a
scrub since the corruption was detected, but we did replace the bad
RAID controller, and subsequent write tests to that OST have been
fine. We haven't seen the error count change since the new RAID
controller went in.
We're observing two types of errors. The first occurs when we attempt
to perform a long listing of a file to get its metadata: the client
returns "cannot allocate memory". On the OSS in question, it's logged
as:
odyssey-OST0002: lvbo_init failed for resource 0x8ccfa8:0x0: rc = -5
odyssey-OST0002: lookup [0x100000000:0x8ccf64:0x0]/0x78ed06 failed: rc = -5
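If I'm reading the return code right, rc = -5 is -EIO, i.e. ZFS is
returning an I/O error when Lustre tries to read those objects, which
would be consistent with blocks failing checksum verification. To map
a FID from the logs back to a path, something like this should work
from a client (the mount point is a placeholder, and I'm not certain
2.5 can resolve an OST object FID like the one above directly; MDT
FIDs definitely resolve):

    # resolve a FID to the path(s) of the file that owns it
    lfs fid2path /mnt/odyssey "[0x100000000:0x8ccf64:0x0]"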
As far as we can tell, this primarily affects recently written files,
and we're presently using Robinhood to generate a file listing from
OST2 so we can verify every file for this particular error.
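In case it helps anyone else, the rough approach is to enumerate files
with objects on the affected OST and then stat each one (the mount
point and output file are placeholders for our setup):

    # list every file that has at least one stripe on OST0002
    lfs find /mnt/odyssey --obd odyssey-OST0002_UUID > ost2_files.txt

    # long-list each candidate; affected files surface the
    # "cannot allocate memory" error described above
    xargs -d '\n' ls -l < ost2_files.txt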
We're also seeing a second error: attempts to read a few of our larger
files on that OST result in I/O errors after a partial read. I'm not
sure why this would have happened with the bad RAID controller, as the
two files we're aware of weren't being written to at the time.
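A partial read failing would make sense if only some stripes of those
files land on the damaged OST; the read would succeed up to the first
bad object. To check (the path is a placeholder):

    # show the OST object backing each stripe of the file
    lfs getstripe -v /mnt/odyssey/path/to/large_file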
I'd be interested to learn a bit more about these particular Lustre
errors and return codes, and what our most likely recovery options
are.