[lustre-discuss] LustreError on ZFS volumes

Mon Dec 12 12:33:54 PST 2016

One of our Lustre file systems, still running Lustre 2.5.3 and ZFS 0.6.3, 
experienced corruption due to a bad RAID controller. The OST in question 
was a RAID6 volume, which we've marked inactive. Most of our Lustre 
clients are 2.8.0.

zpool status reports corruption and checksum errors. I have not run a 
scrub since the corruption was detected, but we did replace the bad RAID 
controller, and subsequent write tests to that OST have been fine. The 
error count has not changed with the new RAID controller.

We're observing two types of errors. The first occurs when we attempt a 
long listing of a file to get its metadata: the client returns "cannot 
allocate memory". On the OSS in question, it's logged as:

============
LustreError: 10394:0:(ldlm_resource.c:1188:ldlm_resource_get()) odyssey-OST0002: lvbo_init failed for resource 0x8ccfa8:0x0: rc = -5
LustreError: 8855:0:(osd_object.c:409:osd_object_init()) odyssey-OST0002: lookup [0x100000000:0x8ccf64:0x0]/0x78ed06 failed: rc = -5
============
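
For reference, the rc values in these messages follow the Linux kernel 
convention of negative errno numbers, so they can be decoded with a few 
lines of Python. This is only a sketch: the decode_rc helper and its 
regular expression are illustrative and not part of any Lustre tool.

```python
import errno
import os
import re

# Matches the trailing "rc = <number>" that appears in the log lines above.
RC_PATTERN = re.compile(r"rc = (-?\d+)")


def decode_rc(line):
    """Return (rc, errno_name, description) for a log line, or None."""
    match = RC_PATTERN.search(line)
    if match is None:
        return None
    rc = int(match.group(1))
    code = abs(rc)  # kernel code reports errors as negative errno values
    name = errno.errorcode.get(code, "UNKNOWN")
    return rc, name, os.strerror(code)


if __name__ == "__main__":
    sample = ("odyssey-OST0002: lvbo_init failed for resource "
              "0x8ccfa8:0x0: rc = -5")
    print(decode_rc(sample))
```

On Linux, -5 decodes to EIO (an I/O error on the server side), while the 
"cannot allocate memory" message seen on the client corresponds to 
ENOMEM (errno 12).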

As far as we can tell, this primarily affects recently written files, and 
we're presently using robinhood to generate a listing of the files on 
OST2 so that we can check each of them for this particular error.
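
A minimal sketch of such a per-file check, assuming the robinhood output 
has been reduced to a plain text file with one client-side path per line 
(that file and the check_paths and read_listing helpers are 
hypothetical): lstat each path, which is the kind of metadata lookup a 
long listing performs, and record the errno of any failure.

```python
import errno
import os
import sys


def check_paths(paths):
    """lstat() each path; return a list of (path, errno_name) failures."""
    failures = []
    for path in paths:
        try:
            os.lstat(path)
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            failures.append((path, name))
    return failures


def read_listing(listing_file):
    """Read one path per line, skipping blank lines."""
    with open(listing_file, encoding="utf-8", errors="surrogateescape") as fh:
        return [line.rstrip("\n") for line in fh if line.strip()]


if __name__ == "__main__":
    # Usage (hypothetical): python check_ost_files.py ost2_paths.txt
    if len(sys.argv) == 2:
        for path, name in check_paths(read_listing(sys.argv[1])):
            print(f"{name}\t{path}")
```

Given the client message quoted above, files hit by the first error 
would be expected to show up as ENOMEM in this output, though that is an 
assumption until it has been run against real files.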

The second error is that attempts to read a few of our larger files on 
that OST result in I/O errors after a partial read. I'm not sure why 
the bad RAID controller would have caused this, as the two files we're 
aware of weren't being written to.
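
To pin down where such a read fails, something like the following could 
be run on a client. It is a sketch only: probe_read and its 1 MiB chunk 
size are arbitrary choices, not a Lustre utility. It reads a file 
sequentially and reports how many bytes were read before the first error.

```python
import errno


def probe_read(path, chunk_size=1 << 20):
    """Read path sequentially; return (bytes_read, errno_name_or_None)."""
    offset = 0
    try:
        # buffering=0 keeps each read() a direct request of chunk_size bytes
        with open(path, "rb", buffering=0) as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    return offset, None  # reached end of file cleanly
                offset += len(chunk)
    except OSError as exc:
        # offset is the number of bytes read successfully before the failure
        return offset, errno.errorcode.get(exc.errno, str(exc.errno))
```

The reported offset is only accurate to within one chunk, so a smaller 
chunk_size narrows down the failing region at the cost of more reads.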

I'm interested to learn a bit more about these particular Lustre errors 
and the return code, and what our most likely recovery options are.

Best,
Jesse
