[lustre-discuss] Error on a zpool underlying an OST
Kevin Abbey
kevin.abbey at rutgers.edu
Fri Jul 15 08:58:53 PDT 2016
Hi Bob,
Thank you for the notes. I began to examining the zpool before
obtaining the new LSI card. I was unable to start lustre without the
new card. Once I installed the replacement and re-examined the zpools
the resilvered pool was re-scrubbed, exported and reimported, and to my
surprise repaired. As a further test, I removed the spare disk that
replaced the "apparent" bad disk and re-added the disk that was
removed. The zpool resilvered ok and scrubbed clean. Lustre mounted
and cleaned a few orphaned blocks but appeared fully functional from the
client side. However, without a "snapshot" (file list, md5sums - though
zfs does internal check sums) of the prior status I cannot be sure if a
data file was lost. This is something I'll need to address. Maybe
Robinhood can help with this?
Thanks again for the notes. They will likely be useful in a similar
scenario.
Kevin
On 07/12/2016 09:10 AM, Bob Ball wrote:
> The answer came offline, and I guess I never replied back to the
> original posting. This is what I learned. It deals with only a
> single file, not 1000's. --bob
> -------------------------------
>
> On Mon, 14 Mar 2016, Bob Ball wrote:
>
> OK, it would seem the affected user has already deleted this file, as
> the "lfs fid2path" returns:
> [root at umt3int01 ~]# lfs fid2path /lustre/umt3 [0x200002582:0xb5c0:0x0]
> fid2path: error on FID [0x200002582:0xb5c0:0x0]: No such file or
> directory
>
> I verified I could to it back and forth using a different file.
>
> I am making one last check, with the OST re-activated (I had set it
> inactive on our MDT/MGS to keep new files off while figuring this out).
>
> Nope, gone. Time to do the clear and remove the snapshot.
>
> Thanks for your help on this.
>
> bob
>
> On 3/14/2016 10:45 AM, Don Holmgren wrote:
>
> No, no downside. The snapshot really is just used so that I can do this
> sort of repair live.
>
> Once you've found the Lustre OID with "find", for
> ll_decode_filter_fid to
> work you'll have to then umount the OST and remount as type lustre.
>
> Good luck!
>
> Don
>
> Thank you! This is very helpful.
>
> I have no space to make a snapshot, so I will just umount this OST for
> a bit and remount it zfs. Our users can take some off-time if we are
> not busy just then.
>
> It will be an interesting process. I'm all set to drain and remake
> though, should this method not work. I was putting that off to start
> until later today as I've other issues just now. Since it would take
> me 2-3 days total to drain, remake and refill, your detailed method is
> far more likeable for me.
>
> Just to be certain, other than the temporary unavailability of the
> Lustre file system, do you see any downside to not working from a
> snapshot?
>
> bob
>
>
> On 3/14/2016 10:21 AM, Don Holmgren wrote:
>
> Hi Bob -
>
> I only get the lustre-discuss digest, so am not sure how to reply to
> that
> whole list. But I can reply directly to you regarding your posting
> (copied at the bottom).
>
> In the ZFS error message
>
> errors: Permanent errors have been detected in the following files:
> ost-007/ost0030:<0x2c90f>
>
> 0x2c90f is the ZFS inode number of the damaged item. To turn this
> into a
> Lustre filename, do the following:
>
> 1. First, you have to use "find" using that inode number to get the
> corresponding
> Lustre object ID. I do this via a ZFS snapshot, something like:
>
> zfs snapshot ost-007/ost0030 at mar14
> mount -t zfs ost-007/ost0030 at mar14 /mnt/snapshot
> find /mnt/snapshot/O -inum 182543
>
> (note 0x2c90f = 182543 decimal). This may return something like
>
> /mnt/snapshot/O/0/d22/54
>
> if indeed the damaged item is a file object.
>
>
> 2. OK, assuming the "find" did return a file object like above (in this
> case the
> Lustre OID of the object is 54) you need to find the parent "FID" of
> that
> OID. Do this as follows on the OSS where you've mounted the
> snapshot:
>
> [root at lustrenew3 ~]# ll_decode_filter_fid /mnt/snapshot/O/0/d22/54
> /mnt/snapshot/O/0/d22/54: parent=[0x20000040000010a:0x0:0x0] stripe=0
>
>
> 3. That string "0x20000040000010a:0x0:0x0" is related to the Lustre FID.
> You
> can use "lfs fid2path" to convert this to a filename. "lfs fid2path"
> must be
> execute on a client of your Lustre filesystem. And, on our
> Lustre, the
> return string must be slightly altered (chopped up differently):
>
> [root at client ~]# lfs fid2path /djhzlus [0x200000400:0x10a:0x0]
> /djhzlus/test/copy1/l6496f21b7075m00155m031/gauge/Coulomb/l6496f21b7075m00155m031-Coul_002
>
>
> Here /djhzlus was where the Lustre filesystem was mounted on my
> client
> (client). fid2path takes three numbers, in my case the first was
> the first 9 hex digits of the return from ll_decode_filter_fid, and
> the
> second was the last 5 hex digits (I supressed the leading zeros) and
> the
> third was 0x0 (not sure whether this was the 2nd or 3rd field from
> ll_decode_filter_fid.
>
> You can always use "lfs path2fid" on your Lustre client against
> another
> file in your filesystem to find the pattern for your FID.
>
> To check that you've indeed found the correct file, you can do
> "lfs getstripe" to confirm that the objid matches the Lustre OID you
> got with the find.
>
>
> Once you figure out the bad file, you can delete it from Lustre, and
> then
> use "zpool clear ost-007" to clear the reporting of
> ost-007/ost0030:<0x2c90f>
> Don't forget to umount and delete your ZFS snapshot of the OST with the
> bad file.
>
>
> I should mention that I found a Python script ("zfsobj2fid") somewhere
> that directly returns the FID using the ZFS debugger ("zdb") directly
> against the mounted OST. You can probably google for zfsobj2fid; if you
> can't find it let me know and I'll dig around to see if I still have a
> copy. Here's how I used it to get the FID for "lfs fid2path":
>
> [root at lustrenew3 ~]# ./zfsobj2fid zp2/ost2 0x113
> [0x20000040000010a:0x0:0x0]
>
> (my OID was 0x113, my pool was "zp2" and the ZFS OST was "ost2"). But,
> note that the FID returned still needs to be manipulated as above. I
> found this note in one of my write-ups about this manipulation:
>
> "Evidentally, the 'trusted.fid' xattr kept in ZFS for the OID file
> contains both the first and second sections of the FID (according to
> some
> slide decks I found, the FID is [sequence:objectID:version], so the
> xattr
> has the sequence and the objectID."
>
>
> Cheers -
>
> Don Holmgren
> Fermilab
>
> On 7/12/2016 12:02 AM, Kevin Abbey wrote:
>> Hi,
>>
>> Can anyone advise how to clean up 1000s of zfs level permanent errors
>> and the lustre level too?
>>
>> A similar question was presented on the list but I did not see an
>> answer.
>> https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg12454.html
>>
>>
>> As I was testing new hardware I discovered an LSI HBA was bad. On a
>> single combined MDS/OSS there were 8 OSTs split across 2 jbod and 2
>> LSI HBA. The mdt was on a 3rd jbod downlinked on the jbod connected
>> with the bad controller. The zpools connected to the good HBA were
>> scrubed clean after unmounting and stopping lustre. The zpools on the
>> bad controller continued to have errors while connected to the bad
>> controller. One of these OSTs reported a disk failure during the
>> scrub and began resilvering yet autoreplace was off. This is a
>> very bad event considering the card was causing all of the errors.
>> Neither a scrub or resilver would ever complete. I stopped the scrub
>> on the 3 other osts and detached the spare from the ost in resilver
>> process. After narrowing down the bad HBA (initially it was not
>> clear if cables or jbod backplanes were bad), I use the good HBA to
>> scrub the jbod 1 again, then shutdown disconnected the jbod1. Then
>> proceeded to connect the jbod2 to the good controller to scrub the
>> jbod 2 zpools which had previously been attached to the bad LSI
>> controller. The 3 zpools which had scrub stopped previously did
>> complete successfully. The one which had begun resilvering began
>> again to resilver after I initiated a replace of the failed disk with
>> the spare. The resilver completed but many permanent errors were
>> discovered on the zpool. Since this is a test pool I was interested
>> to know if zfs would recover. In a real scenario with HW problems
>> I'll shutdown and disconnect the data drives prior to HW testing.
>>
>> The status listed below shows a new scrub in process after the
>> resilver completed. The cache drive is missing because the 3rd jbod
>> is disconnected temporarily.
>>
>>
>> ===================================
>>
>> ZFS: v0.6.5.7-1
>> lustre 2.8.55
>> kernel 2.6.32_642.1.1.el6.x86_64.x86_64
>> Centos 6.8
>>
>>
>> ===================================
>> ~]# zpool status -v test-ost4
>> pool: test-ost4
>> state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>> corruption. Applications may be affected.
>> action: Restore the file in question if possible. Otherwise restore the
>> entire pool from backup.
>> see: http://zfsonlinux.org/msg/ZFS-8000-8A
>> scan: scrub in progress since Mon Jul 11 22:29:09 2016
>> 689G scanned out of 12.4T at 711M/s, 4h49m to go
>> 40K repaired, 5.41% done
>> config:
>>
>> NAME STATE READ WRITE CKSUM
>> test-ost4 ONLINE 0 0 180
>> raidz2-0 ONLINE 0 0 360
>> ata-ST4000NM0033-9ZM170_Z1Z7GYXY ONLINE 0 0 2
>> (repairing)
>> ata-ST4000NM0033-9ZM170_Z1Z7KKPQ ONLINE 0 0 3
>> (repairing)
>> ata-ST4000NM0033-9ZM170_Z1Z7L5E7 ONLINE 0 0 3
>> (repairing)
>> ata-ST4000NM0033-9ZM170_Z1Z7KGQT ONLINE 0 0 0
>> (repairing)
>> ata-ST4000NM0033-9ZM170_Z1Z7LA8K ONLINE 0 0 4
>> (repairing)
>> ata-ST4000NM0033-9ZM170_Z1Z7KB0X ONLINE 0 0 3
>> (repairing)
>> ata-ST4000NM0033-9ZM170_Z1Z7JSMN ONLINE 0 0 2
>> (repairing)
>> ata-ST4000NM0033-9ZM170_Z1Z7KXRA ONLINE 0 0 2
>> (repairing)
>> ata-ST4000NM0033-9ZM170_Z1Z7MLSN ONLINE 0 0 2
>> (repairing)
>> ata-ST4000NM0033-9ZM170_Z1Z7L4DT ONLINE 0 0 7
>> (repairing)
>> cache
>> ata-D2CSTK251M20-0240_A19CV011227000092 UNAVAIL 0 0 0
>>
>> errors: Permanent errors have been detected in the following files:
>>
>> test-ost4/test-ost4:<0xe00>
>> test-ost4/test-ost4:<0xe01>
>> test-ost4/test-ost4:<0xe02>
>> test-ost4/test-ost4:<0xe03>
>> test-ost4/test-ost4:<0xe04>
>> test-ost4/test-ost4:<0xe05>
>> test-ost4/test-ost4:<0xe06>.......
>> .......
>> .......continues......
>> .......
>> .......
>> test-ost4/test-ost4:<0xdfe>
>> test-ost4/test-ost4:<0xdff>
>> ===================================
>>
>> Follow up questions,
>>
>> Is is better to not have a spare attached to the pool to prevent
>> resilvering in this scenario? (bad HBA, disk failed during scrub,
>> resilver began, yet auto relplace was off. The spare was assigned to
>> the zpool.)
>>
>> In a dual path to the jbod would the bad HBA card be disabled
>> automatically to prevent IO errors reaching the disk? The current
>> setup is single path only.
>>
>>
>> Thank you for any notes in advance,
>> Kevin
>>
>
--
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/
Rutgers University - Science Building
315 Penn St.
Camden, NJ 08102
Telephone: (856) 225-6770
Fax:(856) 225-6312
Email: kevin.abbey at rutgers.edu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20160715/53836ef8/attachment-0001.htm>
More information about the lustre-discuss
mailing list