[lustre-discuss] Error on a zpool underlying an OST

Kevin Abbey kevin.abbey at rutgers.edu
Fri Jul 15 08:58:53 PDT 2016


Hi Bob,

Thank you for the notes.  I began examining the zpool before obtaining
the new LSI card.  I was unable to start Lustre without the new card.
Once I installed the replacement and re-examined the zpools, the
resilvered pool was re-scrubbed, exported and re-imported, and to my
surprise it was repaired.  As a further test, I removed the spare disk
that had replaced the "apparent" bad disk and re-added the disk that had
been removed.  The zpool resilvered OK and scrubbed clean.  Lustre
mounted and cleaned a few orphaned blocks, and appeared fully functional
from the client side.  However, without a "snapshot" (a file list and
md5sums - though ZFS does internal checksums) of the prior status, I
cannot be sure whether a data file was lost.  This is something I'll
need to address.  Maybe Robinhood can help with this?
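
In the meantime, one idea is to generate a manifest from a client before
and after this kind of maintenance.  A minimal sketch (assuming the
filesystem is mounted on the client at /lustre/test - that path, and the
output file name, are just placeholders):

   # record size, mtime and md5 for every file, as seen from a client
   lfs find /lustre/test -type f | while read f; do
       stat -c '%s %Y %n' "$f"
       md5sum "$f"
   done > /root/lustre-manifest-$(date +%Y%m%d).txt

Comparing two such manifests (e.g. with diff) would at least show which
files changed or went missing.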

Thanks again for the notes.  They will likely be useful in a similar 
scenario.
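
If I ever do hit the 1000s-of-errors case from my original post below, I
imagine the per-file steps could be scripted.  A rough, untested sketch
(it assumes the damaged OST's snapshot is already mounted at
/mnt/snapshot as in your notes, and that the hex object numbers are
pasted in from "zpool status -v"):

   # map each damaged ZFS object to its Lustre object and parent FID
   for hex in 0xe00 0xe01 0xe02; do        # ...paste the full list here
       inum=$(printf '%d' "$hex")
       obj=$(find /mnt/snapshot/O -inum "$inum")
       [ -n "$obj" ] && ll_decode_filter_fid $obj
   done

The parent FIDs could then be fed to "lfs fid2path" on a client, as you
described.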

Kevin


On 07/12/2016 09:10 AM, Bob Ball wrote:
> The answer came offline, and I guess I never replied back to the 
> original posting.  This is what I learned.  It deals with only a 
> single file, not 1000's.  --bob
> -------------------------------
>
> On Mon, 14 Mar 2016, Bob Ball wrote:
>
> OK, it would seem the affected user has already deleted this file, as 
> the "lfs fid2path" returns:
> [root@umt3int01 ~]# lfs fid2path /lustre/umt3 [0x200002582:0xb5c0:0x0]
> fid2path: error on FID [0x200002582:0xb5c0:0x0]: No such file or 
> directory
>
> I verified I could do it back and forth using a different file.
>
> I am making one last check, with the OST re-activated (I had set it 
> inactive on our MDT/MGS to keep new files off while figuring this out).
>
> Nope, gone.  Time to do the clear and remove the snapshot.
>
> Thanks for your help on this.
>
> bob
>
> On 3/14/2016 10:45 AM, Don Holmgren wrote:
>
>  No, no downside.  The snapshot really is just used so that I can do this
>  sort of repair live.
>
>  Once you've found the Lustre OID with "find", for 
> ll_decode_filter_fid to
>  work you'll have to then umount the OST and remount as type lustre.
>
>  Good luck!
>
>  Don
>
> Thank you!  This is very helpful.
>
> I have no space to make a snapshot, so I will just umount this OST for 
> a bit and remount it zfs.  Our users can take some off-time if we are 
> not busy just then.
>
> It will be an interesting process.  I'm all set to drain and remake
> the OST though, should this method not work.  I was putting off
> starting that until later today as I have other issues just now.
> Since it would take me 2-3 days total to drain, remake and refill,
> your detailed method is far preferable for me.
>
> Just to be certain, other than the temporary unavailability of the 
> Lustre file system, do you see any downside to not working from a 
> snapshot?
>
> bob
>
>
> On 3/14/2016 10:21 AM, Don Holmgren wrote:
>
>  Hi Bob -
>
>  I only get the lustre-discuss digest, so am not sure how to reply to 
> that
>  whole list.  But I can reply directly to you regarding your posting
>  (copied at the bottom).
>
>  In the ZFS error message
>
>     errors: Permanent errors have been detected in the following files:
>           ost-007/ost0030:<0x2c90f>
>
>  0x2c90f is the ZFS inode number of the damaged item.  To turn this 
> into a
>  Lustre filename, do the following:
>
>  1. First, you have to use "find" with that inode number to get the
>     corresponding Lustre object ID.  I do this via a ZFS snapshot,
>     something like:
>
>     zfs snapshot ost-007/ost0030@mar14
>     mount -t zfs ost-007/ost0030@mar14 /mnt/snapshot
>     find /mnt/snapshot/O -inum 182543
>
>  (note 0x2c90f = 182543 decimal).  This may return something like
>
>     /mnt/snapshot/O/0/d22/54
>
>  if indeed the damaged item is a file object.
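>
>     (A quick shell way to do that hex-to-decimal conversion, by the way:
>
>        printf '%d\n' 0x2c90f        # prints 182543
>
>     so you don't have to work it out by hand.)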
>
>
>  2. OK, assuming the "find" did return a file object like above (in this
>  case the
>     Lustre OID of the object is 54) you need to find the parent "FID" of
>  that
>     OID.  Do this as follows on the OSS where you've mounted the 
> snapshot:
>
>     [root@lustrenew3 ~]# ll_decode_filter_fid /mnt/snapshot/O/0/d22/54
>     /mnt/snapshot/O/0/d22/54: parent=[0x20000040000010a:0x0:0x0] stripe=0
>
>
>  3. That string "0x20000040000010a:0x0:0x0" is related to the Lustre FID.
>  You
>     can use "lfs fid2path" to convert this to a filename.  "lfs fid2path"
>     must be executed on a client of your Lustre filesystem.  And, on our
>     Lustre, the return string must be slightly altered (chopped up
>     differently):
>
>      [root@client ~]# lfs fid2path /djhzlus [0x200000400:0x10a:0x0]
>  /djhzlus/test/copy1/l6496f21b7075m00155m031/gauge/Coulomb/l6496f21b7075m00155m031-Coul_002 
>
>
>     Here /djhzlus was where the Lustre filesystem was mounted on my
>     client (client).  fid2path takes three numbers; in my case the first
>     was the first 9 hex digits of the return from ll_decode_filter_fid,
>     the second was the last 5 hex digits (I suppressed the leading
>     zeros), and the third was 0x0 (not sure whether this was the 2nd or
>     3rd field from ll_decode_filter_fid).
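>
>     (If you want to script that splitting, something like the shell
>     sketch below should work on this format - treat it as a rough aid,
>     since the exact packing could differ between Lustre versions:
>
>        parent=0x20000040000010a          # from ll_decode_filter_fid
>        hex=${parent#0x}
>        seq=0x${hex:0:9}                  # first 9 hex digits
>        oid=$(printf '0x%x' 0x${hex:9})   # the rest, leading zeros dropped
>        echo "[$seq:$oid:0x0]"            # -> [0x200000400:0x10a:0x0]
>
>     and the result is the form that "lfs fid2path" expects.)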
>
>     You can always use "lfs path2fid" on your Lustre client against 
> another
>     file in your filesystem to find the pattern for your FID.
>
>     To check that you've indeed found the correct file, you can do
>     "lfs getstripe" to confirm that the objid matches the Lustre OID you
>     got with the find.
>
>
>  Once you figure out the bad file, you can delete it from Lustre, and 
> then
>  use "zpool clear ost-007" to clear the reporting of
>           ost-007/ost0030:<0x2c90f>
>  Don't forget to umount and delete your ZFS snapshot of the OST with the
>  bad file.
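>
>  (Concretely, with the names used above - the path to the damaged file
>  is just a placeholder here - that last part is roughly:
>
>     rm /djhzlus/path/to/damaged_file      # on a Lustre client
>     zpool clear ost-007                   # on the OSS
>     umount /mnt/snapshot
>     zfs destroy ost-007/ost0030@mar14
>
>  after which the pool should report clean again.)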
>
>
>  I should mention that I found a Python script ("zfsobj2fid") somewhere
>  that returns the FID by running the ZFS debugger ("zdb") directly
>  against the mounted OST.  You can probably google for zfsobj2fid; if you
>  can't find it let me know and I'll dig around to see if I still have a
>  copy.  Here's how I used it to get the FID for "lfs fid2path":
>
>      [root@lustrenew3 ~]# ./zfsobj2fid zp2/ost2 0x113
>      [0x20000040000010a:0x0:0x0]
>
>  (my OID was 0x113, my pool was "zp2" and the ZFS OST was "ost2"). But,
>  note that the FID returned still needs to be manipulated as above. I
>  found this note in one of my write-ups about this manipulation:
>
>  "Evidentally, the 'trusted.fid' xattr kept in ZFS for the OID file
>  contains both the first and second sections of the FID (according to 
> some
>  slide decks I found, the FID is [sequence:objectID:version], so the 
> xattr
>  has the sequence and the objectID."
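>
>  If you want to look at that xattr yourself, dumping the object with zdb
>  should show it.  This is only a sketch - it assumes xattr=sa is in use
>  on the OST dataset, and the output format varies with the ZFS version:
>
>      zdb -dddd zp2/ost2 275       # 275 = 0x113, the ZFS object number
>
>  The trusted.fid bytes then appear in the SA xattr dump for that object.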
>
>
>  Cheers -
>
>  Don Holmgren
>  Fermilab
>
> On 7/12/2016 12:02 AM, Kevin Abbey wrote:
>> Hi,
>>
>> Can anyone advise how to clean up 1000s of ZFS-level permanent errors,
>> and the corresponding Lustre-level errors too?
>>
>> A similar question was presented on the list but I did not see an 
>> answer.
>> https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg12454.html 
>>
>>
>> As I was testing new hardware I discovered an LSI HBA was bad.  On a
>> single combined MDS/OSS there were 8 OSTs split across 2 JBODs and 2
>> LSI HBAs.  The MDT was on a 3rd JBOD downlinked from the JBOD connected
>> to the bad controller.  The zpools connected to the good HBA were
>> scrubbed clean after unmounting and stopping Lustre.  The zpools on the
>> bad controller continued to have errors while connected to it.  One of
>> these OSTs reported a disk failure during the scrub and began
>> resilvering, even though autoreplace was off.  This is a very bad event
>> considering the card was causing all of the errors.  Neither a scrub
>> nor a resilver would ever complete.  I stopped the scrub on the 3 other
>> OSTs and detached the spare from the OST that was resilvering.  After
>> narrowing down the bad HBA (initially it was not clear whether the
>> cables or JBOD backplanes were bad), I used the good HBA to scrub JBOD
>> 1 again, then shut down and disconnected JBOD 1.  I then connected JBOD
>> 2 to the good controller to scrub the JBOD 2 zpools, which had
>> previously been attached to the bad LSI controller.  The 3 zpools whose
>> scrubs had been stopped earlier completed successfully.  The one that
>> had begun resilvering started resilvering again after I initiated a
>> replace of the failed disk with the spare.  The resilver completed, but
>> many permanent errors were discovered on the zpool.  Since this is a
>> test pool I was interested to know whether ZFS would recover.  In a
>> real scenario with HW problems I'll shut down and disconnect the data
>> drives prior to HW testing.
>>
>> The status listed below shows a new scrub in progress after the
>> resilver completed.  The cache drive is missing because the 3rd JBOD
>> is temporarily disconnected.
>>
>>
>> ===================================
>>
>> ZFS:    v0.6.5.7-1
>> Lustre: 2.8.55
>> Kernel: 2.6.32-642.1.1.el6.x86_64
>> CentOS: 6.8
>>
>>
>> ===================================
>>   ~]# zpool status -v test-ost4
>>   pool: test-ost4
>>  state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>>     corruption.  Applications may be affected.
>> action: Restore the file in question if possible. Otherwise restore the
>>     entire pool from backup.
>>    see: http://zfsonlinux.org/msg/ZFS-8000-8A
>>   scan: scrub in progress since Mon Jul 11 22:29:09 2016
>>     689G scanned out of 12.4T at 711M/s, 4h49m to go
>>     40K repaired, 5.41% done
>> config:
>>
>>     NAME                                       STATE READ WRITE CKSUM
>>     test-ost4                                  ONLINE 0     0 180
>>       raidz2-0                                 ONLINE 0     0 360
>>         ata-ST4000NM0033-9ZM170_Z1Z7GYXY       ONLINE 0     0 2  (repairing)
>>         ata-ST4000NM0033-9ZM170_Z1Z7KKPQ       ONLINE 0     0 3  (repairing)
>>         ata-ST4000NM0033-9ZM170_Z1Z7L5E7       ONLINE 0     0 3  (repairing)
>>         ata-ST4000NM0033-9ZM170_Z1Z7KGQT       ONLINE 0     0 0  (repairing)
>>         ata-ST4000NM0033-9ZM170_Z1Z7LA8K       ONLINE 0     0 4  (repairing)
>>         ata-ST4000NM0033-9ZM170_Z1Z7KB0X       ONLINE 0     0 3  (repairing)
>>         ata-ST4000NM0033-9ZM170_Z1Z7JSMN       ONLINE 0     0 2  (repairing)
>>         ata-ST4000NM0033-9ZM170_Z1Z7KXRA       ONLINE 0     0 2  (repairing)
>>         ata-ST4000NM0033-9ZM170_Z1Z7MLSN       ONLINE 0     0 2  (repairing)
>>         ata-ST4000NM0033-9ZM170_Z1Z7L4DT       ONLINE 0     0 7  (repairing)
>>     cache
>>       ata-D2CSTK251M20-0240_A19CV011227000092  UNAVAIL 0     0 0
>>
>> errors: Permanent errors have been detected in the following files:
>>
>>         test-ost4/test-ost4:<0xe00>
>>         test-ost4/test-ost4:<0xe01>
>>         test-ost4/test-ost4:<0xe02>
>>         test-ost4/test-ost4:<0xe03>
>>         test-ost4/test-ost4:<0xe04>
>>         test-ost4/test-ost4:<0xe05>
>>         test-ost4/test-ost4:<0xe06>.......
>>     .......
>>     .......continues......
>>     .......
>>     .......
>>         test-ost4/test-ost4:<0xdfe>
>>         test-ost4/test-ost4:<0xdff>
>> ===================================
>>
>> Follow up questions,
>>
>> Is it better not to have a spare attached to the pool, to prevent
>> resilvering in this scenario?  (Bad HBA, a disk failed during the
>> scrub, a resilver began even though autoreplace was off, and the spare
>> was assigned to the zpool.)
>>
>> With dual paths to the JBOD, would the bad HBA card be disabled
>> automatically to prevent I/O errors from reaching the disks?  The
>> current setup is single-path only.
>>
>>
>> Thank you for any notes in advance,
>> Kevin
>>
>

-- 
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/
  
Rutgers University - Science Building
315 Penn St.
Camden, NJ 08102
Telephone: (856) 225-6770
Fax:(856) 225-6312
Email: kevin.abbey at rutgers.edu
