[lustre-discuss] Error on a zpool underlying an OST

Tue Jul 12 06:10:24 PDT 2016

The answer came offline, and I guess I never replied back to the 
original posting.  This is what I learned.  It deals with only a single 
file, not 1000's.  --bob
-------------------------------

On Mon, 14 Mar 2016, Bob Ball wrote:

OK, it would seem the affected user has already deleted this file, as 
the "lfs fid2path" returns:
[root@umt3int01 ~]# lfs fid2path /lustre/umt3 [0x200002582:0xb5c0:0x0]
fid2path: error on FID [0x200002582:0xb5c0:0x0]: No such file or directory

I verified I could do the conversion back and forth using a different file.

I am making one last check, with the OST re-activated (I had set it 
inactive on our MDT/MGS to keep new files off while figuring this out).

Nope, gone.  Time to do the clear and remove the snapshot.

Thanks for your help on this.

bob

On 3/14/2016 10:45 AM, Don Holmgren wrote:

  No, no downside.  The snapshot really is just used so that I can do this
  sort of repair live.

  Once you've found the Lustre OID with "find", for ll_decode_filter_fid to
  work you'll have to then umount the OST and remount as type lustre.

  Good luck!

  Don

Thank you!  This is very helpful.

I have no space to make a snapshot, so I will just umount this OST for a 
bit and remount it zfs.  Our users can take some off-time if we are not 
busy just then.

It will be an interesting process.  I'm all set to drain and remake 
though, should this method not work.  I was putting that off to start 
until later today as I've other issues just now.  Since it would take me 
2-3 days total to drain, remake and refill, your detailed method is far 
more likeable for me.

Just to be certain, other than the temporary unavailability of the 
Lustre file system, do you see any downside to not working from a snapshot?

bob

On 3/14/2016 10:21 AM, Don Holmgren wrote:

  Hi Bob -

  I only get the lustre-discuss digest, so am not sure how to reply to that
  whole list.  But I can reply directly to you regarding your posting
  (copied at the bottom).

  In the ZFS error message

     errors: Permanent errors have been detected in the following files:
           ost-007/ost0030:<0x2c90f>

  0x2c90f is the ZFS inode number of the damaged item.  To turn this into a
  Lustre filename, do the following:

  1. First, you have to use "find" using that inode number to get the
     corresponding Lustre object ID.  I do this via a ZFS snapshot,
     something like:

     zfs snapshot ost-007/ost0030@mar14
     mount -t zfs ost-007/ost0030@mar14 /mnt/snapshot
     find /mnt/snapshot/O -inum 182543

  (note 0x2c90f = 182543 decimal).  This may return something like

     /mnt/snapshot/O/0/d22/54

  if indeed the damaged item is a file object.

  2. OK, assuming the "find" did return a file object like above (in this
     case the Lustre OID of the object is 54) you need to find the parent
     "FID" of that OID.  Do this as follows on the OSS where you've mounted
     the snapshot:

     [root@lustrenew3 ~]# ll_decode_filter_fid /mnt/snapshot/O/0/d22/54
     /mnt/snapshot/O/0/d22/54: parent=[0x20000040000010a:0x0:0x0] stripe=0

  3. That string "0x20000040000010a:0x0:0x0" is related to the Lustre FID.
     You can use "lfs fid2path" to convert this to a filename.  "lfs fid2path"
     must be executed on a client of your Lustre filesystem.  And, on our
     Lustre, the return string must be slightly altered (chopped up
     differently):

      [root@client ~]# lfs fid2path /djhzlus [0x200000400:0x10a:0x0]
  /djhzlus/test/copy1/l6496f21b7075m00155m031/gauge/Coulomb/l6496f21b7075m00155m031-Coul_002

     Here /djhzlus was where the Lustre filesystem was mounted on my client
     (client).  fid2path takes three numbers: in my case the first was
     the first 9 hex digits of the return from ll_decode_filter_fid, the
     second was the last 5 hex digits (I suppressed the leading zeros), and
     the third was 0x0 (not sure whether this was the 2nd or 3rd field from
     ll_decode_filter_fid).

     You can always use "lfs path2fid" on your Lustre client against
     another file in your filesystem to find the pattern for your FID.

     To check that you've indeed found the correct file, you can do
     "lfs getstripe" to confirm that the objid matches the Lustre OID you
     got with the find.
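
The reshaping described in step 3 can be sketched as a small shell function. This is only an illustration under the assumption stated above: the first 9 hex digits of the first field are taken as the sequence, the remaining digits as the object ID with leading zeros dropped, and the version is written as 0x0. The name reshape_parent_fid is hypothetical (it is not a Lustre tool), and the split may differ on other Lustre versions, so compare the result with "lfs path2fid" output for a known file before relying on it.

```shell
# Hypothetical helper, not part of Lustre: reshape the parent= value printed
# by ll_decode_filter_fid into the [seq:oid:ver] form used with "lfs fid2path".
# Assumption: first 9 hex digits = sequence, remaining digits = object ID.
# Note: plain POSIX sh, so the variables below are global.
reshape_parent_fid() {
    # keep only the first colon-separated field, drop an optional "[" and "0x"
    hex=$(printf '%s\n' "$1" | cut -d: -f1 | sed -e 's/^\[//' -e 's/^0x//')
    seq=$(printf '%s\n' "$hex" | cut -c1-9)
    oid=$(printf '%s\n' "$hex" | cut -c10-)
    # printf %x re-prints the object ID without its leading zeros
    printf '[0x%s:0x%x:0x0]\n' "$seq" "0x${oid:-0}"
}
```

For the example above, "reshape_parent_fid 0x20000040000010a:0x0:0x0" prints [0x200000400:0x10a:0x0], which is the form passed to "lfs fid2path" in step 3.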

  Once you figure out the bad file, you can delete it from Lustre, and then
  use "zpool clear ost-007" to clear the reporting of
           ost-007/ost0030:<0x2c90f>
  Don't forget to umount and delete your ZFS snapshot of the OST with the
  bad file.
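
As a rough sketch of that cleanup, the function below only prints the commands for review rather than running them. The function name print_cleanup is hypothetical, and the order simply follows the paragraph above; "zfs destroy" is used here as the way to delete the snapshot. It assumes the damaged file has already been removed from Lustre on a client.

```shell
# Hypothetical dry-run helper: prints the cleanup commands instead of
# executing them, so they can be checked before being run by hand.
print_cleanup() {
    pool=$1        # e.g. ost-007
    snapshot=$2    # e.g. ost-007/ost0030@mar14
    mountpoint=$3  # e.g. /mnt/snapshot
    printf 'zpool clear %s\n' "$pool"      # clear the reported error
    printf 'umount %s\n' "$mountpoint"     # unmount the snapshot
    printf 'zfs destroy %s\n' "$snapshot"  # delete the snapshot
}
```

Calling "print_cleanup ost-007 ost-007/ost0030@mar14 /mnt/snapshot" lists the three commands for the pool and snapshot used in this example.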

  I should mention that I found a Python script ("zfsobj2fid") somewhere
  that directly returns the FID using the ZFS debugger ("zdb") directly
  against the mounted OST.  You can probably google for zfsobj2fid; if you
  can't find it let me know and I'll dig around to see if I still have a
  copy.  Here's how I used it to get the FID for "lfs fid2path":

      [root@lustrenew3 ~]# ./zfsobj2fid zp2/ost2 0x113
      [0x20000040000010a:0x0:0x0]

  (my OID was 0x113, my pool was "zp2" and the ZFS OST was "ost2"). But,
  note that the FID returned still needs to be manipulated as above. I
  found this note in one of my write-ups about this manipulation:

  "Evidently, the 'trusted.fid' xattr kept in ZFS for the OID file
  contains both the first and second sections of the FID (according to some
  slide decks I found, the FID is [sequence:objectID:version], so the xattr
  has the sequence and the objectID)."

  Cheers -

  Don Holmgren
  Fermilab

On 7/12/2016 12:02 AM, Kevin Abbey wrote:
> Hi,
>
> Can anyone advise how to clean up 1000s of zfs level permanent errors 
> and the lustre level too?
>
> A similar question was presented on the list but I did not see an answer.
> https://www.mail-archive.com/lustre-discuss@lists.lustre.org/msg12454.html 
>
>
> As I was testing new hardware I discovered an LSI HBA was bad.  On a 
> single combined MDS/OSS there were 8 OSTs split across 2 jbod and 2 
> LSI HBA.  The mdt was on a 3rd jbod downlinked on the jbod connected 
> with the bad controller.  The zpools connected to the good HBA were 
> scrubbed clean after unmounting and stopping lustre. The zpools on the 
> bad controller continued to have errors while connected to the bad 
> controller.  One of these OSTs reported a disk failure during the 
> scrub and began resilvering yet autoreplace was off.    This is a very 
> bad event considering the card was causing all of the errors.  Neither 
> a scrub nor a resilver would ever complete.  I stopped the scrub on the 3 
> other OSTs and detached the spare from the OST that was resilvering.  
> After narrowing down the bad HBA (initially it was not clear if cables 
> or jbod backplanes were bad), I used the good HBA to scrub jbod 1 
> again, then shut down and disconnected jbod 1.  I then proceeded to 
> connect the jbod2 to the good controller to scrub the jbod 2 zpools 
> which had previously been attached to the bad LSI controller.  The 3 
> zpools which had scrub stopped previously did complete successfully.  
> The one which had begun resilvering began again to resilver after I 
> initiated a replace of the failed disk with the spare.  The resilver 
> completed but many permanent errors were discovered on the zpool.  
> Since this is a test pool I was interested to know if zfs would 
> recover.  In a real scenario with HW problems I'll shut down and 
> disconnect the data drives prior to HW testing.
>
> The status listed below shows a new scrub in process after the 
> resilver completed.  The cache drive is missing because the 3rd jbod 
> is disconnected temporarily.
>
>
> ===================================
>
> ZFS:   v0.6.5.7-1
> lustre 2.8.55
> kernel 2.6.32_642.1.1.el6.x86_64.x86_64
> Centos 6.8
>
>
> ===================================
>   ~]# zpool status -v test-ost4
>   pool: test-ost4
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>     corruption.  Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>     entire pool from backup.
>    see: http://zfsonlinux.org/msg/ZFS-8000-8A
>   scan: scrub in progress since Mon Jul 11 22:29:09 2016
>     689G scanned out of 12.4T at 711M/s, 4h49m to go
>     40K repaired, 5.41% done
> config:
>
>     NAME                                       STATE    READ WRITE CKSUM
>     test-ost4                                  ONLINE      0     0   180
>       raidz2-0                                 ONLINE      0     0   360
>         ata-ST4000NM0033-9ZM170_Z1Z7GYXY       ONLINE      0     0     2  (repairing)
>         ata-ST4000NM0033-9ZM170_Z1Z7KKPQ       ONLINE      0     0     3  (repairing)
>         ata-ST4000NM0033-9ZM170_Z1Z7L5E7       ONLINE      0     0     3  (repairing)
>         ata-ST4000NM0033-9ZM170_Z1Z7KGQT       ONLINE      0     0     0  (repairing)
>         ata-ST4000NM0033-9ZM170_Z1Z7LA8K       ONLINE      0     0     4  (repairing)
>         ata-ST4000NM0033-9ZM170_Z1Z7KB0X       ONLINE      0     0     3  (repairing)
>         ata-ST4000NM0033-9ZM170_Z1Z7JSMN       ONLINE      0     0     2  (repairing)
>         ata-ST4000NM0033-9ZM170_Z1Z7KXRA       ONLINE      0     0     2  (repairing)
>         ata-ST4000NM0033-9ZM170_Z1Z7MLSN       ONLINE      0     0     2  (repairing)
>         ata-ST4000NM0033-9ZM170_Z1Z7L4DT       ONLINE      0     0     7  (repairing)
>     cache
>       ata-D2CSTK251M20-0240_A19CV011227000092  UNAVAIL     0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
>         test-ost4/test-ost4:<0xe00>
>         test-ost4/test-ost4:<0xe01>
>         test-ost4/test-ost4:<0xe02>
>         test-ost4/test-ost4:<0xe03>
>         test-ost4/test-ost4:<0xe04>
>         test-ost4/test-ost4:<0xe05>
>         test-ost4/test-ost4:<0xe06>.......
>     .......
>     .......continues......
>     .......
>     .......
>         test-ost4/test-ost4:<0xdfe>
>         test-ost4/test-ost4:<0xdff>
> ===================================
>
> Follow up questions,
>
> Is it better to not have a spare attached to the pool to prevent 
> resilvering in this scenario?  (bad HBA, disk failed during scrub, 
> resilver began, yet autoreplace was off.  The spare was assigned to 
> the zpool.)
>
> With dual paths to the jbod, would the bad HBA card be disabled 
> automatically to prevent IO errors reaching the disk?  The current 
> setup is single path only.
>
>
> Thank you for any notes in advance,
> Kevin
>
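
For a long error list like the one in the quoted "zpool status -v" output, a first step toward applying the single-file method above in bulk might be to pull out every object number and convert it to decimal for "find -inum". The sketch below does only that; the function name list_error_objects is hypothetical, it assumes error lines have the form dataset:<0xNNN>, and it makes no claim that every listed object is a Lustre file object.

```shell
# Hypothetical helper: read "zpool status -v" output on stdin and print one
# line per "dataset:<0xNNN>" entry: dataset, hex object number, decimal number.
# Leading whitespace and mail quote markers (">") are ignored.
list_error_objects() {
    sed -n 's/^[[:space:]>]*\([^ :<>][^ :<>]*\):<\(0x[0-9a-fA-F]*\)>.*$/\1 \2/p' |
    while read -r dataset hex; do
        # printf %d accepts a 0x-prefixed argument and prints it in decimal
        printf '%s %s %d\n' "$dataset" "$hex" "$hex"
    done
}
```

It could be fed with something like "zpool status -v test-ost4 | list_error_objects"; the third column is the number to try with "find /mnt/snapshot/O -inum" as in step 1 of the method above.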