[lustre-discuss] non-empty files found when backing up MDT with LVM snapshots and tar

Dauchy, Nathan (ARC-TNC)[Computer Sciences Corporation] nathan.dauchy at nasa.gov
Sat Apr 23 09:13:41 PDT 2016


Andreas,

I appreciate the pointers!  After investigating various tar versions, I found that there are a couple of different optimizations for handling sparse files.  I also found this interesting tidbit on the "--sparse" option which helps explain the performance problem... for files with allocated blocks, tar might actually be reading the whole file *twice*:
    https://lists.freedesktop.org/archives/systemd-devel/2015-August/033935.html
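To make the double-read concrete, here is my own rough sketch (a simplification, not tar's actual code) of what the old-style sparse scan amounts to: one full read of every block just to build the map of non-zero regions, before the data regions get read a second time for the archive itself:

```python
import os

BLOCK = 512  # scan granularity; a stand-in for tar's internal block size

def naive_sparse_map(path):
    """One full pass over the file just to find the non-zero regions.

    This loosely mimics how the pre-SEEK_HOLE sparse scan worked:
    every byte is read once here, and the data regions are then read
    *again* when the archive members are written out.
    """
    regions = []   # list of (offset, length) spans that hold data
    start = None
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(BLOCK)
            if not chunk:
                break
            if chunk.count(0) != len(chunk):   # block contains real data
                if start is None:
                    start = offset
            elif start is not None:            # end of a data run
                regions.append((start, offset - start))
                start = None
            offset += len(chunk)
    if start is not None:
        regions.append((start, offset - start))
    return regions
```

For a mostly-empty 28G file that still has a block or two allocated, this is 28G of reads producing almost nothing, which matches the strace output below.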

* The simple patch to deal with files that have 0 blocks is included in the CentOS-6.7 version of tar-1.23:
    https://bugzilla.lustre.org/show_bug.cgi?id=21376
    https://jira.hpdd.intel.com/browse/LU-682
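As I read it, the gist of that patch is tiny: if st_blocks is zero, skip reading entirely and record the whole file as one hole. A sketch of the check (my paraphrase of the idea, not the actual patch):

```python
import os

def is_fully_sparse(path):
    """True for files with a nonzero size but no allocated blocks.

    Such files can be archived as a single hole without reading a
    byte -- the shortcut the tar-1.23 patch adds.  A file with even
    one allocated block falls back to the full scan, which is exactly
    what bites on an MDT where xattrs spill into a data block.
    """
    st = os.stat(path)
    return st.st_size > 0 and st.st_blocks == 0
```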

* The latest tar release (1.28) still seems to have the issue of needing to read the whole file IF there are any blocks allocated to it.

* The master branch of tar has new and improved handling using SEEK_HOLE/SEEK_DATA:
    http://git.savannah.gnu.org/cgit/tar.git/tree/src/sparse.c
    http://git.savannah.gnu.org/cgit/tar.git/commit/?id=b684326e6958f3a8a58202df933e925571d2fcbf
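For anyone curious what that approach looks like, here is a minimal illustration of the lseek-based extent walk (my own Python sketch, using the SEEK_DATA/SEEK_HOLE constants Linux has exposed since 3.3 -- not the tar code itself). The filesystem hands back the extent map, so no file data is read at all:

```python
import os

def seek_sparse_map(path):
    """Enumerate (offset, length) data regions via SEEK_DATA/SEEK_HOLE.

    Each lseek asks the filesystem where the next data region or hole
    begins, which is why this avoids the double read entirely.  On a
    filesystem without hole reporting, the whole file comes back as
    one data region.
    """
    regions = []
    fd = os.open(path, os.O_RDONLY)
    try:
        end = os.lseek(fd, 0, os.SEEK_END)
        pos = 0
        while pos < end:
            try:
                data = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError:        # ENXIO: no data past this offset
                break
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            regions.append((data, hole - data))
            pos = hole
    finally:
        os.close(fd)
    return regions
```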

* That patch is included in the 1.28.90 "alpha" release.
    http://alpha.gnu.org/gnu/tar/tar-1.28.90.tar.gz
  There was a "plan to release 1.29 within 7-10 days", but that was over a month ago and there were a few additional bug reports posted since then:
    http://comments.gmane.org/gmane.comp.gnu.tar.bugs/6156

* Back in 2011, you floated the idea of using FIEMAP as an optimization instead:
    https://lists.gnu.org/archive/html/bug-tar/2011-02/msg00025.html
Did that ever go anywhere?
Is Intel still supporting this "Lustre-enhanced GNU Tar"?  Or is the git "tools/tar" repo you refer to somewhere else?
    https://wiki.hpdd.intel.com/display/PUB/Lustre+Releases
    https://build.hpdd.intel.com/job/lustre-tar-master/


Next week I plan to try out the alpha version of tar, though I'm not sure if we want to use that in production.  I can also test various stripe widths to confirm the theory that wide striping triggers an external block for xattrs.  Stay tuned for results.
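For the striping test, my rough plan is to correlate the size of each file's trusted.lov xattr with its block count: if the theory holds, files whose striping xattr is too big for the inline inode space should be exactly the ones showing nonzero blocks in stat. A sketch of the bookkeeping -- note the 32-byte header + 24 bytes per stripe layout size and the inline-space threshold are my working assumptions, not verified figures:

```python
# Assumed lov_mds_md_v1 layout: 32-byte header + 24 bytes per stripe.
LOV_HEADER = 32
LOV_PER_STRIPE = 24

def expected_xattr_len(stripe_count):
    """Approximate trusted.lov xattr size for a given stripe count."""
    return LOV_HEADER + LOV_PER_STRIPE * stripe_count

def theory_holds(xattr_len, st_blocks, inline_space=256):
    """True when observation matches prediction: files whose xattr is
    too big for the in-inode space (inline_space is a placeholder --
    the real number depends on inode size and any other xattrs) should
    be the ones with allocated blocks, and small-xattr files should
    have none."""
    predicted_spill = xattr_len > inline_space
    observed_blocks = st_blocks > 0
    return predicted_spill == observed_blocks
```

The inputs would come from os.getxattr(path, "trusted.lov") and os.stat on the snapshot mount; I'll report what the correlation actually looks like.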

Thanks,
Nathan

________________________________________
From: Dilger, Andreas [andreas.dilger at intel.com]
Sent: Friday, April 22, 2016 10:07 PM
To: Dauchy, Nathan (ARC-TNC)[Computer Sciences Corporation]
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] non-empty files found when backing up MDT with LVM snapshots and tar

Nathan,
This should be fixed with a newer version of tar I think.  I don't recall if the sparse file handling was fixed in the 1.23 release or not.  There was a simple patch to tar to handle sparse files properly, but it has been a long time since I looked at this and it might take me some time to find. I think there is an updated version of tar in our HPDD Git "tools/tar" repo or similar (just on the plane so I can't check).

The fact that the MDT inodes are storing a size is an artifact of 1.8, and is not considered a problem. There is definitely no data actually stored in the MDT inodes. The fact that the files have data blocks allocated is likely because the wide striping is stored in an external xattr block.

Cheers, Andreas

> On Apr 21, 2016, at 17:35, Dauchy, Nathan (ARC-TNC)[Computer Sciences Corporation] <nathan.dauchy at nasa.gov> wrote:
>
> Greetings All,
>
> We are trying to set up File-Level Backup of our MDTs using LVM Snapshots.  The procedure used basically combines sections 17.3 and 17.5 from the manual.  (And thanks to A. Dilger for the note that LVM snapshots freeze the filesystem and flush the journal before creating the snapshot, so there shouldn't be anything in the [external] journal!)
>
> Unfortunately, things are not working as well on a real system as they did on a simple testbed.  In particular, I have encountered a surprising situation that might be due to a Lustre bug, could be corruption of our MDT, is perhaps a deficiency in tar, or all of the above.  In general, I'm looking for any tips at all for speeding up tar backups. Read on for the gory details...
>
> CentOS-6.7
> lustre-2.5.3
> tar-1.23-13.el6
> MDT formatted with ldiskfs on LVM
>
> The backup on one of our file systems is taking *much* longer than projected (scaling time with inode count from smaller systems).  The "tar" process is running near 100% of a CPU, going on 28 hours now for a filesystem with 33M inodes, yet the backup file isn't getting much bigger than the 788M it reached several hours ago.
>
> Looking at the tar process, it is spending almost all its time reading... nothing:
>
> # ps uaxw | grep tar
> root     76727  0.0  0.0   3928   392 ?        S    Apr20   0:00 /usr/bin/time tar -czf /tmp/mdt_backup.tgz --posix --sparse --xattrs --totals --exclude ROOT/lost+found/*duplicate* .
> root     76728 98.3  0.0  22376  3060 ?        R    Apr20 1079:28 tar -czf /tmp/mdt_backup.tgz --posix --sparse --xattrs --totals --exclude ROOT/lost+found/*duplicate* .
>
> # strace -p 76728 2>&1 | head -n 100000 | sort | uniq -c
>      1 Process 76728 attached
>  99999 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
>
> The working file nominally has a multi-GB size, but that is expected with size-on-MDT:
>
> # lsof -p 76728 | grep " 3r "
> tar     76728 root    3r   REG  253,6 29010071920 621442280 /mnt/snap/ROOT/userpath/somefile.hdf5
> # ls -lh /mnt/snap/ROOT/userpath/somefile.hdf5
> -rw-r--r-- 1 foo bar 28G Oct 21  2015 /mnt/snap/ROOT/userpath/somefile.hdf5
>
> However, there are blocks allocated to it, which I did NOT expect:
>
> # stat /mnt/snap/ROOT/userpath/somefile.hdf5
>  File: `/mnt/snap/ROOT/userpath/somefile.hdf5'
>  Size: 29010071920    Blocks: 8          IO Block: 4096   regular file
> Device: fd06h/64774d    Inode: 621442280   Links: 1
> Access: (0644/-rw-r--r--)  Uid: (6666/  foo)   Gid: (7777/   bar)
> Access: 2015-10-21 00:14:31.007749412 -0700
> Modify: 2015-10-21 00:14:47.000000000 -0700
> Change: 2015-10-21 00:14:47.000000000 -0700
>
> It turns out there are a whole lot of files on the MDT that are not actually empty:
>
> # nohup find /mnt/snap/ROOT/ -type f -size +1 > /tmp/nonzero_files.out 2>&1 < /dev/null &
>  (...walked the whole MDT in under 20 minutes...)
> # wc -l /tmp/nonzero_files.out
> 717581 /tmp/nonzero_files.out
>
> My theory at this point is that the few blocks allocated to each of those files on the MDT are enough to throw off the sparse file optimization handling that was added to tar a while ago:
>  https://bugzilla.lustre.org/show_bug.cgi?id=21376
>  https://jira.hpdd.intel.com/browse/LU-682
>
> Reading gigs of zeros and compressing them might explain why things are taking so long and yet the mdt_backup tgz file is getting its date stamp updated without getting much bigger.  tar has to read in the whole darn file, some of which are considerably larger than the example below, and hence it takes way too long!
>
>
> So, with all that said, I think it boils down to a few questions:
> 1) Is it expected to find files on the MDT that are not "0 Blocks"?
> 2) If not, how could they have gotten messed up, and is there any hope of fixing them?
> 3) Does anyone know tar well enough to think of how to improve the sparse file handling?
> 4) Would it make sense to make a custom lustre backup tar that notices the extended attributes for a file on MDT and assumes that the file is therefore empty?  (or for future data-on-MDT, doesn't try to read past the "local" data range)
> 5) Are there any other tricks that folks use to speed up file-level MDT backups?
>
> Thanks much,
> Nathan
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


More information about the lustre-discuss mailing list