[lustre-discuss] non-empty files found when backing up MDT with LVM snapshots and tar

Dilger, Andreas andreas.dilger at intel.com
Fri Apr 22 21:07:41 PDT 2016


Nathan,
This should be fixed with a newer version of tar, I think.  I don't recall whether the sparse file handling was fixed in the 1.23 release or not.  There was a simple patch to tar to handle sparse files properly, but it has been a long time since I looked at this and it might take me some time to find it.  I think there is an updated version of tar in our HPDD Git "tools/tar" repo or similar (just on the plane, so I can't check).

The fact that the MDT inodes are storing a size is an artifact of 1.8, and is not considered a problem. There is definitely no data actually stored in the MDT inodes. The fact that the files have data blocks allocated is likely because of the wide-striping layout stored in an external xattr block.
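
If you want to confirm that on your system, the raw inode should show an external EA block but no data blocks for the file itself. A rough sketch (the device path is a placeholder for your MDT snapshot LV; the inode number is the one from the stat output below):

# debugfs -R "stat <621442280>" /dev/<mdt_vg>/<mdt_snapshot_lv>
# filefrag -v /mnt/snap/ROOT/userpath/somefile.hdf5

The debugfs "stat" should list the EA block number in its "File ACL:" field (that one 4KB block would account for the 8 x 512-byte blocks reported), and filefrag should report no data extents for the file itself.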

Cheers, Andreas

> On Apr 21, 2016, at 17:35, Dauchy, Nathan (ARC-TNC)[Computer Sciences Corporation] <nathan.dauchy at nasa.gov> wrote:
> 
> Greetings All,
> 
> We are trying to set up File-Level Backup of our MDTs using LVM Snapshots.  The procedure used basically combines sections 17.3 and 17.5 from the manual.  (And thanks to A. Dilger for the note that LVM snapshots freeze the filesystem and flush the journal before creating the snapshot, so there shouldn't be anything in the [external] journal!)
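> 
> For reference, the combined procedure boils down to roughly the following (the VG/LV names here are placeholders rather than our real ones):
> 
> # lvcreate -l 10%FREE -s -n mdt_snap /dev/<mdt_vg>/<mdt_lv>
> # mount -t ldiskfs -o ro /dev/<mdt_vg>/mdt_snap /mnt/snap
> # cd /mnt/snap
> # tar -czf /tmp/mdt_backup.tgz --posix --sparse --xattrs --totals .
> # cd /; umount /mnt/snap; lvremove -f /dev/<mdt_vg>/mdt_snap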
> 
> Unfortunately, things are not working as well on a real system as they did on a simple testbed.  In particular, I have encountered a surprising situation that might be due to a Lustre bug, corruption of our MDT, a deficiency in tar, or all of the above.  In general, I'm looking for any tips at all for speeding up tar backups.  Read on for the gory details...
> 
> CentOS-6.7
> lustre-2.5.3
> tar-1.23-13.el6
> MDT formatted with ldiskfs on LVM
> 
> The backup on one of our file systems is taking *much* longer than projected (the projection came from scaling time with inode count on smaller systems).  The "tar" process is running near 100% of a CPU, going on 28 hours now for a filesystem with 33M inodes, yet the backup file isn't getting much bigger than the 788M it reached several hours ago.
> 
> Looking at the tar process, it is spending almost all its time reading... nothing:
> 
> # ps uaxw | grep tar
> root     76727  0.0  0.0   3928   392 ?        S    Apr20   0:00 /usr/bin/time tar -czf /tmp/mdt_backup.tgz --posix --sparse --xattrs --totals --exclude ROOT/lost+found/*duplicate* .
> root     76728 98.3  0.0  22376  3060 ?        R    Apr20 1079:28 tar -czf /tmp/mdt_backup.tgz --posix --sparse --xattrs --totals --exclude ROOT/lost+found/*duplicate* .
> 
> # strace -p 76728 2>&1 | head -n 100000 | sort | uniq -c
>      1 Process 76728 attached
>  99999 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
> 
> The working file nominally has a multi-GB size, but that is expected with size-on-MDT:
> 
> # lsof -p 76728 | grep " 3r "
> tar     76728 root    3r   REG  253,6 29010071920 621442280 /mnt/snap/ROOT/userpath/somefile.hdf5
> # ls -lh /mnt/snap/ROOT/userpath/somefile.hdf5
> -rw-r--r-- 1 foo bar 28G Oct 21  2015 /mnt/snap/ROOT/userpath/somefile.hdf5
> 
> However, there are blocks allocated to it, which I did NOT expect:
> 
> # stat /mnt/snap/ROOT/userpath/somefile.hdf5
>  File: `/mnt/snap/ROOT/userpath/somefile.hdf5'
>  Size: 29010071920    Blocks: 8          IO Block: 4096   regular file
> Device: fd06h/64774d    Inode: 621442280   Links: 1
> Access: (0644/-rw-r--r--)  Uid: (6666/  foo)   Gid: (7777/   bar)
> Access: 2015-10-21 00:14:31.007749412 -0700
> Modify: 2015-10-21 00:14:47.000000000 -0700
> Change: 2015-10-21 00:14:47.000000000 -0700
> 
> It turns out there are a whole lot of files on the MDT that are not actually empty:
> 
> # nohup find /mnt/snap/ROOT/ -type f -size +1 > /tmp/nonzero_files.out 2>&1 < /dev/null &
>  (...walked the whole MDT in under 20 minutes...)
> # wc -l /tmp/nonzero_files.out
> 717581 /tmp/nonzero_files.out
> 
> My theory at this point is that the few blocks allocated to each of those files on the MDT are enough to defeat the sparse-file optimization that was added to tar a while ago:
>  https://bugzilla.lustre.org/show_bug.cgi?id=21376
>  https://jira.hpdd.intel.com/browse/LU-682
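> 
> If that optimization only kicks in when st_blocks is 0, then presumably every one of those files falls back to the slow full read, even though the allocation is probably just the external xattr block as in the example above.  A quick check that none of them has anything beyond that allocated (just a sketch, using the find output from above):
> 
> # xargs -a /tmp/nonzero_files.out -d '\n' stat -c '%b %n' | awk '$1 > 8' | head
> 
> If that prints nothing, every file on the list has at most 8 x 512-byte blocks allocated, i.e. nothing beyond the xattr block.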
> 
> Reading gigs of zeros and compressing them might explain why things are taking so long and yet the mdt_backup tgz file keeps getting its date stamp updated without getting much bigger.  tar has to read in each whole darn file, some of which are considerably larger than the example above, and hence it takes way too long!
> 
> 
> So, with all that said, I think it boils down to a few questions:
> 1) Is it expected to find files on the MDT that are not "0 Blocks"?
> 2) If not, how could they have gotten messed up, and is there any hope of fixing them?
> 3) Does anyone know tar well enough to think of how to improve the sparse file handling?  (a quick way to compare candidate tars against the stock one is sketched just after this list)
> 4) Would it make sense to make a custom lustre backup tar that notices the extended attributes for a file on MDT and assumes that the file is therefore empty?  (or for future data-on-MDT, doesn't try to read past the "local" data range)
> 5) Are there any other tricks that folks use to speed up file-level MDT backups?
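> 
> For (3) and (5), the cheapest test I can think of for evaluating a candidate (a patched tar from LU-682, a newer upstream tar, etc.) is to time it against the stock tar on a single offending file; the path to the candidate tar below is just a placeholder:
> 
> # cd /mnt/snap
> # time tar -cf /tmp/one_file_test.tar --sparse --xattrs ./ROOT/userpath/somefile.hdf5
> # time /path/to/candidate/tar -cf /tmp/one_file_test.tar --sparse --xattrs ./ROOT/userpath/somefile.hdf5
> 
> If the candidate skips reading the zeros, the second run should finish in seconds rather than many minutes, and the resulting archive should stay tiny either way.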
> 
> Thanks much,
> Nathan
> 
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

