[lustre-discuss] non-empty files found when backing up MDT with LVM snapshots and tar
Dauchy, Nathan (ARC-TNC)[Computer Sciences Corporation]
nathan.dauchy at nasa.gov
Thu Apr 21 17:35:06 PDT 2016
Greetings All,
We are trying to set up File-Level Backup of our MDTs using LVM Snapshots. The procedure used basically combines sections 17.3 and 17.5 from the manual. (And thanks to A. Dilger for the note that LVM snapshots freeze the filesystem and flush the journal before creating the snapshot, so there shouldn't be anything in the [external] journal!)
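For anyone curious, the tar step of that procedure looks roughly like the following. This is a runnable sketch only: a scratch directory stands in for the real mounted LVM snapshot, the lvcreate -s / mount -o ro steps from section 17.3 are elided, and all paths are placeholders, not our real ones.

```shell
# Scratch stand-ins for the real snapshot mount and backup target
SNAP_MNT="$(mktemp -d)"        # stand-in for e.g. /mnt/snap
ARCHIVE="$(mktemp -u).tgz"     # stand-in for /tmp/mdt_backup.tgz
touch "$SNAP_MNT/placeholder"  # stand-in for the MDT contents

# The flags we use: POSIX format, sparse handling, and xattrs
# (xattrs are essential for restoring Lustre metadata correctly)
( cd "$SNAP_MNT" && tar -czf "$ARCHIVE" --posix --sparse --xattrs --totals . )
ls -lh "$ARCHIVE"
```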
Unfortunately, things are not working as well on a real system as they did on a simple testbed. In particular, I have encountered a surprising situation that might be due to a Lustre bug, could be corruption of our MDT, is perhaps a deficiency in tar, or all of the above. In general, I'm looking for any tips at all for speeding up tar backups. Read on for the gory details...
CentOS-6.7
lustre-2.5.3
tar-1.23-13.el6
MDT formatted with ldiskfs on LVM
The backup on one of our file systems is taking *much* longer than projected (scaling time with inode count from smaller systems). The "tar" process is running near 100% of a CPU, going on 28 hours now for a filesystem with 33M inodes, yet the backup file isn't getting much bigger than the 788M it reached several hours ago.
Looking at the tar process, it is spending almost all its time reading... nothing:
# ps uaxw | grep tar
root 76727 0.0 0.0 3928 392 ? S Apr20 0:00 /usr/bin/time tar -czf /tmp/mdt_backup.tgz --posix --sparse --xattrs --totals --exclude ROOT/lost+found/*duplicate* .
root 76728 98.3 0.0 22376 3060 ? R Apr20 1079:28 tar -czf /tmp/mdt_backup.tgz --posix --sparse --xattrs --totals --exclude ROOT/lost+found/*duplicate* .
# strace -p 76728 2>&1 | head -n 100000 | sort | uniq -c
1 Process 76728 attached
99999 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
The working file nominally has a multi-GB size, but that is expected with size-on-MDT:
# lsof -p 76728 | grep " 3r "
tar 76728 root 3r REG 253,6 29010071920 621442280 /mnt/snap/ROOT/userpath/somefile.hdf5
# ls -lh /mnt/snap/ROOT/userpath/somefile.hdf5
-rw-r--r-- 1 foo bar 28G Oct 21 2015 /mnt/snap/ROOT/userpath/somefile.hdf5
However, there are blocks allocated to it, which I did NOT expect:
# stat /mnt/snap/ROOT/userpath/somefile.hdf5
File: `/mnt/snap/ROOT/userpath/somefile.hdf5'
Size: 29010071920 Blocks: 8 IO Block: 4096 regular file
Device: fd06h/64774d Inode: 621442280 Links: 1
Access: (0644/-rw-r--r--) Uid: (6666/ foo) Gid: (7777/ bar)
Access: 2015-10-21 00:14:31.007749412 -0700
Modify: 2015-10-21 00:14:47.000000000 -0700
Change: 2015-10-21 00:14:47.000000000 -0700
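As a quick sanity check for this sort of thing, here's a small Python sketch (my own, not from the manual) that compares apparent size with allocated bytes; note that st_blocks is always in 512-byte units, regardless of the filesystem block size:

```python
import os
import tempfile

def allocated_bytes(path):
    """Bytes actually allocated on disk; st_blocks is in 512-byte units."""
    return os.stat(path).st_blocks * 512

# Demo: a fully sparse file with a huge apparent size, like a healthy
# MDT inode should look (hypothetical scratch file, not the real path).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.truncate(29010071920)
    path = f.name

print(os.path.getsize(path))   # apparent size: 29010071920
print(allocated_bytes(path))   # allocated: typically 0 for a hole-only file
os.unlink(path)
```

A healthy size-on-MDT inode should show a large apparent size but (near) zero allocated bytes; the 8-block files above fail that test.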
It turns out there are a whole lot of files on the MDT that are not actually empty:
# nohup find /mnt/snap/ROOT/ -type f -size +1 > /tmp/nonzero_files.out 2>&1 < /dev/null &
(...walked the whole MDT in under 20 minutes...)
# wc -l /tmp/nonzero_files.out
717581 /tmp/nonzero_files.out
My theory at this point is that the few blocks allocated to each of those files on the MDT are enough to throw off the sparse file optimization handling that was added to tar a while ago:
https://bugzilla.lustre.org/show_bug.cgi?id=21376
https://jira.hpdd.intel.com/browse/LU-682
Reading gigs of zeros and compressing them would explain why things are taking so long and yet the mdt_backup tgz file keeps getting its date stamp updated without getting much bigger. tar has to read in the whole darn file (and some of those files are considerably larger than the example above), and hence it takes way too long!
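On kernels with SEEK_DATA/SEEK_HOLE support (Linux 3.1+, so probably not a stock CentOS-6 kernel), a backup tool could enumerate just the allocated extents instead of reading gigabytes of zeros. A rough Python sketch of the idea (my own illustration; as far as I can tell, tar 1.23 does not do this and scans the whole file byte by byte):

```python
import os

def data_extents(path):
    """Return (offset, length) pairs for allocated regions, skipping holes,
    via lseek(SEEK_DATA) / lseek(SEEK_HOLE) (available on Linux >= 3.1)."""
    extents = []
    with open(path, 'rb') as f:
        fd = f.fileno()
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError:      # ENXIO: nothing but holes past offset
                break
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end - start))
            offset = end
    return extents

# Demo: 4 KiB of data, a ~1 MiB hole, then 4 KiB more data.
demo = '/tmp/sparse_demo'      # hypothetical scratch path
with open(demo, 'wb') as f:
    f.write(b'x' * 4096)
    f.seek(1 << 20)
    f.write(b'y' * 4096)
print(data_extents(demo))      # only the data regions; the hole is skipped
os.unlink(demo)
```

Caveat: filesystems are allowed to report the entire file as one big data extent, so this degrades to a full read in the worst case, but it would never be slower than what tar does now.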
So, with all that said, I think it boils down to a few questions:
1) Is it expected to find files on the MDT that are not "0 Blocks"?
2) If not, how could they have gotten messed up, and is there any hope of fixing them?
3) Does anyone know tar well enough to think of how to improve the sparse file handling?
4) Would it make sense to make a custom lustre backup tar that notices the extended attributes for a file on MDT and assumes that the file is therefore empty? (or for future data-on-MDT, doesn't try to read past the "local" data range)
5) Are there any other tricks that folks use to speed up file-level MDT backups?
Thanks much,
Nathan