[lustre-discuss] non-empty files found when backing up MDT with LVM snapshots and tar

Thu Apr 21 17:35:06 PDT 2016

Greetings All,

We are trying to set up File-Level Backup of our MDTs using LVM Snapshots.  The procedure used basically combines sections 17.3 and 17.5 from the manual.  (And thanks to A. Dilger for the note that LVM snapshots freeze the filesystem and flush the journal before creating the snapshot, so there shouldn't be anything in the [external] journal!)

Unfortunately, things are not working as well on a real system as they did on a simple testbed.  In particular, I have encountered a surprising situation that might be due to a Lustre bug, could be corruption of our MDT, is perhaps a deficiency in tar, or all of the above.  In general, I'm looking for any tips at all for speeding up tar backups. Read on for the gory details...

CentOS-6.7
lustre-2.5.3
tar-1.23-13.el6
MDT formatted with ldiskfs on LVM

The backup on one of our file systems is taking *much* longer than projected (scaling time with inode count from smaller systems).  The "tar" process is running near 100% of a CPU, going on 28 hours now for a filesystem with 33M inodes, yet the backup file isn't getting much bigger than the 788M it reached several hours ago.

Looking at the tar process, it is spending almost all its time reading... nothing:

# ps uaxw | grep tar
root     76727  0.0  0.0   3928   392 ?        S    Apr20   0:00 /usr/bin/time tar -czf /tmp/mdt_backup.tgz --posix --sparse --xattrs --totals --exclude ROOT/lost+found/*duplicate* .
root     76728 98.3  0.0  22376  3060 ?        R    Apr20 1079:28 tar -czf /tmp/mdt_backup.tgz --posix --sparse --xattrs --totals --exclude ROOT/lost+found/*duplicate* .

# strace -p 76728 2>&1 | head -n 100000 | sort | uniq -c
      1 Process 76728 attached
  99999 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512

The working file nominally has a multi-GB size, but that is expected with size-on-MDT:

# lsof -p 76728 | grep " 3r "
tar     76728 root    3r   REG  253,6 29010071920 621442280 /mnt/snap/ROOT/userpath/somefile.hdf5
# ls -lh /mnt/snap/ROOT/userpath/somefile.hdf5
-rw-r--r-- 1 foo bar 28G Oct 21  2015 /mnt/snap/ROOT/userpath/somefile.hdf5

However, there are blocks allocated to it, which I did NOT expect:

# stat /mnt/snap/ROOT/userpath/somefile.hdf5
  File: `/mnt/snap/ROOT/userpath/somefile.hdf5'
  Size: 29010071920    Blocks: 8          IO Block: 4096   regular file
Device: fd06h/64774d    Inode: 621442280   Links: 1
Access: (0644/-rw-r--r--)  Uid: (6666/  foo)   Gid: (7777/   bar)
Access: 2015-10-21 00:14:31.007749412 -0700
Modify: 2015-10-21 00:14:47.000000000 -0700
Change: 2015-10-21 00:14:47.000000000 -0700

It turns out there are a whole lot of files on the MDT that are not actually empty:

# nohup find /mnt/snap/ROOT/ -type f -size +1 > /tmp/nonzero_files.out 2>&1 < /dev/null &
  (...walked the whole MDT in under 20 minutes...)
# wc -l /tmp/nonzero_files.out
717581 /tmp/nonzero_files.out
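
One caveat on the find command above: GNU find's -size test compares the apparent size (st_size, in 512-byte units, rounded up), not the allocated block count, so with size-on-MDT it can also match files that have no blocks allocated at all.  A rough sketch for listing only files that really have blocks allocated, using -printf '%b' (the st_blocks count).  The function names are illustrative, GNU find is assumed, and this has not been run against a real MDT snapshot:

```shell
# List regular files under a directory that have at least one allocated block.
# Output per file: "<512-byte blocks> <apparent size in bytes> <path>"
# Assumes GNU find (-printf with %b, %s, %p).
files_with_blocks() {
    find "$1" -type f -printf '%b %s %p\n' | awk '$1 > 0'
}

# Summarize such a listing read from stdin:
# file count, total allocated bytes, total apparent bytes.
# %.0f is used so that large totals are not truncated by awk's %d.
summarize_blocks() {
    awk '{ n++; alloc += $1 * 512; apparent += $2 }
         END { printf "%.0f files, %.0f allocated bytes, %.0f apparent bytes\n",
                      n, alloc, apparent }'
}
```

For example: files_with_blocks /mnt/snap/ROOT > /tmp/files_with_blocks.out, then summarize_blocks < /tmp/files_with_blocks.out.  The apparent-bytes total is roughly what a reader that does not skip holes would have to scan.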

My theory at this point is that the few blocks allocated to each of those files on the MDT are enough to throw off the sparse file optimization handling that was added to tar a while ago:
  https://bugzilla.lustre.org/show_bug.cgi?id=21376
  https://jira.hpdd.intel.com/browse/LU-682

Reading gigs of zeros and compressing them might explain why things are taking so long and yet the mdt_backup tgz file is getting its date stamp updated without getting much bigger.  tar has to read in the whole darn file, some of which are considerably larger than the example above, and hence it takes way too long!
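
One way to check this theory on a testbed, rather than on the real MDT, would be to build a synthetic file that mimics the case above (large apparent size, one allocated block, contents all zeros) and time tar --sparse on it.  A minimal sketch; make_probe and sparse_archive_size are made-up helper names, and the single 4 KiB block of zeros written at offset 0 merely stands in for whatever actually occupies the blocks on the MDT.  The archive goes through a pipe because GNU tar skips reading file data when the archive is /dev/null:

```shell
# Create a file with a large apparent size and one allocated block of zeros.
# $1 = path, $2 = apparent size in bytes.
make_probe() {
    truncate -s "$2" "$1" &&
    dd if=/dev/zero of="$1" bs=4096 count=1 conv=notrunc 2>/dev/null
}

# Archive the probe with --sparse and print the resulting archive size in bytes.
# Prefix the call with `time` to see how long tar spends scanning the file.
sparse_archive_size() {
    tar -cf - --sparse -C "$(dirname "$1")" "$(basename "$1")" | wc -c
}
```

For example: make_probe /tmp/probe 29010071920, then time sparse_archive_size /tmp/probe.  If the run time grows with the apparent size while the archive stays tiny, tar is reading through the zeros rather than skipping them.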

So, with all that said, I think it boils down to a few questions:
1) Is it expected to find files on the MDT that are not "0 Blocks"?
2) If not, how could they have gotten messed up, and is there any hope of fixing them?
3) Does anyone know tar well enough to think of how to improve the sparse file handling?
4) Would it make sense to make a custom Lustre backup tar that notices the extended attributes of a file on the MDT and therefore assumes that the file is empty?  (or for future data-on-MDT, doesn't try to read past the "local" data range)
5) Are there any other tricks that folks use to speed up file-level MDT backups?

Thanks much,
Nathan