[Lustre-discuss] MDT backup (using tar) taking very long

Frederik Ferner frederik.ferner at diamond.ac.uk
Thu Sep 2 06:42:20 PDT 2010


Hi List,

we are currently reviewing our backup policy for our Lustre file system 
as backups of the MDT are taking longer and longer.

So far we create an LVM snapshot of our MDT, mount it via ldiskfs, 
run getfattr and getfacl, and then run tar (RHEL5 version), basically 
following the instructions in the manual. The tar options include 
--sparse and --numeric-owner.
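
For reference, the procedure looks roughly like this (device, mount 
point and backup paths are placeholders, not our real ones, and the 
exact getfattr/getfacl options should be checked against the manual):

  # snapshot the MDT volume and mount it read-only as ldiskfs
  lvcreate -L20G -s -n mdt-snap /dev/vg_mds/mdt
  mount -t ldiskfs -o ro /dev/vg_mds/mdt-snap /mnt/mdt-snap

  # save extended attributes and ACLs
  cd /mnt/mdt-snap
  getfattr -R -d -m '.*' -P . > /backup/ea.bak
  getfacl -R . > /backup/acl.bak

  # archive the files, with gzip as a separate process
  tar cf - --sparse --numeric-owner . | gzip > /backup/mdt-backup.tar.gz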

At the moment I've got a backup running where the tar process started on 
Tuesday, so it has now been running for more than 24h. Including the 
getfattr and getfacl calls (running in parallel), the whole backup has 
so far been running for more than 48h to back up the 700GB MDT of a 
214TB Lustre file system. The tar file created so far is about 2GB, 
compressed with gzip.

Tar is currently using anything between 30% and 100% CPU according to 
top, and gzip is below 1% CPU usage. Overall the MDS is fairly idle; 
load is about 1.2 on an 8-core machine, and top reports this for the CPUs:

<snip>
Cpu(s):  4.2%us,  4.5%sy,  0.0%ni, 85.8%id,  5.2%wa,  0.0%hi,  0.2%si,  0.0%st
</snip>

vmstat is not showing any I/O worth mentioning, only a few (10-1000) 
blocks per second.

Some details of the Lustre file system are below. The MDS is 
running Lustre 1.6.7.2.ddn3.5 plus a patch for bz #22820 on RHEL5.

[bnh65367 at cs04r-sc-com01-18 ~]$ lfs df -h
UUID                     bytes      Used Available  Use% Mounted on
lustre01-MDT0000_UUID    699.9G     22.1G    677.8G    3% /mnt/lustre01[MDT:0]
[snip]
filesystem summary:     214.9T    146.6T     68.3T   68% /mnt/lustre01

[bnh65367 at cs04r-sc-com01-18 ~]$ lfs df -ih
UUID                    Inodes     IUsed     IFree IUse% Mounted on
lustre01-MDT0000_UUID    200.0M     71.0M    129.0M   35% /mnt/lustre01[MDT:0]
[snip]
filesystem summary:     200.0M     71.0M    129.0M   35% /mnt/lustre01

Is this comparable to the backup times other people experience using tar?

Could this be because tar has to read in the whole file (all zeros) 
before deciding that it is a sparse file?
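
One thing I could do to check (just a diagnostic sketch, the paths are 
examples) is to compare the apparent size against the allocated blocks 
for a sample file on the mounted snapshot, and to time tar against a 
metadata-only walk over the same subtree:

  # apparent size (ls) vs allocated blocks (du) for one file
  ls -l /mnt/mdt-snap/ROOT/some/file
  du -k /mnt/mdt-snap/ROOT/some/file

  # if tar is much slower than a plain metadata walk, it is
  # spending its time reading file contents
  time find /mnt/mdt-snap/ROOT/some/dir -ls > /dev/null
  time tar cf - --sparse /mnt/mdt-snap/ROOT/some/dir > /dev/null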

For comparison, a backup using dd and gzip 'only' took about 8h, and 
gzip was using 100% of one CPU core for all of that time, so with a 
faster compression algorithm this seems a much better option. Are there 
any dangerous downsides to this approach that I have missed?
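
The dd run was something along these lines (device and file names are 
placeholders); a faster compressor would simply replace the gzip stage, 
e.g. gzip -1 or something like lzop if it is available:

  # raw image of the snapshot device, gzip as a separate stage
  dd if=/dev/vg_mds/mdt-snap bs=1M | gzip > /backup/mdt-image.gz

  # restoring would be the reverse, onto a device of at least the
  # same size
  gunzip -c /backup/mdt-image.gz | dd of=/dev/vg_mds/mdt bs=1M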

Kind regards,
Frederik
-- 
Frederik Ferner
Computer Systems Administrator		phone: +44 1235 77 8624
Diamond Light Source Ltd.		mob:   +44 7917 08 5110
(Apologies in advance for the lines below. Some bits are a legal
requirement and I have no control over them.)


