[Lustre-discuss] File sizes on MDT

Andreas Dilger adilger at sun.com
Tue Jul 28 16:33:09 PDT 2009


On Jul 28, 2009  11:22 +0200, Thomas Roth wrote:
> Andreas Dilger wrote:
> > On Jul 27, 2009  14:24 +0200, Thomas Roth wrote:
> >> I'm copying around data between 2 MDTs in a test system. Having mounted
> >> the partition as 'ldiskfs', I had a look in MDT/ROOT. I found all my
> >> test data there, but I'm puzzled by the indicated file sizes. For
> >> example I had put one of my holiday's movies, it's 40MB. On the
> >> ldiskfs-mounted MDT, I find a corresponding entry, which also has 40MB,
> >> as given by 'ls -lh'. Of course, the latter file doesn't have the
> >> contents of that movie, but why is it the same size? 'ls -li' also gives
> >> identical results, btw.
> >> On the other hand, there is another movie which is 6.4MB as such, but
> >> 0B on the MDT partition.
> > 
> > In Lustre 1.6.7 the "approximate" file size started to be stored on the
> > MDT inodes so that filesystem backup utilities can get
> > a fast estimate of the file size w/o having to access
> > the OST objects (that hold the authoritative size).  This size cannot
> > be used as the official file size in 1.x because there isn't sufficient
> > locking and recovery of the size in case of a crash, though a preview of
> > this feature (Size On MDS, SOM) will be available in the 2.0 release.
> 
> I get the impression that this feature hampers the device level backup -
> or is it file level backup: extracting extended attributes and making a
> tar archive of the MDT. The latter step now takes 5 days on our
> production system (which is 1.6.7.1). And right now I'm trying to do an
> rsync copy of that MDT. When that seemed to be stuck on a particular
> file, I checked the file on the source, albeit primitively with
> "ls -lh". That told me that the file was 9.1GB, and the rsync behaves
> just as you would expect when it has to transfer 9GB over the network -
> takes some time. In fact, there are several of these files, and as I
> mentioned, the MDT takes only 13GB on disk, so all of this is a bit
> confusing.

You are right.  In some use cases this feature has hampered backup.  It
is possible to use a block-device level backup (e.g. dd or dump) without
problems.  I use "dd" locally to do MDS backups so I didn't notice this
issue during testing.
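
To illustrate why file-level tools are affected while "dd" is not, here
is a minimal Python sketch (not part of any Lustre tooling) that compares
the apparent size of a file with the space actually allocated for it.
The function name size_report is hypothetical, and the sketch assumes a
Linux system where st_blocks is counted in 512-byte units.  On an
ldiskfs-mounted MDT one would expect the apparent size to be large and
the allocated size to be near zero, which is what makes a naive tar or
rsync read gigabytes of zeroes.

```python
import os

def size_report(path):
    """Return (apparent_bytes, allocated_bytes) for one file.

    apparent_bytes is what 'ls -l' shows; allocated_bytes is what the
    file really occupies on disk (assumes 512-byte st_blocks units).
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512

def tree_report(root):
    """Sum apparent and allocated sizes of all regular files under root."""
    apparent_total = 0
    allocated_total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # skip symlinks and special files
            apparent, allocated = size_report(full)
            apparent_total += apparent
            allocated_total += allocated
    return apparent_total, allocated_total
```

A large gap between the two totals for a tree means a file-level copy
will read far more data than is really stored there.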

> The first attempts to copy the MDT immediately resulted in a target file
> system blown up beyond all proportion. I have since added the option
> "--sparse" to my rsync command line. Now the target system seems to stay
> small, but I have yet to see if the result could be used as an MDT at all.

It would also be possible to modify tar and rsync to use the "FIEMAP" support
available in newer versions of the kernel (2.6.27 at least), so that they
don't have to read all of the data from the file.  This would result in
much faster backups for any kind of sparse files, but as yet that work
hasn't been done.
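
As a rough illustration of the idea of asking the filesystem where the
data is instead of reading every byte, here is a minimal Python sketch.
Note that it does NOT use the FIEMAP ioctl; it swaps in the
SEEK_DATA/SEEK_HOLE interface of lseek(), which needs a much newer
kernel (Linux 3.1 or later) and Python 3.3 or later.  The function name
data_extents is hypothetical.  On a filesystem without hole support the
whole file is simply reported as one data extent.

```python
import errno
import os

def data_extents(path):
    """Return a list of (start, end) byte ranges that contain data.

    Holes are skipped without being read. Uses lseek() with
    SEEK_DATA/SEEK_HOLE, not the FIEMAP ioctl.
    """
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Jump to the next byte at or after offset that holds data.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # only a hole remains up to end of file
                raise
            # Find where this run of data ends (next hole or end of file).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return extents
```

A backup tool built this way would only read the returned ranges and
recreate the gaps between them as holes on the target.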

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



