[Lustre-discuss] Lustre Sparse Files question

Thu Jun 10 15:30:56 PDT 2010

On 2010-06-10, at 08:07, Bradley W. Settlemyer wrote:
>  Is there a mechanism within Lustre for querying the populated extents
> in a sparse lustre file?  Perhaps some kind of bmap support or an IOCTL
> for populating an extent map?
> 
>  I believe ZFS has support for SEEK_HOLE whences, but I didn't know if
> Lustre has any mechanism to accomplish similar goals.

On Linux, the equivalent (and arguably better) interface is the FIEMAP ioctl, which returns a readdir-like list of extents in a user-supplied buffer.  We developed it for Lustre because FIBMAP, which maps a single block per call, is far too inefficient for files with millions of allocated blocks, and it has no way to express that blocks are stored on different devices.  FIEMAP also does not need root permission, unlike FIBMAP, so it is usable by regular users and tools.
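To make the "list of extents in a user-supplied buffer" concrete, here is a minimal sketch of calling FIEMAP from Python.  The struct layouts and constants are taken from the upstream <linux/fiemap.h> and <linux/fs.h> headers; the helper names (parse_fiemap, fiemap) and the 256-extent buffer size are my own choices for illustration, and this decodes only the generic upstream format, not anything Lustre-specific.

```python
import fcntl
import struct

FS_IOC_FIEMAP = 0xC020660B   # _IOWR('f', 11, struct fiemap) from <linux/fs.h>
FIEMAP_FLAG_SYNC = 0x1       # ask the kernel to flush dirty data before mapping
FIEMAP_EXTENT_LAST = 0x1     # flag set on the final extent of the file

# struct fiemap header: fm_start, fm_length, fm_flags,
# fm_mapped_extents, fm_extent_count, fm_reserved
HDR = struct.Struct("=QQLLLL")
# struct fiemap_extent: fe_logical, fe_physical, fe_length,
# two reserved u64, fe_flags, three reserved u32
EXT = struct.Struct("=QQQQQLLLL")


def parse_fiemap(buf):
    """Decode a filled-in fiemap buffer into (logical, physical, length, flags)."""
    mapped = HDR.unpack_from(buf, 0)[3]
    extents = []
    for i in range(mapped):
        f = EXT.unpack_from(buf, HDR.size + i * EXT.size)
        extents.append((f[0], f[1], f[2], f[5]))
    return extents


def fiemap(path, max_extents=256):
    """Return up to max_extents extents for path (hypothetical helper).

    Raises OSError if the filesystem does not implement FIEMAP.
    """
    buf = bytearray(HDR.size + max_extents * EXT.size)
    # Map the whole file: start at 0, length of ~0 means "to end of file".
    HDR.pack_into(buf, 0, 0, 0xFFFFFFFFFFFFFFFF, FIEMAP_FLAG_SYNC,
                  0, max_extents, 0)
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)  # kernel fills buf in place
    return parse_fiemap(buf)
```

On a local filesystem the gaps between consecutive (logical, length) pairs are the holes.  A file with more than max_extents extents would need repeated calls with fm_start advanced past the last extent returned, which this sketch leaves out.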

Subsequently the FIEMAP ioctl was adopted into the upstream kernel (with a huge amount of effort), and is now available for ext2/3/4/xfs/reiserfs/btrfs for dumping extent maps to userspace.

For displaying the FIEMAP data, the filefrag(8) tool was enhanced to use FIEMAP in preference to FIBMAP if the underlying filesystem supports it.  In the lustre-patched e2fsprogs it correctly handles the presence of stripes on multiple backing devices.  Note that the output format shown below is an improved version that is not in any released e2fsprogs yet (it's in CVS though), but it will be in our next e2fsprogs release and has also been accepted upstream.

The FIEMAP ioctl is available in Lustre 1.8 and in some later 1.6 releases.  However, due to petty infighting when it was accepted upstream, the data format was changed from our original version.  Older 1.6 releases still use that original format and should not be used.

Note one major caveat when using FIEMAP on Lustre: it currently returns extents in a slightly different order than local-disk filesystems do.  For fragmentation visualization (which is what it was originally intended for) it makes sense to display the layout in per-object order.  If the extents were presented in file-offset order, there would appear to be fragmentation at every stripe boundary in the file (every 1MB with a 1MB stripe size), even though the blocks are allocated contiguously on disk.  For 1-stripe files this is irrelevant and the output is the same.

On the client with Lustre:

[adilger at twoshoes]$ filefrag -v /myth/images/Main\ Library/Library6.iPhoto 
Filesystem type is: bd00bd0
File size of /myth/images/Main Library/Library6.iPhoto is 30240622 (29532 blocks of 1024 bytes)
 ext:     device_logical:        physical_offset: length:  dev: flags:
   0:        0..   28671:  637502464.. 637531135:  28672: 0003: network
   1:    28672..   29531:  637669376.. 637670235:    860: 0003: network,eof
/myth/images/Main Library/Library6.iPhoto: 2 extents found

[adilger at twoshoes]$ lfs getstripe /myth/images/Main\ Library/Library6.iPhoto
/myth/images/Main Library/Library6.iPhoto
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_stripe_offset:  3
        obdidx           objid          objid            group
             3          341351        0x53567                0

On the server with a local ldiskfs mount, for comparison.  Note the '-k' argument to use 1024-byte blocks for output; otherwise filefrag defaults to 4096-byte blocks to match the local filesystem blocksize:

[root at mookie]# mount -t ldiskfs /dev/vgmyth/lvmythost3 /mnt/tmp
[root at mookie]# filefrag -k -v /mnt/tmp/O/0/d$((341351 % 32))/341351
Filesystem type is: ef53
File size of /mnt/tmp/O/0/d7/341351 is 30240622 (29532 blocks of 1024 bytes)
 ext:     logical_offset:        physical_offset: length: flags:
   0:        0..   28671:  637502464.. 637531135:  28672: 
   1:    28672..   29531:  637669376.. 637670235:    860: eof
/mnt/tmp/O/0/d7/341351: 2 extents found
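
The object path used in the filefrag command above follows the pattern O/<group>/d<objid % 32>/<objid>.  As a small sketch of that arithmetic (the function name is hypothetical, and the 32 subdirectories are taken from the shell expression above rather than from any Lustre API):

```python
def ost_object_path(objid, group=0, subdirs=32):
    """Build the ldiskfs-relative path of an OST object.

    Mirrors the shell expression d$((objid % 32)) used above; the
    subdirectory count of 32 is assumed from that example.
    """
    return f"O/{group}/d{objid % subdirs}/{objid}"


# objid 341351 comes from the lfs getstripe output above.
print(ost_object_path(341351))  # O/0/d7/341351
```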

If there are multiple stripes in a file, filefrag shows the extents with object offsets instead of file offsets, grouped by backing device:

[adilger at twoshoes]$ filefrag -v "/myth/tmp/4stripe"
Filesystem type is: bd00bd0
File size of /myth/tmp/4stripe is 104857600 (102400 blocks of 1024 bytes)
 ext:     device_logical:        physical_offset: length:  dev: flags:
   0:        0..   14335:  179423232.. 179437567:  14336: 0003: network
   1:    14336..   28671:  179445760.. 179460095:  14336: 0003: network
   2:        0..    1023:   18482176..  18483199:   1024: 0000: network
   3:     1024..   24575:   18485248..  18508799:  23552: 0000: network
   4:        0..   24575:  331166720.. 331191295:  24576: 0004: network
   5:        0..    8191:  156459008.. 156467199:   8192: 0001: network
   6:     8192..   14335:  156502016.. 156508159:   6144: 0001: network
   7:    14336..   18431:  156622848.. 156626943:   4096: 0001: network
   8:    18432..   24575:  156516352.. 156522495:   6144: 0001: network
/myth/tmp/4stripe: 6 extents found

[adilger at twoshoes]$ lfs getstripe -v "/myth/tmp/4stripe"
/myth/tmp/4stripe
lmm_magic:          0x0BD10BD0
lmm_object_gr:      0
lmm_object_id:      0x24dab9
lmm_stripe_count:   4
lmm_stripe_size:    4194304
lmm_stripe_pattern: 1
lmm_stripe_offset:  3
        obdidx           objid          objid            group
             3          340942        0x533ce                0
             0          744427        0xb5beb                0
             4           64720         0xfcd0                0
             1          602677        0x93235                0

If you are using this for, e.g., skipping sparse parts of the file, you would need to do some extra work to convert the object offsets back into file offsets.  Bug 13192 contains old patches for userspace helper functions that should do most of that work, if you are interested in taking a look at them.  It would also be possible to change Lustre to return extents in file-offset order, but this would need a Lustre patch to implement (which is currently not a priority task).
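
As a rough illustration of that conversion (this is not the code from bug 13192), here is a sketch assuming plain RAID0 round-robin striping.  The function name is hypothetical, all values are in bytes, and stripe_index is assumed to be the position of the object in the lfs getstripe listing (0 for the first row), not the obdidx value.

```python
def object_extent_to_file_extents(obj_off, length, stripe_index,
                                  stripe_size, stripe_count):
    """Map one extent within an OST object to a list of file extents.

    Assumes round-robin (RAID0) striping: chunk k of the object at
    stripe_index holds file chunk k * stripe_count + stripe_index.
    Returns (file_offset, length) pairs, one per stripe-sized chunk.
    """
    out = []
    end = obj_off + length
    while obj_off < end:
        chunk_no, within = divmod(obj_off, stripe_size)
        # Do not run past the end of the current stripe-sized chunk.
        n = min(stripe_size - within, end - obj_off)
        file_off = (chunk_no * stripe_count + stripe_index) * stripe_size + within
        out.append((file_off, n))
        obj_off += n
    return out


MiB = 1 << 20
# First 8 MiB of the second object (stripe_index 1) of a 4 x 4 MiB layout:
print(object_extent_to_file_extents(0, 8 * MiB, 1, 4 * MiB, 4))
# [(4194304, 4194304), (20971520, 4194304)]
```

Since filefrag reports 1024-byte blocks, its offsets would have to be multiplied by 1024 before being passed in.  Sorting the resulting file extents from all objects by offset gives the populated ranges, and the gaps between them are the holes.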

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.