[Lustre-discuss] Lustre Sparse Files question

Bradley W. Settlemyer settlemyerbw at ornl.gov
Fri Jun 11 07:10:52 PDT 2010


Great, it appears to do exactly what I need.  One more question:  What
interactions with the MDS and OSTs does this IOCTL cause?  That is, does
this IOCTL even require an MDS access, or does it interact with the OSTs
only (obviously I've already opened the file, causing an initial MDS access)?

Cheers,
Brad


On 06/10/2010 06:30 PM, Andreas Dilger wrote:
> On 2010-06-10, at 08:07, Bradley W. Settlemyer wrote:
>>  Is there a mechanism within Lustre for querying the populated extents
>> in a sparse lustre file?  Perhaps some kind of bmap support or an IOCTL
>> for populating an extent map?
>>
>>  I believe ZFS has support for SEEK_HOLE whences, but I didn't know if
>> Lustre has any mechanism to accomplish similar goals.
> 
> On Linux, the equivalent (better?) interface is the FIEMAP ioctl, which returns a readdir-like list of extents into a user-supplied buffer.  We developed this for Lustre, because FIBMAP is wholly inefficient and inadequate to return millions of allocated blocks, and there is no way to express the blocks being stored on different devices.  Also, the FIEMAP ioctl does not need root permission, unlike the FIBMAP ioctl, so it is useful for regular users/tools.
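
To make the "list of extents into a user-supplied buffer" description concrete, here is a minimal sketch of calling FIEMAP from Python. The ioctl number and struct layouts follow my reading of <linux/fiemap.h>; the helper names (parse_fiemap, fiemap) are hypothetical, and treating the first reserved word as a device index on Lustre is an assumption rather than something stated in this thread.

```python
import fcntl
import struct
import sys

FS_IOC_FIEMAP = 0xC020660B        # _IOWR('f', 11, struct fiemap)
FIEMAP_FLAG_SYNC = 0x1            # flush dirty data before mapping
FIEMAP_EXTENT_LAST = 0x1          # set on the final extent of the file
FIEMAP_MAX_OFFSET = 2**64 - 1     # "map to end of file"

# struct fiemap header: fm_start, fm_length, fm_flags,
# fm_mapped_extents, fm_extent_count, fm_reserved
HEADER = struct.Struct("=QQLLLL")
# struct fiemap_extent: fe_logical, fe_physical, fe_length,
# fe_reserved64[2], fe_flags, fe_reserved[3]
EXTENT = struct.Struct("=QQQ2QL3L")


def parse_fiemap(buf):
    """Decode a buffer filled in by FS_IOC_FIEMAP into a list of dicts."""
    mapped = HEADER.unpack_from(buf, 0)[3]
    extents = []
    for i in range(mapped):
        fields = EXTENT.unpack_from(buf, HEADER.size + i * EXTENT.size)
        logical, physical, length, _, _, flags, reserved0, _, _ = fields
        extents.append({
            "logical": logical,      # byte offset (per-object on Lustre)
            "physical": physical,    # byte offset on the backing device
            "length": length,        # bytes
            "flags": flags,
            # Assumption: Lustre reports the OST index in this word.
            "device": reserved0,
            "last": bool(flags & FIEMAP_EXTENT_LAST),
        })
    return extents


def fiemap(path, max_extents=256):
    """Return up to max_extents extents of path (a single ioctl call)."""
    buf = bytearray(HEADER.size + max_extents * EXTENT.size)
    HEADER.pack_into(buf, 0, 0, FIEMAP_MAX_OFFSET, FIEMAP_FLAG_SYNC,
                     0, max_extents, 0)
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)   # fills buf in place
    return parse_fiemap(buf)


if __name__ == "__main__":
    for name in sys.argv[1:]:
        for ext in fiemap(name):
            print(name, ext)
```

A complete tool would repeat the call, advancing fm_start, until it sees an extent with the "last" flag; this sketch stops after one buffer. Filesystems without FIEMAP support raise OSError from the ioctl.
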
> 
> Subsequently the FIEMAP ioctl was adopted into the upstream kernel (with a huge amount of effort), and is now available for ext2/3/4/xfs/reiserfs/btrfs for dumping extent maps to userspace.
> 
> For displaying the FIEMAP data, the filefrag(8) tool was enhanced to use FIEMAP in preference to FIBMAP, if the underlying filesystem supports it.  In the lustre-patched e2fsprogs it correctly handles the presence of stripes on multiple backing devices.  Note that the output format shown below is an improved version that is not in any released e2fsprogs yet (it's in CVS though), but it will be in our next e2fsprogs release and has also been accepted upstream.  The FIEMAP ioctl is available in Lustre 1.8, and in some later versions of 1.6.  However, due to petty infighting when it was accepted upstream, the data format was changed from our original version, which is what the older 1.6 releases implement, so those older releases should not be used.
> 
> 
> Note one major caveat when using FIEMAP on Lustre: it currently uses a slightly different output format from the local-disk filesystems, because for fragmentation visualization (which is what it was originally intended for) it makes sense to display the layout in per-object order.  If the extents were presented in file-offset order there would appear to be fragmentation every 1MB in the file, even though they are allocated contiguously on disk.  For 1-stripe files this is irrelevant and the output is the same.
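
The difference between per-object order and file-offset order can be illustrated with a small sketch. It assumes plain round-robin (RAID0) striping, and that the stripe index is the object's position in the layout (0 for the first object listed). The function names are hypothetical.

```python
MiB = 1024 * 1024


def object_to_file_offset(obj_off, stripe_idx, stripe_size, stripe_count):
    """Map a byte offset inside one stripe object to a file byte offset.

    Assumes round-robin striping: chunk k of object i holds file chunk
    k * stripe_count + i.
    """
    chunk, within = divmod(obj_off, stripe_size)
    return (chunk * stripe_count + stripe_idx) * stripe_size + within


def split_object_extent(obj_start, length, stripe_idx, stripe_size,
                        stripe_count):
    """Split one per-object extent into (file_offset, length) pieces.

    Each piece ends at a stripe boundary, because the next stripe-sized
    chunk of the same object lives stripe_count chunks later in the file.
    """
    pieces = []
    off, end = obj_start, obj_start + length
    while off < end:
        chunk_end = (off // stripe_size + 1) * stripe_size
        n = min(end, chunk_end) - off
        pieces.append((object_to_file_offset(off, stripe_idx, stripe_size,
                                             stripe_count), n))
        off += n
    return pieces


if __name__ == "__main__":
    # One contiguous 24 MiB object extent, third object of a 4-stripe
    # file with a 4 MiB stripe size.
    for file_off, n in split_object_extent(0, 24 * MiB, 2, 4 * MiB, 4):
        print(f"file offset {file_off // MiB:3d} MiB, length {n // MiB} MiB")
```

With these inputs the single on-disk extent becomes six separate 4 MiB pieces in file-offset order, which is the apparent fragmentation the per-object ordering avoids.
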
> 
> 
> On the client with Lustre:
> 
> [adilger@twoshoes]$ filefrag -v /myth/images/Main\ Library/Library6.iPhoto
> Filesystem type is: bd00bd0
> File size of /myth/images/Main Library/Library6.iPhoto is 30240622 (29532 blocks of 1024 bytes)
>  ext:     device_logical:        physical_offset: length:  dev: flags:
>    0:        0..   28671:  637502464.. 637531135:  28672: 0003: network
>    1:    28672..   29531:  637669376.. 637670235:    860: 0003: network,eof
> /myth/images/Main Library/Library6.iPhoto: 2 extents found
> 
> [adilger@twoshoes]$ lfs getstripe /myth/images/Main\ Library/Library6.iPhoto
> /myth/images/Main Library/Library6.iPhoto
> lmm_stripe_count:   1
> lmm_stripe_size:    1048576
> lmm_stripe_offset:  3
>         obdidx           objid          objid            group
>              3          341351        0x53567                0
> 
> 
> On the server with local ldiskfs mount (for comparison, note '-k' argument to use 1024-byte blocks for output, otherwise it defaults to 4096-byte blocks to match the local filesystem blocksize):
> 
> [root@mookie]# mount -t ldiskfs /dev/vgmyth/lvmythost3 /mnt/tmp
> [root@mookie]# filefrag -k -v /mnt/tmp/O/0/d$((341351 % 32))/341351
> Filesystem type is: ef53
> File size of /mnt/tmp/O/0/d7/341351 is 30240622 (29532 blocks of 1024 bytes)
>  ext:     logical_offset:        physical_offset: length: flags:
>    0:        0..   28671:  637502464.. 637531135:  28672: 
>    1:    28672..   29531:  637669376.. 637670235:    860: eof
> /mnt/tmp/O/0/d7/341351: 2 extents found
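
The object path in the command above is derived from the objid and group that `lfs getstripe` prints. A minimal sketch of that arithmetic follows; the layout (O/<group>/d<objid mod 32>/<objid>) is inferred only from the command in this message and may not hold for other Lustre versions, and the function name is hypothetical.

```python
def ldiskfs_object_path(objid, group=0, mount="/mnt/tmp"):
    """Build the backing-object path the way the shell command above does.

    Assumption: objects are hashed into 32 subdirectories by objid % 32.
    """
    return f"{mount}/O/{group}/d{objid % 32}/{objid}"


if __name__ == "__main__":
    # objid 341351 (0x53567) on obdidx 3, from the lfs getstripe output.
    print(ldiskfs_object_path(341351))
```
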
> 
> 
> If there are multiple stripes in a file it will show it with object offsets instead of file offsets:
> 
> [adilger@twoshoes]$ filefrag -v "/myth/tmp/4stripe"
> Filesystem type is: bd00bd0
> File size of /myth/tmp/4stripe is 104857600 (102400 blocks of 1024 bytes)
>  ext:     device_logical:        physical_offset: length:  dev: flags:
>    0:        0..   14335:  179423232.. 179437567:  14336: 0003: network
>    1:    14336..   28671:  179445760.. 179460095:  14336: 0003: network
>    2:        0..    1023:   18482176..  18483199:   1024: 0000: network
>    3:     1024..   24575:   18485248..  18508799:  23552: 0000: network
>    4:        0..   24575:  331166720.. 331191295:  24576: 0004: network
>    5:        0..    8191:  156459008.. 156467199:   8192: 0001: network
>    6:     8192..   14335:  156502016.. 156508159:   6144: 0001: network
>    7:    14336..   18431:  156622848.. 156626943:   4096: 0001: network
>    8:    18432..   24575:  156516352.. 156522495:   6144: 0001: network
> /myth/tmp/4stripe: 6 extents found
> 
> [adilger@twoshoes]$ lfs getstripe -v "/myth/tmp/4stripe"
> /myth/tmp/4stripe
> lmm_magic:          0x0BD10BD0
> lmm_object_gr:      0
> lmm_object_id:      0x24dab9
> lmm_stripe_count:   4
> lmm_stripe_size:    4194304
> lmm_stripe_pattern: 1
> lmm_stripe_offset:  3
>         obdidx           objid          objid            group
>              3          340942        0x533ce                0
>              0          744427        0xb5beb                0
>              4           64720         0xfcd0                0
>              1          602677        0x93235                0
> 
> If you are using this for e.g. skipping sparse parts of the file you would need to do some extra work to convert the object offsets into stripe offsets.  Bug 13192 contains old patches for userspace helper functions that should do most of that work, if you are interested in taking a look at them.  It would also be possible to change Lustre to return extents in file offset order, but this would need a Lustre patch to implement (which is currently not a priority task).   
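
For the sparse-skipping use case, the last step is straightforward once the extents have been translated to file offsets. The sketch below is not the helper code from bug 13192; it is a hypothetical find_holes function that assumes its input is a list of (file_offset, length) pairs in bytes and reports the unallocated ranges.

```python
def find_holes(extents, file_size):
    """Return (offset, length) holes given allocated file-offset extents.

    extents may be in any order and may touch or overlap; they are
    sorted here before the gaps between them are collected.
    """
    holes = []
    pos = 0                              # end of the data seen so far
    for start, length in sorted(extents):
        if start > pos:
            holes.append((pos, start - pos))
        pos = max(pos, start + length)
    if pos < file_size:                  # trailing hole up to EOF
        holes.append((pos, file_size - pos))
    return holes


if __name__ == "__main__":
    allocated = [(8192, 4096), (0, 4096)]
    print(find_holes(allocated, 16384))
```

A copy tool could then read only the ranges between the reported holes. Note that a hole in the extent map only means no blocks are allocated; data still in the client cache may not be reflected unless the map was requested with a sync flag.
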
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.
> 
> 

-- 
Brad Settlemyer
Research Associate
Oak Ridge National Laboratory


