[Lustre-discuss] Lustre Sparse Files question

Andreas Dilger andreas.dilger at oracle.com
Fri Jun 11 07:49:20 PDT 2010


On 2010-06-11, at 8:10, "Bradley W. Settlemyer"  
<settlemyerbw at ornl.gov> wrote:
> Great, it appears to do exactly what I need.  One more question:  What
> interactions with the MDS and OSTs does this IOCTL cause?  That is, does
> this IOCTL even require an MDS access, or does it interact with the OSTs
> only (obviously I've already opened the file, causing an initial MDS
> access)?

After the initial open, the FIEMAP ioctl generates RPCs only to the
OSTs.  Normally this is a single RPC per stripe, since the protocol
can pack hundreds of extents into a single page (assuming the caller
has a 1-page buffer to receive the extents).
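
For anyone who wants to call the ioctl directly rather than go through
filefrag, here is a minimal sketch in Python.  The ioctl number and the
struct layouts follow linux/fiemap.h; the helper names (fiemap,
parse_fiemap) are made up for illustration.  Reading the first reserved
32-bit word of each extent as the Lustre device (OST) index is an
assumption that should be checked against the Lustre headers.  The
sketch makes one call with a one-page buffer, leaves the flags at 0,
and does not loop to fetch more extents than fit in that buffer.

```python
import fcntl
import struct

# _IOWR('f', 11, struct fiemap) from linux/fs.h
FS_IOC_FIEMAP = 0xC020660B
FIEMAP_MAX_OFFSET = 2**64 - 1

# struct fiemap header: fm_start, fm_length, fm_flags,
# fm_mapped_extents, fm_extent_count, fm_reserved
FIEMAP_HDR = struct.Struct("=QQIIII")
# struct fiemap_extent: fe_logical, fe_physical, fe_length,
# fe_reserved64[2], fe_flags, fe_reserved[3]
FIEMAP_EXTENT = struct.Struct("=QQQQQIIII")


def parse_fiemap(buf):
    """Return (logical, physical, length, flags, reserved0) per mapped extent.

    reserved0 is fe_reserved[0]; treating it as the Lustre device index
    is an assumption, and on local filesystems it is normally zero.
    """
    _start, _length, _flags, mapped, _count, _rsv = FIEMAP_HDR.unpack_from(buf, 0)
    extents = []
    for i in range(mapped):
        offset = FIEMAP_HDR.size + i * FIEMAP_EXTENT.size
        (logical, physical, length, _r64a, _r64b,
         fe_flags, reserved0, _r1, _r2) = FIEMAP_EXTENT.unpack_from(buf, offset)
        extents.append((logical, physical, length, fe_flags, reserved0))
    return extents


def fiemap(path, buf_size=4096):
    """Issue a single FIEMAP ioctl covering the whole file."""
    count = (buf_size - FIEMAP_HDR.size) // FIEMAP_EXTENT.size
    buf = bytearray(buf_size)
    # Ask for every extent from offset 0, as many as fit in the buffer.
    FIEMAP_HDR.pack_into(buf, 0, 0, FIEMAP_MAX_OFFSET, 0, 0, count, 0)
    with open(path, "rb") as f:
        # The mutable buffer is filled in place by the kernel.
        fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)
    return parse_fiemap(buf)
```

Calling fiemap("/path/to/file") on a filesystem that supports FIEMAP
returns byte offsets and lengths; filesystems without support raise
OSError.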

> On 06/10/2010 06:30 PM, Andreas Dilger wrote:
>> On 2010-06-10, at 08:07, Bradley W. Settlemyer wrote:
>>> Is there a mechanism within Lustre for querying the populated  
>>> extents
>>> in a sparse lustre file?  Perhaps some kind of bmap support or an  
>>> IOCTL
>>> for populating an extent map?
>>>
>>> I believe ZFS has support for SEEK_HOLE whences, but I didn't know  
>>> if
>>> Lustre has any mechanism to accomplish similar goals.
>>
>> On Linux, the equivalent (better?) interface is the FIEMAP ioctl,  
>> which returns a readdir-like list of extents into a user-supplied  
>> buffer.  We developed this for Lustre, because FIBMAP is wholly  
>> inefficient and inadequate to return millions of allocated blocks,  
>> and there is no way to express the blocks being stored on different  
>> devices.  Also, the FIEMAP ioctl does not need root permission,  
>> unlike the FIBMAP ioctl, so it is useful for regular users/tools.
>>
>> Subsequently the FIEMAP ioctl was adopted into the upstream kernel
>> (with a huge amount of effort), and is now available for
>> ext2/3/4/xfs/reiserfs/btrfs for dumping extent maps to userspace.
>>
>> For displaying the FIEMAP data, the filefrag(8) tool was enhanced  
>> to use FIEMAP in preference to FIBMAP, if the underlying filesystem  
>> supports it.  In the lustre-patched e2fsprogs it correctly handles  
>> the presence of stripes on multiple backing devices.  Note that the  
>> output format shown below is an improved version that is not in any  
>> released e2fsprogs yet (it's in CVS though), but it will be in our  
>> next e2fsprogs release and has also been accepted upstream.  The
>> FIEMAP ioctl is available in Lustre 1.8 and in some later versions
>> of 1.6.  However, the data format was changed when it was accepted
>> upstream (due to petty infighting), so the older 1.6 releases that
>> carry our original format should not be used.
>>
>>
>> Note one major caveat when using FIEMAP on Lustre: it currently
>> implements a slightly different output format than the local-disk
>> filesystems, because for fragmentation visualization (which is what
>> it was originally intended for) it makes sense to display the layout
>> in per-object order.  If the extents were presented in file-offset
>> order there would appear to be fragmentation every 1MB in the file,
>> even though they are allocated contiguously on disk.  For 1-stripe
>> files this is irrelevant and the output is the same.
>>
>>
>> On the client with Lustre:
>>
>> [adilger at twoshoes]$ filefrag -v /myth/images/Main\ Library/Library6.iPhoto
>> Filesystem type is: bd00bd0
>> File size of /myth/images/Main Library/Library6.iPhoto is 30240622 (29532 blocks of 1024 bytes)
>> ext:     device_logical:        physical_offset: length:  dev: flags:
>>   0:        0..   28671:  637502464.. 637531135:  28672: 0003: network
>>   1:    28672..   29531:  637669376.. 637670235:    860: 0003: network,eof
>> /myth/images/Main Library/Library6.iPhoto: 2 extents found
>>
>> [adilger at twoshoes]$ lfs getstripe /myth/images/Main\ Library/Library6.iPhoto
>> /myth/images/Main Library/Library6.iPhoto
>> lmm_stripe_count:   1
>> lmm_stripe_size:    1048576
>> lmm_stripe_offset:  3
>>        obdidx           objid          objid            group
>>             3          341351        0x53567                0
>>
>>
>> On the server with local ldiskfs mount (for comparison, note '-k'  
>> argument to use 1024-byte blocks for output, otherwise it defaults  
>> to 4096-byte blocks to match the local filesystem blocksize):
>>
>> [root at mookie]# mount -t ldiskfs /dev/vgmyth/lvmythost3 /mnt/tmp
>> [root at mookie]# filefrag -k -v /mnt/tmp/O/0/d$((341351 % 32))/341351
>> Filesystem type is: ef53
>> File size of /mnt/tmp/O/0/d7/341351 is 30240622 (29532 blocks of 1024 bytes)
>> ext:     logical_offset:        physical_offset: length: flags:
>>   0:        0..   28671:  637502464.. 637531135:  28672:
>>   1:    28672..   29531:  637669376.. 637670235:    860: eof
>> /mnt/tmp/O/0/d7/341351: 2 extents found
>>
>>
>> If there are multiple stripes in a file, filefrag will show the
>> extents with object offsets instead of file offsets:
>>
>> [adilger at twoshoes]$ filefrag -v "/myth/tmp/4stripe"
>> Filesystem type is: bd00bd0
>> File size of /myth/tmp/4stripe is 104857600 (102400 blocks of 1024 bytes)
>> ext:     device_logical:        physical_offset: length:  dev: flags:
>>   0:        0..   14335:  179423232.. 179437567:  14336: 0003: network
>>   1:    14336..   28671:  179445760.. 179460095:  14336: 0003: network
>>   2:        0..    1023:   18482176..  18483199:   1024: 0000: network
>>   3:     1024..   24575:   18485248..  18508799:  23552: 0000: network
>>   4:        0..   24575:  331166720.. 331191295:  24576: 0004: network
>>   5:        0..    8191:  156459008.. 156467199:   8192: 0001: network
>>   6:     8192..   14335:  156502016.. 156508159:   6144: 0001: network
>>   7:    14336..   18431:  156622848.. 156626943:   4096: 0001: network
>>   8:    18432..   24575:  156516352.. 156522495:   6144: 0001: network
>> /myth/tmp/4stripe: 6 extents found
>>
>> [adilger at twoshoes]$ lfs getstripe -v "/myth/tmp/4stripe"
>> /myth/tmp/4stripe
>> lmm_magic:          0x0BD10BD0
>> lmm_object_gr:      0
>> lmm_object_id:      0x24dab9
>> lmm_stripe_count:   4
>> lmm_stripe_size:    4194304
>> lmm_stripe_pattern: 1
>> lmm_stripe_offset:  3
>>        obdidx           objid          objid            group
>>             3          340942        0x533ce                0
>>             0          744427        0xb5beb                0
>>             4           64720         0xfcd0                0
>>             1          602677        0x93235                0
>>
>> If you are using this for e.g. skipping sparse parts of the file  
>> you would need to do some extra work to convert the object offsets  
>> into stripe offsets.  Bug 13192 contains old patches for userspace  
>> helper functions that should do most of that work, if you are  
>> interested in taking a look at them.  It would also be possible to  
>> change Lustre to return extents in file offset order, but this  
>> would need a Lustre patch to implement (which is currently not a  
>> priority task).
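
The conversion from object offsets back to file offsets can be sketched
as below.  This is not the code from bug 13192, only an illustration
that assumes a plain round-robin (RAID0-style) layout; the function name
and parameters are hypothetical.  Here stripe_index is the position of
the object in the "lfs getstripe" list (0 for the first row), not the
OST index shown in the dev column, and all values are in bytes, so
filefrag's 1024-byte block numbers would need to be multiplied by 1024
first.

```python
def object_extent_to_file_extents(obj_off, length, stripe_index,
                                  stripe_size, stripe_count):
    """Map one extent in object space to (file_offset, length) pieces.

    An object extent longer than one stripe chunk is not contiguous in
    the file, so it is split at every stripe_size boundary.
    """
    pieces = []
    end = obj_off + length
    while obj_off < end:
        chunk = obj_off // stripe_size   # which stripe chunk of this object
        within = obj_off % stripe_size   # byte offset inside that chunk
        n = min(stripe_size - within, end - obj_off)
        # Chunk k of stripe i sits at file chunk k * stripe_count + i.
        file_off = (chunk * stripe_count + stripe_index) * stripe_size + within
        pieces.append((file_off, n))
        obj_off += n
    return pieces
```

With the 4-stripe, 4 MiB layout above, the first 8 MiB of the second
object in the list (stripe_index 1) maps to two 4 MiB pieces of the
file, starting at 4 MiB and at 20 MiB.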
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Technical Lead
>> Oracle Corporation Canada Inc.
>>
>>
>
> -- 
> Brad Settlemyer
> Research Associate
> Oak Ridge National Laboratory


