[Lustre-discuss] Lustre Sparse Files question
Andreas Dilger
andreas.dilger at oracle.com
Fri Jun 11 07:49:20 PDT 2010
On 2010-06-11, at 8:10, "Bradley W. Settlemyer"
<settlemyerbw at ornl.gov> wrote:
> Great, it appears to do exactly what I need. One more question: what
> interactions with the MDS and OSTs does this IOCTL cause? That is, does
> this IOCTL even require an MDS access, or does it interact with the OSTs
> only (obviously I've already opened the file, causing an initial MDS
> access)?
After the initial open, the FIEMAP ioctl generates RPCs only to the
OSTs. Normally this is a single RPC per stripe, since the protocol can
pack hundreds of extents into a single page (assuming the caller has a
1-page buffer to receive the extents).
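
For readers who want to see what a caller of the ioctl does, here is a minimal Python sketch; it is not code from this thread. It assumes the upstream Linux `struct fiemap` layout (a 32-byte header followed by 56-byte extent records) and the `FS_IOC_FIEMAP` request number 0xC020660B as encoded on x86 and ARM; other architectures may encode it differently. The helper names (`build_request`, `parse_reply`, `fiemap_extents`) are hypothetical. It fetches a single buffer of extents and does not decode any Lustre-specific device information.

```python
import fcntl
import struct

# Assumption: _IOWR('f', 11, struct fiemap) as encoded on x86/ARM Linux.
FS_IOC_FIEMAP = 0xC020660B
FIEMAP_MAX_OFFSET = 0xFFFFFFFFFFFFFFFF

# struct fiemap header: fm_start, fm_length, fm_flags,
# fm_mapped_extents, fm_extent_count, fm_reserved
FIEMAP_HDR = struct.Struct("=QQIIII")
# struct fiemap_extent: fe_logical, fe_physical, fe_length,
# fe_reserved64[2], fe_flags, fe_reserved[3]
FIEMAP_EXT = struct.Struct("=QQQ2QI3I")


def build_request(start, length, extent_count):
    """Return a mutable buffer holding the header plus room for extents."""
    header = FIEMAP_HDR.pack(start, length, 0, 0, extent_count, 0)
    return bytearray(header + bytes(extent_count * FIEMAP_EXT.size))


def parse_reply(buf):
    """Decode the extents the kernel wrote into the buffer."""
    mapped = FIEMAP_HDR.unpack_from(buf, 0)[3]  # fm_mapped_extents
    extents = []
    for i in range(mapped):
        fields = FIEMAP_EXT.unpack_from(buf, FIEMAP_HDR.size + i * FIEMAP_EXT.size)
        logical, physical, length, flags = fields[0], fields[1], fields[2], fields[5]
        extents.append((logical, physical, length, flags))
    return extents


def fiemap_extents(path, extent_count=64):
    """Fetch up to extent_count extents (64 records fit in a 4096-byte buffer)."""
    buf = build_request(0, FIEMAP_MAX_OFFSET, extent_count)
    with open(path, "rb") as f:
        # The mutable buffer is filled in place by the kernel.
        fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf, True)
    return parse_reply(buf)
```

A file with more extents than the buffer holds would need repeated calls, restarting after the last extent returned; that loop is left out of the sketch.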
> On 06/10/2010 06:30 PM, Andreas Dilger wrote:
>> On 2010-06-10, at 08:07, Bradley W. Settlemyer wrote:
>>> Is there a mechanism within Lustre for querying the populated extents
>>> in a sparse Lustre file? Perhaps some kind of bmap support or an IOCTL
>>> for populating an extent map?
>>>
>>> I believe ZFS has support for SEEK_HOLE whences, but I didn't know if
>>> Lustre has any mechanism to accomplish similar goals.
>>
>> On Linux, the equivalent (better?) interface is the FIEMAP ioctl,
>> which returns a readdir-like list of extents into a user-supplied
>> buffer. We developed this for Lustre, because FIBMAP is wholly
>> inefficient and inadequate to return millions of allocated blocks,
>> and there is no way to express the blocks being stored on different
>> devices. Also, the FIEMAP ioctl does not need root permission,
>> unlike the FIBMAP ioctl, so it is useful for regular users/tools.
>>
>> Subsequently the FIEMAP ioctl was adopted into the upstream kernel
>> (with a huge amount of effort), and is now available for
>> ext2/3/4/xfs/reiserfs/btrfs for dumping extent maps to userspace.
>>
>> For displaying the FIEMAP data, the filefrag(8) tool was enhanced
>> to use FIEMAP in preference to FIBMAP, if the underlying filesystem
>> supports it. In the lustre-patched e2fsprogs it correctly handles
>> the presence of stripes on multiple backing devices. Note that the
>> output format shown below is an improved version that is not in any
>> released e2fsprogs yet (it's in CVS though), but it will be in our
>> next e2fsprogs release and has also been accepted upstream. The
>> FIEMAP ioctl is available in 1.8, and in some later versions of
>> 1.6, but due to petty infighting when it was accepted upstream the
>> data format was changed from our original version that is in older
>> 1.6 releases, and they should not be used.
>>
>>
>> Note one major caveat when using FIEMAP on Lustre: it currently
>> implements a slightly different output format than the local-disk
>> filesystems, because for fragmentation visualization (which is what
>> it was originally intended for) it makes sense to display the layout
>> in per-object order. If the extents were presented in file-offset
>> order there would appear to be fragmentation every 1MB in the file,
>> even though they are allocated contiguously on disk. For 1-stripe
>> files this is irrelevant and the output is the same.
>>
>>
>> On the client with Lustre:
>>
>> [adilger at twoshoes]$ filefrag -v /myth/images/Main\ Library/Library6.iPhoto
>> Filesystem type is: bd00bd0
>> File size of /myth/images/Main Library/Library6.iPhoto is 30240622 (29532 blocks of 1024 bytes)
>>  ext:     device_logical:        physical_offset: length:  dev: flags:
>>    0:        0..   28671:  637502464.. 637531135:  28672: 0003: network
>>    1:    28672..   29531:  637669376.. 637670235:    860: 0003: network,eof
>> /myth/images/Main Library/Library6.iPhoto: 2 extents found
>>
>> [adilger at twoshoes]$ lfs getstripe /myth/images/Main\ Library/Library6.iPhoto
>> /myth/images/Main Library/Library6.iPhoto
>> lmm_stripe_count:   1
>> lmm_stripe_size:    1048576
>> lmm_stripe_offset:  3
>>      obdidx       objid      objid       group
>>           3      341351    0x53567           0
>>
>>
>> On the server with local ldiskfs mount (for comparison, note '-k'
>> argument to use 1024-byte blocks for output, otherwise it defaults
>> to 4096-byte blocks to match the local filesystem blocksize):
>>
>> [root at mookie]# mount -t ldiskfs /dev/vgmyth/lvmythost3 /mnt/tmp
>> [root at mookie]# filefrag -k -v /mnt/tmp/O/0/d$((341351 % 32))/341351
>> Filesystem type is: ef53
>> File size of /mnt/tmp/O/0/d7/341351 is 30240622 (29532 blocks of 1024 bytes)
>>  ext:     logical_offset:        physical_offset: length: flags:
>>    0:        0..   28671:  637502464.. 637531135:  28672:
>>    1:    28672..   29531:  637669376.. 637670235:    860: eof
>> /mnt/tmp/O/0/d7/341351: 2 extents found
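
The object path in the command above follows a pattern that can be sketched as a small helper. This is an illustration inferred only from that one command: the `O/<group>/d<objid % 32>/<objid>` layout, the reading of the `0` component as the object group, and the subdirectory count of 32 are assumptions taken from it, and the function name `ost_object_path` is hypothetical.

```python
def ost_object_path(mount_point, objid, group=0, subdirs=32):
    """Build the backing-file path for an OST object on a local ldiskfs mount.

    Assumes the layout seen in the example command:
    <mount>/O/<group>/d<objid % subdirs>/<objid>
    """
    return f"{mount_point}/O/{group}/d{objid % subdirs}/{objid}"


# objid 341351 comes from the 'lfs getstripe' output for the 1-stripe file.
print(ost_object_path("/mnt/tmp", 341351))
```

For objid 341351 this produces `/mnt/tmp/O/0/d7/341351`, matching the path that filefrag reports.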
>>
>>
>> If there are multiple stripes in a file it will show it with object
>> offsets instead of file offsets:
>>
>> [adilger at twoshoes]$ filefrag -v "/myth/tmp/4stripe"
>> Filesystem type is: bd00bd0
>> File size of /myth/tmp/4stripe is 104857600 (102400 blocks of 1024 bytes)
>>  ext:     device_logical:        physical_offset: length:  dev: flags:
>>    0:        0..   14335:  179423232.. 179437567:  14336: 0003: network
>>    1:    14336..   28671:  179445760.. 179460095:  14336: 0003: network
>>    2:        0..    1023:   18482176..  18483199:   1024: 0000: network
>>    3:     1024..   24575:   18485248..  18508799:  23552: 0000: network
>>    4:        0..   24575:  331166720.. 331191295:  24576: 0004: network
>>    5:        0..    8191:  156459008.. 156467199:   8192: 0001: network
>>    6:     8192..   14335:  156502016.. 156508159:   6144: 0001: network
>>    7:    14336..   18431:  156622848.. 156626943:   4096: 0001: network
>>    8:    18432..   24575:  156516352.. 156522495:   6144: 0001: network
>> /myth/tmp/4stripe: 6 extents found
>>
>> [adilger at twoshoes]$ lfs getstripe -v "/myth/tmp/4stripe"
>> /myth/tmp/4stripe
>> lmm_magic: 0x0BD10BD0
>> lmm_object_gr: 0
>> lmm_object_id: 0x24dab9
>> lmm_stripe_count: 4
>> lmm_stripe_size: 4194304
>> lmm_stripe_pattern: 1
>> lmm_stripe_offset: 3
>> obdidx objid objid group
>> 3 340942 0x533ce 0
>> 0 744427 0xb5beb 0
>> 4 64720 0xfcd0 0
>> 1 602677 0x93235 0
>>
>> If you are using this for e.g. skipping sparse parts of the file
>> you would need to do some extra work to convert the object offsets
>> into stripe offsets. Bug 13192 contains old patches for userspace
>> helper functions that should do most of that work, if you are
>> interested in taking a look at them. It would also be possible to
>> change Lustre to return extents in file offset order, but this
>> would need a Lustre patch to implement (which is currently not a
>> priority task).
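
The conversion described above can be sketched roughly as follows. This is not the code from Bug 13192; it is a minimal illustration that assumes plain round-robin (RAID0) striping, and that the stripe index is the object's position in the `lfs getstripe` listing (0 for the first object listed), not its obdidx. The function names are hypothetical.

```python
def object_to_file_offset(obj_offset, stripe_index, stripe_size, stripe_count):
    """Map a byte offset inside one OST object to a byte offset in the file."""
    stripe_number, within = divmod(obj_offset, stripe_size)
    return (stripe_number * stripe_count + stripe_index) * stripe_size + within


def object_extent_to_file_ranges(obj_offset, length, stripe_index,
                                 stripe_size, stripe_count):
    """Split one object extent into (file_offset, length) pieces.

    An extent that is contiguous in the object is interleaved with the
    other stripes in the file, so it is cut at every stripe boundary.
    """
    ranges = []
    end = obj_offset + length
    while obj_offset < end:
        # Bytes remaining before the next stripe boundary in the object.
        chunk = min(end - obj_offset, stripe_size - obj_offset % stripe_size)
        file_offset = object_to_file_offset(obj_offset, stripe_index,
                                            stripe_size, stripe_count)
        ranges.append((file_offset, chunk))
        obj_offset += chunk
    return ranges


# Extent 3 of the 4-stripe example: device 0000 is the second object in the
# layout (stripe_index 1), covering 1024-byte blocks 1024..24575.
pieces = object_extent_to_file_ranges(1024 * 1024, 23552 * 1024,
                                      stripe_index=1,
                                      stripe_size=4194304, stripe_count=4)
print(pieces[0])
```

With the 4-stripe layout shown earlier, the first piece starts at file offset 5242880 (4 MiB for the stripe position plus 1 MiB into the stripe) and is 3145728 bytes long; the remaining pieces are full 4 MiB stripes spaced 16 MiB apart in the file.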
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Lustre Technical Lead
>> Oracle Corporation Canada Inc.
>>
>>
>
> --
> Brad Settlemyer
> Research Associate
> Oak Ridge National Laboratory