[Lustre-discuss] "obdidx" ordering in "lfs getstripe"

Tue Feb 14 06:07:49 PST 2012

On Feb 14, 2012, at 6:51 AM, Jack David wrote:

> On Tue, Feb 14, 2012 at 6:57 PM, Kevin Van Maren <KVanMaren at fusionio.com> wrote:
>> On Feb 14, 2012, at 12:13 AM, Jack David wrote:
>> 
>>> On Thu, Feb 9, 2012 at 8:18 PM, Andreas Dilger <adilger at whamcloud.com> wrote:
>>>> On 2012-02-09, at 6:20 AM, Jack David wrote:
>>>>> In the output of "lfs getstripe <filename> | <dirname>", the obdidx
>>>>> denotes the OST index (I assume).
>>>>> 
>>>>> Consider the following output:
>>>>> 
>>>>> lmm_stripe_count:   2
>>>>> lmm_stripe_size:    1048576
>>>>> lmm_stripe_offset:  1
>>>>>       obdidx           objid          objid            group
>>>>>            1               2            0x2                0
>>>>>            0               3            0x3                0
>>>>> 
>>>>> where I have a setup consisting of two OSTs. If I have more than two
>>>>> OSTs, is it possible that I get the obdidx values out of order? Or will
>>>>> the obdidx values always be linear?
>>>>> 
>>>>> For example, in above output, the values are linear (like 1, 0 - and
>>>>> this pattern will be repeated while storing the data I assume). If I
>>>>> have 4 OSTs, can the values be non-linear? Something like 2,0,1,3 or
>>>>> 2,1,3,0 (or any pattern for that matter)??
>>>> 
>>>> Typically the ordering will be linear, but this depends on a number of
>>>> different factors:
>>>> - what order the OSTs were created in:  without --index=N the OST order
>>>>  depends on the order in which they were first mounted, so using --index
>>>>  is always recommended, and will be mandatory in the future
>>>> - the distribution of OSTs among OSS nodes:  the MDS object allocator
>>>>  will normally select one OST from each OSS before allocating another
>>>>  object from a different OST on the same OSS
>>> 
>>> Thanks for this information.
>>> 
>>>> - the space available on each OST:  when OST free space is imbalanced
>>>>  the OSTs will be selected in part based on how full they are
>>> 
>>> I have a question here. Let's say I have 4 OSTs, but the Lustre client is
>>> issuing a write request which can be accommodated by any single OST
>>> (e.g. the write request is 512 bytes and stripe_size is 1MB). In this
>>> case, how will the data be stored? Will the MDS maintain the index of
>>> the next OST which should serve the request?
>> 
>> 
>> I think you are still confused about how it works.  The OSTs are selected
>> _when the file is created_.  The striping is a static map of offset to OST.
>> For example, if the stripe count = 2, and the stripe size = 1MB, then
>> 0-1MB goes to the first OST, 1-2MB goes to the second, 2-3MB goes to the first, etc.
>> 
> I understand that, but I am curious: does the Lustre client keep
> track of which is the _next_ OST where the IO request should go? I

No, it does not track the "next", as that depends on the file offset.  For example,
with the 2-OST stripe example in my previous email, if the client writes 0-1MB,
2-3MB, and 4-5MB, all the data will be written to a single OST.
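
To make the offset-to-stripe mapping concrete, here is a minimal Python sketch
of the round-robin layout described above. The function name stripe_index and
the constants are illustrative assumptions only, not Lustre APIs; the returned
value is a position in the file's object list, not an OST index.

```python
MB = 1 << 20  # 1 MB, the stripe size used in the example above

def stripe_index(offset, stripe_size=MB, stripe_count=2):
    """Return the 0-based position in the file's object list that holds a byte offset."""
    return (offset // stripe_size) % stripe_count

# Writes starting at 0MB, 2MB and 4MB all land on the first stripe.
print([stripe_index(o) for o in (0, 2 * MB, 4 * MB)])

# Sequential 512-byte writes stay on one stripe until a 1MB boundary is crossed.
print({stripe_index(i * 512) for i in range(2048)})
```

Nothing here depends on which write came before: the result is a function of
the offset and the layout only.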

> am not sure who decides the stripe_size at the time of file
> creation (the default is 1MB, from the lfs setstripe man page), so I
> assume the client is not bothered about that. But if the client is
> generating write requests which are not a multiple of stripe_size,
> multiple write requests can be stored on one OST (e.g. if the stripe
> size is 1MB, then 20 requests of 512 bytes can be stored on OST1, the
> next 20 requests on OST2, and so on).

1MB is the built-in default, but the actual default can vary from system to system.

The file stripe is determined when the file is created.  "lfs setstripe" can
be used to create a file with a specified striping.

"lfs setstripe" can also be used to change the striping for a directory, which is
quite useful as that determines the default stripe for any files created in
that directory (including directories!)

When the client opens a file, the MDT returns the stripe information to the
client so that the client knows how to map file offsets to OST objects (and
the offset in that object).  It is the client's job (inside Lustre so it is automatic)
to figure out how to map a read/write to the server/ost/object/offset.
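
As a rough illustration of that client-side mapping, the Python sketch below
splits a file extent into per-object pieces. It assumes a plain round-robin
layout; map_extent and its tuple format are hypothetical names made up for this
example, not the actual Lustre client code.

```python
MB = 1 << 20

def map_extent(offset, length, stripe_size=MB, stripe_count=2):
    """Split a file extent into (stripe_index, object_offset, chunk_length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        stripe_no = offset // stripe_size   # which stripe-sized chunk of the file
        within = offset % stripe_size       # position inside that chunk
        chunk = min(stripe_size - within, end - offset)
        idx = stripe_no % stripe_count      # position in the file's object list
        # Each object stores only its own chunks, packed back to back.
        obj_off = (stripe_no // stripe_count) * stripe_size + within
        pieces.append((idx, obj_off, chunk))
        offset += chunk
    return pieces

# A 1.5MB write starting at file offset 512KB touches both objects.
for piece in map_extent(512 * 1024, 3 * MB // 2):
    print(piece)
```

The stripe_index in each piece would then be looked up in the stripe
information returned at open time to find the OST and object to talk to.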

Kevin

> Actually I am trying to understand how can I leverage the pNFS file
> layout semantics (which communicates to Data Servers directly once the
> layout is supplied by Meta Data Server) with Lustre Filesystem, and
> that is the source of such questions.
> 
>> The free space impacts _which_ OSTs are selected when a file is created,
>> it does NOT impact where data is written once a file is created.  So if an OST
>> fills up, every file that resides on that OST will be unable to grow if the growth is
>> to an offset that maps to that OST.
>> 
> 
> Good to know that.
> 
>> Kevin
>> 
>> 
>> Confidentiality Notice: This e-mail message, its contents and any attachments to it are confidential to the intended recipient, and may contain information that is privileged and/or exempt from disclosure under applicable law. If you are not the intended recipient, please immediately notify the sender and destroy the original e-mail message and any attachments (and any copies that may have been made) from your system or otherwise. Any unauthorized use, copying, disclosure or distribution of this information is strictly prohibited.  Email addresses that end with a "-c" identify the sender as a Fusion-io contractor.
> 
> 
> 
> -- 
> J
