[Lustre-discuss] how the lustre distribute data among disks within one OST

Dilger, Andreas andreas.dilger at intel.com
Fri Jun 14 14:47:30 PDT 2013

On 2013/13/06 6:36 PM, "Jaln" <valiantljk at gmail.com> wrote:

>Thank you Chris, I'm sort of clear now.
>In my question, stripe 0,4 means one process wants to access stripe 0 and
>4 at the same time.
>there is another process wants to access both stripe 0 and 2,

Just to clarify the Lustre terminology here, if there are only 2 OSTs
there will only be two stripes, with index "0" and "1" (each with an
object ID), one on each OST.  In your case, each one will be an object
3MB in size.
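
As a rough sketch of that layout (not Lustre's actual code), the
round-robin mapping from a file offset to a stripe index and an offset
inside that stripe's object can be written as below.  The function name
locate() and its defaults (1MB stripe size, 2 stripes, taken from the
example in this thread) are only for illustration.

```python
MiB = 1 << 20  # bytes in 1MB, the stripe size used in this thread's example


def locate(file_offset, stripe_size=MiB, stripe_count=2):
    """Map a file offset to (stripe index, offset inside that stripe's object).

    Illustrative only: plain round-robin (RAID-0 style) arithmetic,
    not code taken from Lustre itself.
    """
    chunk = file_offset // stripe_size      # which stripe_size-sized chunk of the file
    stripe_index = chunk % stripe_count     # chunks alternate between the objects
    # Full rounds already written to this object, plus the offset within the chunk.
    object_offset = (chunk // stripe_count) * stripe_size + file_offset % stripe_size
    return stripe_index, object_offset


if __name__ == "__main__":
    for mb in range(6):  # the 6MB file from the example
        idx, off = locate(mb * MiB)
        print(f"file MB {mb} -> stripe {idx}, object offset {off // MiB} MB")
```

With these assumptions, file megabytes 0, 2 and 4 land at offsets 0, 1
and 2 MB of the object with stripe index 0, so 0 and 2 are adjacent
inside that object while 0 and 4 have a 1MB gap between them.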

>even though stripe 0, 2, 4 are in the same place (one file),
>but their offsets are different, i.e., 0 and 2 are contiguous,
>while from 0 to 4 there is a gap.

Right, this is no different than an application reading from megabytes 0,1
or 0,2 from a local disk filesystem.  There will be a seek in the middle,
unless client, OSS, or RAID/disk decide to do readahead on the file or
object.  If the file is <= 2MB in size (llite.*.max_read_ahead_whole_mb
tunable), Lustre will just prefetch the whole file on first access.

>So my concern is, will the two processes have different I/O cost?
>In other words, accessing 0 and 4 would take longer time than accessing 0
>and 2.

Sure, one seek per MB accessed (<= 10ms), but this is relatively close
to the network transfer time (10ms per MB for 1GigE, 1ms per MB for
10GigE), and this can all be pipelined by the client, which keeps up to
8 RPCs in flight for each OST.
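
A back-of-envelope version of that arithmetic, using only the figures
quoted above (10ms per seek as an upper bound, 10ms or 1ms per MB on the
wire).  The two helper functions and the "overlapped" model are
simplifying assumptions for illustration, not a description of how the
client actually schedules RPCs.

```python
SEEK_MS = 10.0                                   # upper bound per seek, from above
NET_MS_PER_MB = {"1GigE": 10.0, "10GigE": 1.0}   # wire time per MB, from above


def serial_ms(chunks, seeks, net_ms):
    """One request at a time: every seek and every transfer is paid in sequence."""
    return seeks * SEEK_MS + chunks * net_ms


def overlapped_ms(chunks, seeks, net_ms):
    """Idealized pipeline (assumption): seeks overlap transfers, slower stage wins."""
    return max(seeks * SEEK_MS, chunks * net_ms)


if __name__ == "__main__":
    # Two 1MB chunks from one object: adjacent (file MB 0,2) needs one seek,
    # separated by a gap (file MB 0,4) needs two.
    for net, net_ms in NET_MS_PER_MB.items():
        for label, seeks in (("0,2", 1), ("0,4", 2)):
            print(f"{net} stripes {label}: "
                  f"serial {serial_ms(2, seeks, net_ms)} ms, "
                  f"overlapped {overlapped_ms(2, seeks, net_ms)} ms")
```

Under these assumptions the extra seek costs at most about 10ms when
nothing overlaps, which is the same order as moving the data over 1GigE.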

Cheers, Andreas

>On Thu, Jun 13, 2013 at 5:23 PM, Christopher J. Morrone
><morrone2 at llnl.gov> wrote:
>In that case, it is the question part that I do not understand. :)  What
>is "stripe 0,4", why could it be "closer" than "stripe 0,2"?  In your
>example, 0, 2, and 4 are all in the same place.
>If your file is striped over 2 OSTs, then essentially what happens behind
>the scenes is that there are two files, one on each OST.  But Lustre
>hides that from you, as a user.  Lustre basically does modulo operations
>to translate a file offset from the file that it presents to the user
>into which OST and offset into said OST's file to use.
>Does that help at all?
>On 06/13/2013 02:58 PM, Jaln wrote:
>Oh, I mean there is one file, for example 6 MB, the stripe size is 1MB,
>and only 2 OSTs,
>then the file will be divided into 6 stripes, denoted as stripe 0 to 5;
>the distribution on the 2 OSTs would be stripe 0,2,4 on OST0, stripe
>1,3,5 on OST1.
>On Thu, Jun 13, 2013 at 2:54 PM, Christopher J. Morrone
><morrone2 at llnl.gov> wrote:
>    I think you may be confused about what a stripe is in Lustre.  If
>    there are only 2 OST, then you can only stripe a file across 2.
>    Or maybe I don't understand your terminology.  I don't know what you
>    mean by "0,4" and "0,2".
>    On 06/13/2013 02:38 PM, Jaln wrote:
>        if I have 6 stripes, 2 OST, using round-robin striping,
>        stripe 0,2,4 will be on OST0,
>        stripe 1,3,5 will be on OST1,
>        Do you guys have any idea about what will be the difference of
>        accessing stripe 0,4 vs stripe 0,2?
>        stripe 0, 2 seems to be closer than 0,4, or the lustre will do
>        some intelligent work?
>        Jaln
>        On Thu, Jun 13, 2013 at 10:22 AM, Christopher J. Morrone
>        <morrone2 at llnl.gov> wrote:
>             On 06/13/2013 05:19 AM, E.S. Rosenberg wrote:
>              > On Thu, Jun 13, 2013 at 3:09 AM, Christopher J. Morrone
>              > <morrone2 at llnl.gov> wrote:
>              >> Lustre does not manage the individual disks.  It sits
>              >> on top of a filesystem, either ldiskfs (basically ext4)
>              >> or zfs (as of Lustre 2.4).
>              > Is ZFS the recommended fs, or just an option?
>              > Doesn't ZFS suffer major performance drawbacks on linux
>              > due to it living in userspace?
>              > Thanks,
>              > Eli
>             LLNL (Brian Behlendorf) ported ZFS natively to Linux.  We
>             are not using the FUSE (userspace) version.  You can find
>             it at: http://zfsonlinux.org
>             ZFS is one of the two backend filesystem options for
>             Lustre, as of Lustre 2.4.  2.4 is the first Lustre release
>             that fully supports using ZFS.  Here at LLNL we are using
>             it on our newest, and largest at 55PB, filesystem.
>             Chris

Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division

