[Lustre-discuss] how Lustre distributes data among disks within one OST

Dilger, Andreas andreas.dilger at intel.com
Mon Jun 17 16:18:38 PDT 2013


On 2013/16/06 12:02 AM, "Jaln" <valiantljk at gmail.com> wrote:

>Hi Andreas,
>Thanks a lot,
>>this can all be pipelined by the client, which sends up to 8 RPCs
>>concurrently
>>for each OST.
>
>Can you please explain a little bit why "this can all be pipelined by
>the client"?
>How does the client pipeline it?
>Do you mean pipelining the multiple processes?

The RPC service on the client node will send up to 8 write RPCs
asynchronously before blocking and waiting for a reply.  This allows even
single-threaded applications to get reasonable IO performance, though
better performance can still be seen with multiple userspace threads on
the client.  The reason is that copy_from_user() in the kernel becomes
CPU-bound copying the data from userspace into the kernel buffers.  Using
O_DIRECT avoids this data copy, but introduces a separate issue: O_DIRECT
requires that the data not be buffered, which Lustre takes to mean
"synced to disk on the server" so that it is safe in the face of a crash
of either client or server.  As a result, O_DIRECT is not faster unless
the client does very large writes.
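
For reference, here is a minimal sketch in C of what that O_DIRECT write
path looks like from the application side.  The mount point, file name,
1 MiB alignment, and 64 MiB write size are illustrative assumptions, not
Lustre requirements:

    #define _GNU_SOURCE                 /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        size_t align = 1 << 20;         /* assumed 1 MiB alignment */
        size_t size = 64UL << 20;       /* a large write, so the sync cost
                                         * on the server is amortized */
        void *buf;
        int fd;

        /* O_DIRECT bypasses the client page cache, so there is no
         * copy_from_user() into kernel buffers, but the write does not
         * complete until the data is safe on the server. */
        fd = open("/mnt/lustre/testfile",
                  O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* O_DIRECT buffers must be suitably aligned. */
        if (posix_memalign(&buf, align, size)) {
            perror("posix_memalign");
            close(fd);
            return 1;
        }
        memset(buf, 0xab, size);

        if (write(fd, buf, size) != (ssize_t)size)
            perror("write");

        free(buf);
        close(fd);
        return 0;
    }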

Cheers, Andreas

>On Fri, Jun 14, 2013 at 2:47 PM, Dilger, Andreas
><andreas.dilger at intel.com> wrote:
>
>On 2013/13/06 6:36 PM, "Jaln" <valiantljk at gmail.com> wrote:
>
>>Thank you Chris, I'm sort of clear now.
>>In my question, "stripe 0,4" means one process wants to access stripes 0
>>and 4 at the same time, and another process wants to access both stripes
>>0 and 2.
>
>
>Just to clarify the Lustre terminology here, if there are only 2 OSTs
>involved, there will only be two stripes, with index "0" and "1" (each
>with an arbitrary object ID), one on each OST.  In your case, each one
>will be an object of 3MB in size.
>
>>Even though stripes 0, 2, 4 are in the same place (one file), their
>>offsets are different, i.e., 0 and 2 are contiguous, while from 0 to 4
>>there is a gap.
>
>
>Right, this is no different from an application reading megabytes 0,1 or
>0,2 of a file on a local disk filesystem.  There will be a seek in the
>middle, unless the client, OSS, or RAID/disk decides to do readahead on
>the file or object.  If the file is <= 2MB in size (the
>llite.*.max_read_ahead_whole_mb tunable), Lustre will just prefetch the
>whole file on first access.
>
>>So my concern is, will the two processes have different I/O costs?
>>In other words, accessing 0 and 4 would take longer than accessing 0
>>and 2.
>
>
>Sure, one seek per MB accessed (<= 10ms), but this is close to the
>network transfer time (10ms per MB for 1GigE, 1ms per MB for 10GigE), and
>this can all be pipelined by the client, which sends up to 8 RPCs
>concurrently for each OST.
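
As a rough back-of-the-envelope check of those per-MB numbers (assuming
roughly 125 MB/s of raw bandwidth for 1GigE and 1250 MB/s for 10GigE,
before protocol overhead):

    1GigE:   1 MB / 125 MB/s  = 8 ms per MB   (~10 ms/MB with overhead)
    10GigE:  1 MB / 1250 MB/s = 0.8 ms per MB (~1 ms/MB with overhead)
    seek:    <= 10 ms per additional extent accessed

so one extra seek costs about the same as transferring a single MB over
1GigE, and the 8 RPCs in flight per OST hide much of both.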
>
>Cheers, Andreas
>
>>On Thu, Jun 13, 2013 at 5:23 PM, Christopher J. Morrone
>><morrone2 at llnl.gov> wrote:
>>
>>In that case, it is the question part that I do not understand. :)  What
>>is "stripe 0,4", why could it be "closer" then "stripe 0,2"?  In your
>>example, 0, 2, and 4 are all in the same place.
>>
>>If your file is striped over 2 OSTs, then essentially what happens behind
>>the scenes is that there are two files, one on each OST.  But Lustre
>>hides that from you, as a user.  Lustre basically does modulo arithmetic
>>to translate an offset in the file it presents to the user into which
>>OST to use and the offset within that OST's object.
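
A minimal sketch of that mapping in C, assuming a plain round-robin
(RAID-0 style) layout with a fixed stripe size; the function and variable
names are made up for illustration, and this is not actual Lustre code:

    #include <stdio.h>

    /* Map a logical file offset to (stripe index, offset within the OST
     * object) for a round-robin layout where each stripe_size chunk goes
     * to the next OST in turn. */
    static void map_offset(long long off, long long stripe_size,
                           int stripe_count, int *stripe_idx,
                           long long *obj_off)
    {
        long long chunk = off / stripe_size;    /* which chunk of the file */

        *stripe_idx = chunk % stripe_count;     /* which OST object */
        *obj_off = (chunk / stripe_count) * stripe_size
                   + off % stripe_size;         /* offset in that object */
    }

    int main(void)
    {
        const long long mb = 1 << 20;
        int idx;
        long long ooff;

        /* 6 MB file, 1 MB stripe size, 2 OSTs, as in the example
         * discussed in this thread. */
        for (long long off = 0; off < 6 * mb; off += mb) {
            map_offset(off, mb, 2, &idx, &ooff);
            printf("file MB %lld -> stripe %d, object MB %lld\n",
                   off / mb, idx, ooff / mb);
        }
        return 0;
    }

This prints file MB 0,2,4 landing in stripe/object 0 and MB 1,3,5 in
stripe/object 1, matching the 6 MB / 2 OST example in this thread.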
>>
>>Does that help at all?
>>
>>Chris
>>
>>
>>On 06/13/2013 02:58 PM, Jaln wrote:
>>
>>Oh, I mean there is one file, for example 6 MB, the stripe size is 1MB,
>>and only 2 OSTs; then the file will be divided into 6 stripes, denoted as
>>stripes 0,1,2,3,4,5.
>>The distribution on the 2 OSTs would be stripes 0,2,4 on OST0 and stripes
>>1,3,5 on OST1.
>>
>>Jaln
>>
>>
>>On Thu, Jun 13, 2013 at 2:54 PM, Christopher J. Morrone
>>
>><morrone2 at llnl.gov <mailto:morrone2 at llnl.gov>> wrote:
>>
>>    I think you may be confused about what a stripe is in Lustre.  If
>>    there are only 2 OSTs, then you can only stripe a file across 2.
>>
>>    Or maybe I don't understand your terminology.  I don't know what you
>>    mean by "0,4" and "0,2".
>>
>>
>>    On 06/13/2013 02:38 PM, Jaln wrote:
>>
>>        If I have 6 stripes, 2 OSTs, using round-robin striping,
>>        stripes 0,2,4 will be on OST0,
>>        stripes 1,3,5 will be on OST1.
>>        Do you guys have any idea what the difference would be between
>>        accessing stripes 0,4 vs stripes 0,2?
>>        Stripes 0,2 seem to be closer than 0,4, or will Lustre do
>>        some intelligent work?
>>
>>        Jaln
>>
>>
>>        On Thu, Jun 13, 2013 at 10:22 AM, Christopher J. Morrone
>>        <morrone2 at llnl.gov <mailto:morrone2 at llnl.gov>
>>
>>        <mailto:morrone2 at llnl.gov <mailto:morrone2 at llnl.gov>>> wrote:
>>
>>             On 06/13/2013 05:19 AM, E.S. Rosenberg wrote:
>>              > On Thu, Jun 13, 2013 at 3:09 AM, Christopher J. Morrone
>>              > <morrone2 at llnl.gov <mailto:morrone2 at llnl.gov>
>>
>>        <mailto:morrone2 at llnl.gov <mailto:morrone2 at llnl.gov>>> wrote:
>>              >> Lustre does not manage the individual disks.  It sits
>>        on top of a
>>              >> filesystem, either ldiskfs (basically ext4) or zfs (as
>>        of Lustre
>>             2.4).
>>              > Is ZFS the recommended fs, or just an option?
>>              > Doesn't ZFS suffer major performance drawbacks on Linux
>>        due to it
>>              > living in userspace?
>>              > Thanks,
>>              > Eli
>>
>>             LLNL (Brian Behlendorf) ported ZFS natively to Linux.  We
>>        are not using
>>             the FUSE (userspace) version.  You can find it at:
>>
>>        http://zfsonlinux.org
>>
>>             ZFS is one of the two backend filesystem options for
>>        Lustre, as of
>>             Lustre 2.4.  2.4 is the first Lustre release that fully
>>        supports using
>>             ZFS.  Here at LLNL we are using it on our newest, and
>>        largest at 55PB,
>>             filesystem.
>>
>>             Chris

Cheers, Andreas
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division




