[Lustre-discuss] how the lustre distribute data among disks within one OST
Dilger, Andreas
andreas.dilger at intel.com
Mon Jun 17 16:18:38 PDT 2013
On 2013/06/16 12:02 AM, "Jaln" <valiantljk at gmail.com> wrote:
>Hi Andreas,
>Thanks a lot,
>>this can all be pipelined by the client, which sends up to 8 RPCs
>>concurrently
>>for each OST.
>
>Can you please explain a little bit why "this can all be pipelined by
>the client"?
>How does the client pipeline it?
>Do you mean pipelining across multiple processes?
The RPC service on the client node will send up to 8 write RPCs
asynchronously before blocking and waiting for a reply. This allows even
single-threaded applications to have reasonable IO performance, though
still better performance can be seen with multiple userspace threads on
the client. The reason is that copy_from_user() in the kernel becomes
CPU-bound copying the data from userspace to the kernel buffers. Using
O_DIRECT avoids this data copy, but introduces a separate issue: O_DIRECT
requires data not to be buffered, which Lustre takes to mean "sync'd to
disk on the server" so that it is safe in the face of a crash of either
client or server. As a result, O_DIRECT is not faster unless very large
writes are done by the client.
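The pipelining can be sketched in userspace as a toy model (this is not Lustre code: the class, the thread usage, and the 0.01s "network" delay are all illustrative; only the limit of 8 in-flight RPCs per OST comes from the discussion above):

```python
# Toy model of a client pipelining RPCs: up to 8 requests may be in
# flight before the sender blocks waiting for a free slot.
import threading
import time

MAX_RPCS_IN_FLIGHT = 8  # the per-OST limit mentioned in the thread

class PipelinedSender:
    def __init__(self, max_in_flight=MAX_RPCS_IN_FLIGHT):
        self._slots = threading.Semaphore(max_in_flight)
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak_in_flight = 0

    def send(self, rpc):
        """Block only when all slots are busy, then dispatch asynchronously."""
        self._slots.acquire()
        with self._lock:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        threading.Thread(target=self._complete, args=(rpc,)).start()

    def _complete(self, rpc):
        time.sleep(0.01)  # stand-in for network + server processing time
        with self._lock:
            self.in_flight -= 1
        self._slots.release()

sender = PipelinedSender()
for i in range(32):          # a single application thread issuing writes
    sender.send(("write", i))
while sender.in_flight:      # drain outstanding "RPCs"
    time.sleep(0.005)
```

Even with one issuing thread, up to 8 requests overlap on the "wire", which is why a single-threaded application still sees reasonable throughput.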
Cheers, Andreas
>On Fri, Jun 14, 2013 at 2:47 PM, Dilger, Andreas
><andreas.dilger at intel.com> wrote:
>
>On 2013/06/13 6:36 PM, "Jaln" <valiantljk at gmail.com> wrote:
>
>>Thank you Chris, I'm sort of clear now.
>>In my question, "stripe 0,4" means one process wants to access stripes
>>0 and 4 at the same time, and another process wants to access both
>>stripes 0 and 2.
>
>
>Just to clarify the Lustre terminology here, if there are only 2 OSTs
>involved, there will only be two stripes, with index "0" and "1" (each
>with an arbitrary object ID), one on each OST. In your case, each one
>will be an object of 3MB in size.
>
>>even though stripe 0, 2, 4 are in the same place (one file),
>>but their offsets are different, i.e., 0 and 2 are contiguous,
>>while from 0 to 4 there is a gap.
>
>
>Right, this is no different than an application reading from megabytes
>0,1 or 0,2 of a local disk filesystem. There will be a seek in the
>middle, unless the client, OSS, or RAID/disk decides to do readahead on
>the file or object. If the file is <= 2MB in size (the
>llite.*.max_read_ahead_whole_mb tunable), Lustre will just prefetch the
>whole file on first access.
>
>>So my concern is, will the two processes have different I/O costs?
>>In other words, would accessing 0 and 4 take longer than accessing 0
>>and 2?
>
>
>Sure, one seek per MB accessed (<= 10ms), but this is relatively close
>compared to the network transfer time (10ms per MB for 1GigE, 1ms per
>MB for 10GigE), and this can all be pipelined by the client, which
>sends up to 8 RPCs concurrently for each OST.
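The transfer-time figures quoted above can be sanity-checked with simple arithmetic (ignoring protocol overhead and link efficiency, so the results land slightly below Andreas's round numbers):

```python
# Rough wire time for 1 MiB at a given link rate, ignoring all overhead.
MB_BITS = 8 * 2**20  # bits in one MiB

def ms_per_mb(rate_bits_per_sec):
    return MB_BITS / rate_bits_per_sec * 1000.0

print(f"1GigE:  ~{ms_per_mb(1e9):.1f} ms per MB")   # roughly the 10ms figure
print(f"10GigE: ~{ms_per_mb(1e10):.1f} ms per MB")  # roughly the 1ms figure
```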
>
>Cheers, Andreas
>
>>On Thu, Jun 13, 2013 at 5:23 PM, Christopher J. Morrone
>><morrone2 at llnl.gov> wrote:
>>
>>In that case, it is the question part that I do not understand. :) What
>>is "stripe 0,4", and why could it be "closer" than "stripe 0,2"? In
>>your example, 0, 2, and 4 are all in the same place.
>>
>>If your file is striped over 2 OSTs, then essentially what happens
>>behind the scenes is that there are two files, one on each OST. But
>>Lustre hides that from you, as a user. Lustre basically does modulo
>>operations to translate an offset in the file it presents to the user
>>into which OST to use and an offset into that OST's file.
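The modulo translation Chris describes can be sketched as follows (the function and variable names are illustrative, not Lustre's actual API):

```python
# Map a logical file offset to (stripe index, i.e. which OST object,
# offset within that object) for a round-robin striped file.
def map_offset(file_offset, stripe_size, stripe_count):
    stripe_number = file_offset // stripe_size    # which chunk of the file
    stripe_index = stripe_number % stripe_count   # which OST object it's on
    # Offset inside that object: one stripe_size for each full round-robin
    # pass already stored there, plus the remainder within this chunk.
    object_offset = (stripe_number // stripe_count) * stripe_size \
        + file_offset % stripe_size
    return stripe_index, object_offset

MB = 1 << 20
# The 6MB file / 1MB stripes / 2 OSTs example from this thread:
# chunk 4 (bytes 4MB..5MB) lands on OST object 0, at object offset 2MB.
assert map_offset(4 * MB, MB, 2) == (0, 2 * MB)
```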
>>
>>Does that help at all?
>>
>>Chris
>>
>>
>>On 06/13/2013 02:58 PM, Jaln wrote:
>>
>>Oh, I mean there is one file, for example 6MB, the stripe size is 1MB,
>>and there are only 2 OSTs.
>>Then the file will be divided into 6 stripes, denoted as stripes
>>0,1,2,3,4,5.
>>The distribution on the 2 OSTs would be stripes 0,2,4 on OST0 and
>>stripes 1,3,5 on OST1.
>>
>>Jaln
>>
>>
>>On Thu, Jun 13, 2013 at 2:54 PM, Christopher J. Morrone
>><morrone2 at llnl.gov> wrote:
>>
>> I think you may be confused about what a stripe is in Lustre. If
>> there are only 2 OSTs, then you can only stripe a file across 2.
>>
>> Or maybe I don't understand your terminology. I don't know what you
>> mean by "0,4" and "0,2".
>>
>>
>> On 06/13/2013 02:38 PM, Jaln wrote:
>>
>> if I have 6 stripes, 2 OSTs, using round-robin striping,
>> stripes 0,2,4 will be on OST0,
>> stripes 1,3,5 will be on OST1.
>> Do you guys have any idea about what the difference would be between
>> accessing stripes 0,4 vs stripes 0,2?
>> Stripes 0,2 seem to be closer than 0,4, or will Lustre do some
>> intelligent work?
>>
>> Jaln
>>
>>
>> On Thu, Jun 13, 2013 at 10:22 AM, Christopher J. Morrone
>> <morrone2 at llnl.gov> wrote:
>>
>> On 06/13/2013 05:19 AM, E.S. Rosenberg wrote:
>> > On Thu, Jun 13, 2013 at 3:09 AM, Christopher J. Morrone
>> > <morrone2 at llnl.gov> wrote:
>> >> Lustre does not manage the individual disks. It sits on top of
>> >> a filesystem, either ldiskfs (basically ext4) or zfs (as of
>> >> Lustre 2.4).
>> > Is ZFS the recommended fs, or just an option?
>> > Doesn't ZFS suffer major performance drawbacks on Linux due to
>> > living in userspace?
>> > Thanks,
>> > Eli
>>
>> LLNL (Brian Behlendorf) ported ZFS natively to Linux. We are not
>> using the FUSE (userspace) version. You can find it at:
>>
>> http://zfsonlinux.org
>>
>> ZFS is one of the two backend filesystem options for Lustre, as of
>> Lustre 2.4. 2.4 is the first Lustre release that fully supports
>> using ZFS. Here at LLNL we are using it on our newest, and largest
>> at 55PB, filesystem.
>>
>> Chris
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division