[lustre-discuss] Stripe size for osts

Mon Mar 21 18:23:32 PDT 2016

On Mar 21, 2016, at 15:50, Pawel Dziekonski <dzieko at wcss.pl> wrote:
> 
>> On Mon, 21 Mar 2016 at 09:24:02 +0000, Dilger, Andreas wrote:
>>> On 2016/03/18, 12:52, "Kurt Strosahl" <strosahl at jlab.org> wrote:
>>> Good Afternoon,
>>> 
>>>   I'm experimenting with ost configurations geared more towards small
>>> files and operations on those small files (like source code, and
>>> compiling), and I was wondering about changing the stripe size so that
>>> small files fit more efficiently on an ost.  I believe that would be the
>>> --param lov.stripesize=XX option for mkfs.lustre, is that correct?  And
>>> is there a lower limit that I should know about?
>> 
>> Just to clarify, the stripe size for Lustre is not a property of the OST,
>> but rather a property of each file.  The OST itself allocates space
>> internally as it sees fit.  For ldiskfs space allocation is done in units
>> of 4KB blocks managed in extents, while ZFS has variable block sizes (512
>> bytes up to 1MB or more, but only one block size per file) managed in a
>> tree.  In both cases, if a file is sparse then no blocks are allocated for
>> the holes in the file.
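
A minimal sketch of the ldiskfs case described above, assuming plain rounding up to 4KB blocks and ignoring metadata, preallocation, and extent overhead. The helper name `ldiskfs_allocated_bytes` is hypothetical and not part of any Lustre API. Note that the stripe size does not appear anywhere in the calculation.

```python
BLOCK_SIZE = 4 * 1024  # ldiskfs block size quoted in the message above


def ldiskfs_allocated_bytes(file_size: int) -> int:
    """Data bytes allocated for a non-sparse file, rounded up to whole 4KB blocks.

    Assumption: simple block rounding only; real allocators may preallocate more.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    blocks = -(-file_size // BLOCK_SIZE)  # ceiling division on integers
    return blocks * BLOCK_SIZE


if __name__ == "__main__":
    for size in (1, 4096, 4097, 10_000):
        print(size, "->", ldiskfs_allocated_bytes(size))
```
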
>> 
>> As for the minimum stripe size, this is 64KB, since it isn't possible to
>> have a stripe size below the PAGE_SIZE on the client, and some
>> architectures (e.g. IA64, PowerPC, Alpha) allowed 64KB PAGE_SIZE.
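
As a rough illustration of the 64KB lower limit, the check below accepts a stripe size only if it is at least 64KB. The additional multiple-of-64KB condition is an assumption on my part about how the tools validate the value, not something stated in this thread, and `is_valid_stripe_size` is a hypothetical helper name.

```python
MIN_STRIPE_SIZE = 64 * 1024  # minimum stripe size given in the message above


def is_valid_stripe_size(stripe_size: int) -> bool:
    """Return True if stripe_size satisfies the 64KB minimum.

    Assumption (not from the thread): the size must also be a whole
    multiple of 64KB.
    """
    return stripe_size >= MIN_STRIPE_SIZE and stripe_size % MIN_STRIPE_SIZE == 0


if __name__ == "__main__":
    for size in (4 * 1024, 64 * 1024, 96 * 1024, 1024 * 1024):
        print(size, is_valid_stripe_size(size))
```
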
>> 
>> For small files, the stripe_size parameter is virtually meaningless, since
>> the data will never exceed a single stripe in size.  What is much more
>> important is to use a stripe_count=1, so that the client doesn't have to
>> query multiple OSTs to determine the file size, timestamps, and other
>> attributes.
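
The point about small files can be sketched with the usual round-robin (RAID-0 style) striping arithmetic. This is an illustrative model only; the function names are hypothetical and the code does not talk to Lustre. It shows that a file smaller than one stripe keeps all of its data in the first object whatever the stripe size is. With a stripe count above 1 the file still has objects on the other OSTs, which is why the client has to query them for size and timestamps.

```python
def stripe_location(offset: int, stripe_size: int, stripe_count: int) -> tuple[int, int]:
    """Map a file byte offset to (object index, byte offset inside that object)."""
    stripe_number = offset // stripe_size
    obj_index = stripe_number % stripe_count
    obj_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return obj_index, obj_offset


def objects_with_data(file_size: int, stripe_size: int, stripe_count: int) -> int:
    """Number of objects that actually hold data for a non-sparse file."""
    if file_size == 0:
        return 0
    stripes = -(-file_size // stripe_size)  # ceiling division
    return min(stripes, stripe_count)


if __name__ == "__main__":
    KB, MB = 1024, 1024 * 1024
    # A 20KB source file: one object holds all the data for either stripe size.
    print(objects_with_data(20 * KB, 64 * KB, 4))
    print(objects_with_data(20 * KB, 1 * MB, 4))
    # A 10MB file spreads over all four objects.
    print(objects_with_data(10 * MB, 1 * MB, 4))
```

On a real filesystem the stripe count is normally set per file or directory, for example with `lfs setstripe -c 1 <dir>`, rather than at mkfs time.
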
> 
> Andreas,
> 
> The default stripe size is 1MB. Is there a reason for that?
> P

Yes, because the underlying RAID hardware is usually configured as RAID-6 8+2 with a 1MB stripe width, so 1MB RPC writes avoid read-modify-write, and 1MB reads align properly with the allocation size that the filesystem used when the data was written.

Cheers, Andreas
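
The alignment argument in the reply above can be sketched as simple arithmetic. This assumes that "1MB stripe width" means the full data stripe across the 8 data disks (so 128KB per disk), that stripes start at offset 0 of the device, and that no controller cache merges partial writes. The names below are hypothetical and only model the reasoning.

```python
DATA_DISKS = 8                            # RAID-6 8+2: 8 data disks, 2 parity
STRIPE_WIDTH = 1024 * 1024                # full data stripe, per the reply above
CHUNK_SIZE = STRIPE_WIDTH // DATA_DISKS   # 128KB per data disk (assumed layout)


def needs_read_modify_write(offset: int, length: int) -> bool:
    """True if a write covers at least one RAID stripe only partially.

    A partial-stripe write forces the array to read old data or parity
    before it can compute the new parity.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return offset % STRIPE_WIDTH != 0 or length % STRIPE_WIDTH != 0


if __name__ == "__main__":
    MB = 1024 * 1024
    print(needs_read_modify_write(0, MB))          # aligned 1MB RPC
    print(needs_read_modify_write(0, 64 * 1024))   # small write
    print(needs_read_modify_write(512 * 1024, MB)) # 1MB write, misaligned start
```
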