[lustre-discuss] Overstriping setting
Andreas Dilger
adilger at thelustrecollective.com
Mon Dec 22 16:47:22 PST 2025
I don't think that using 3 stripes per OST is ever going to be
faster than using 3 separate OSTs, especially if the OSTs are
HDD based instead of flash. Even with NVMe OSTs, there is still
contention on the block device queue (elevator, queue depth, etc.).
With separate OSTs, there are more resources available that
can be leveraged with less contention. Consider DLM lock server
resources such as the DLM lock hash, or OST filesystem resources
like block allocators. With separate OSTs, those can be used
with less contention compared to having 3 objects sharing the
same resources.
Also, using more OSTs (when warranted) will distribute space
usage more evenly across devices.
That said, there is some benefit to potentially leaving a few
OSTs out of the allocation, if that aligns with the application.
That allows the MDS to skip OSTs that are full or busy, instead
of trying to always allocate objects from all of the OSTs.
However, there isn't an easy way to overstripe, say, 900 stripes
evenly across 300 of the 370 OSTs (3 stripes each), instead of
3 stripes on 160 of the 370 OSTs and 2 stripes on the other 210.
It _might_ be worth doing if it shows better performance, but I
think even then the uneven loading across all 370 OSTs would still
be better than only using 300 OSTs.
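
The 160/210 split for 900 stripes over 370 OSTs is just integer division with a remainder. A minimal C sketch of that arithmetic follows; the struct and function names are hypothetical, and this only models an even-as-possible spread, not the actual MDS allocator:

```c
#include <assert.h>
#include <stdio.h>

/* Result of spreading stripes as evenly as possible over a set of OSTs:
 * `heavy` OSTs carry base + 1 stripes, `light` OSTs carry base stripes. */
struct stripe_split {
    unsigned base;   /* stripes on each lightly loaded OST */
    unsigned heavy;  /* number of OSTs holding base + 1 stripes */
    unsigned light;  /* number of OSTs holding base stripes */
};

/* Hypothetical helper: quotient and remainder give the whole split. */
struct stripe_split split_stripes(unsigned stripes, unsigned osts)
{
    struct stripe_split s;

    assert(osts > 0);
    s.base = stripes / osts;
    s.heavy = stripes % osts;
    s.light = osts - s.heavy;
    return s;
}

/* Print the split in the same terms used in the paragraph above. */
void print_split(unsigned stripes, unsigned osts)
{
    struct stripe_split s = split_stripes(stripes, osts);

    printf("%u stripes over %u OSTs: %u OSTs x %u, %u OSTs x %u\n",
           stripes, osts, s.heavy, s.base + 1, s.light, s.base);
}
```

Calling split_stripes(900, 370) gives base 2 with 160 heavy and 210 light OSTs, while split_stripes(900, 300) gives base 3 with no heavy OSTs, which is the even layout that is hard to request today.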
Cheers, Andreas
On Dec 22, 2025, at 13:16, Wei-Keng Liao <wkliao at northwestern.edu> wrote:
>
> Hi, Andreas
>
> Sorry if I did not make my question clear in the first place.
>
> I am testing the overstriping feature and observed a decent performance
> improvement. Enabling overstriping on only a subset of OSTs is
> just an experiment of mine. I am thinking that for medium-size applications
> it may be better to use only a subset of OSTs than all of them. This
> is based on the complexity of network communication
> between the compute nodes and OSS nodes.
>
> For example, on Perlmutter at NERSC, there are a total of 370 OSTs.
> If an application runs on, say, 100 compute nodes with 128 MPI
> processes per node, I guess using 100 OSTs is a good number, and
> overstriping them with 3 stripes per OST would perform better
> than 300 OSTs with no overstriping. Will this be the case?
>
> I will also run some experiments there to see.
>
> Wei-keng
>
>> On Dec 22, 2025, at 1:29 PM, Andreas Dilger <adilger at thelustrecollective.com> wrote:
>>
>> Your first email was not clear that you are trying to overstripe
>> the file on a subset of OSTs. When the MDS is selecting the OSTs
>> for a file, it will always try to put each stripe on a different
>> OST if possible (subject to limitations of the OST pool and free
>> space on OSTs), before overstriping. There isn't any benefit to
>> overstriping a file when there are unused OSTs available, except
>> for synthetic test workloads. In your previous email thread you
>> mentioned the filesystem has 160 OSTs, so an 8-stripe file will
>> always prefer to use 8 different OSTs.
>>
>> Overstriping is not different than regular striping, in that you
>> either need to use an OST pool, or specify the OST indexes to
>> limit the allocation to a subset of OSTs.
>>
>> In your example, the "-C 8" is not more than the number of OSTs,
>> so the overstriping flag is cleared from the layout, since each
>> of the 8 stripes is on a different OST. This is true whether
>> you use "lfs setstripe" or "llapi_layout_*()" calls.
>>
>> Using "-c 4 -C 8" is not different than just "-C 8", since the
>> first stripe count is overwritten by the second stripe count.
>>
>> If this is just for testing bandwidth or similar, then it should
>> be enough to specify "-o M-N,M-N[,...]" for your tests. If there
>> is a good *production* reason to overstripe when there are more
>> OSTs available, then I would be interested to hear what that is.
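
For test scripts, the repeated "M-N" list passed to "-o" can be generated rather than typed by hand. The following is a minimal C sketch under stated assumptions: the helper names are hypothetical, it only formats the string and the matching total stripe count, and it does not call any Lustre API:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "first-last" repeated per_ost times, comma separated, into buf.
 * Returns 0 on success, -1 on invalid arguments or if buf is too small. */
int build_ost_list(char *buf, size_t len, unsigned first, unsigned last,
                   unsigned per_ost)
{
    if (buf == NULL || len == 0 || first > last || per_ost == 0)
        return -1;
    buf[0] = '\0';
    for (unsigned i = 0; i < per_ost; i++) {
        size_t used = strlen(buf);
        int n = snprintf(buf + used, len - used, "%s%u-%u",
                         i ? "," : "", first, last);

        if (n < 0 || (size_t)n >= len - used)
            return -1;   /* output would have been truncated */
    }
    return 0;
}

/* Total stripe count implied by the list, for use as the "-C" value. */
unsigned ost_list_stripes(unsigned first, unsigned last, unsigned per_ost)
{
    assert(first <= last);
    return (last - first + 1) * per_ost;
}
```

With first = 10, last = 13 and per_ost = 2, this produces the string 10-13,10-13 and a stripe count of 8, matching the "-C 8 -o 10-13,10-13" command from the original question.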
>>
>> Cheers, Andreas
>>
>>> On Dec 22, 2025, at 10:59, Wei-Keng Liao <wkliao at northwestern.edu> wrote:
>>>
>>> Hi, Andreas
>>>
>>> The lfs-setstripe man page for option '-C' indicates only negative values
>>> can be used, and the file will be striped over all available OSTs. However,
>>> my wish is to stripe a file over only a subset of available OSTs.
>>> Is it possible to achieve that?
>>>
>>> I just now tried the two commands below without the '-o' option. My intent
>>> is to create a file with stripe count of 8 over 4 OSTs. But they both
>>> ended up with the same result of no overstriping.
>>>
>>> % lfs setstripe -c 4 -C 8 $SCRATCH/dummy
>>> % lfs setstripe -C 8 $SCRATCH/dummy
>>>
>>> % lfs getstripe $SCRATCH/dummy
>>> /pscratch/sd/w/wkliao/dummy
>>> lmm_stripe_count: 8
>>> lmm_stripe_size: 1048576
>>> lmm_pattern: raid0
>>> lmm_layout_gen: 0
>>> lmm_stripe_offset: 168
>>> lmm_pool: original
>>> obdidx     objid      objid         group
>>>    168  19587711  0x12ae27f  0x368000041f
>>>    169  19224808  0x12558e8  0x36c0000428
>>>    170  19783691  0x12de00b  0x3700000413
>>>    171  20429006  0x137b8ce  0x3740000419
>>>    172  19633677  0x12b960d  0x3780000421
>>>    173  20027491  0x1319863  0x37c0000402
>>>    174  19912786  0x12fd852  0x3800000401
>>>    175  20862151  0x13e54c7  0x3840000418
>>>
>>>
>>> As for using the llapi_layout APIs, I am doing the following. It seems like
>>> I am missing some API call to set the number of overstripes or the number of
>>> stripes per OST, as these calls alone do not produce an overstriped layout.
>>>
>>> struct llapi_layout *layout = llapi_layout_alloc();
>>> err = llapi_layout_pattern_set(layout, LLAPI_LAYOUT_OVERSTRIPING);
>>> err = llapi_layout_stripe_count_set(layout, 8);
>>> fd = llapi_layout_file_create(path, O_CREAT|O_RDWR, 0660, layout);
>>>
>>> I found the only way to achieve overstriping is to call
>>> err = llapi_layout_ost_index_set(layout, stripe_number, ost_index);
>>> However, I must then pick the values for the 'ost_index' argument myself.
>>>
>>>
>>> Wei-keng
>>>
>>>> On Dec 22, 2025, at 4:32 AM, Andreas Dilger <adilger at thelustrecollective.com> wrote:
>>>>
>>>> You should be able to use "-C N" to overstripe a file without specifying the OST indexes with "-o ...".
>>>>
>>>> For handling this via llapi_layout calls, I believe it is necessary to set the LLAPI_LAYOUT_OVERSTRIPING flag on the component with llapi_layout_pattern_set(), and then specify a stripe count > OSTCOUNT. I see this isn't documented in the llapi_layout_pattern_set(3) man page (along with LLAPI_LAYOUT_FOREIGN), so please file a Jira ticket for this (and ideally also submit a patch to the man page).
>>>>
>>>> The flag will be cleared if the stripe count <= OSTCOUNT, for improved compatibility with older clients that do not understand overstriping (though that is unlikely these days).
>>>>
>>>> The patch https://review.whamcloud.com/54192 ("LU-16938 utils: setstripe overstripe multiple OST count") along with a few follow-on fixes in Lustre 2.16+ also allows specifying:
>>>>
>>>> lfs setstripe -C -N ... FILE|DIR
>>>>
>>>> (or llapi equivalent) to create 'N' stripes per OST for the file, instead of having to know the exact OST count, if that is more convenient.
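
The two forms of "-C" described above can be modelled in a few lines of C. This is only a sketch of the behaviour stated in this thread, with hypothetical names; it is not Lustre source, it ignores pools, free-space limits, and any maximum stripe count, and applying the same flag-clearing rule to the "-C -N" form is an assumption:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

struct overstripe_plan {
    unsigned stripe_count;  /* total stripes requested for the file */
    bool overstriped;       /* whether the overstriping flag would stay set */
};

/* c_arg > 0: total stripe count, as in "-C N".
 * c_arg < 0: stripes per OST, as in "-C -N", so the total is N * ost_count. */
struct overstripe_plan plan_from_c_arg(int c_arg, unsigned ost_count)
{
    struct overstripe_plan p;

    assert(c_arg != 0 && c_arg != INT_MIN && ost_count > 0);
    if (c_arg > 0)
        p.stripe_count = (unsigned)c_arg;
    else
        p.stripe_count = (unsigned)(-c_arg) * ost_count;

    /* Rule from the thread: the flag is cleared if stripe count <= OSTCOUNT. */
    p.overstriped = p.stripe_count > ost_count;
    return p;
}
```

Under this model, plan_from_c_arg(8, 160) yields 8 stripes with the flag cleared, while plan_from_c_arg(-3, 370) yields 1110 stripes with the flag set.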
>>>>
>>>> Cheers, Andreas
>>>>
>>>>> On Dec 20, 2025, at 18:52, Wei-Keng Liao via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
>>>>>
>>>>> When setting overstriping for a new file, is it possible to let
>>>>> the MDS choose the OST indices?
>>>>>
>>>>> I was able to use the lfs command to set overstriping for a new file.
>>>>> For example, to overstripe a file over 4 OSTs with 2 stripes per OST,
>>>>> I am using this command:
>>>>>
>>>>> % lfs setstripe -c 4 -C 8 -o 10-13,10-13 $SCRATCH/dummy
>>>>>
>>>>> % lfs getstripe $SCRATCH/dummy | grep lmm
>>>>> lmm_stripe_count: 8
>>>>> lmm_stripe_size: 1048576
>>>>> lmm_pattern: raid0,overstriped
>>>>> lmm_layout_gen: 0
>>>>> lmm_stripe_offset: 10
>>>>> lmm_pool: original
>>>>>
>>>>> My understanding is that, without overstriping, the default is that
>>>>> the OSTs are selected by the Lustre MDS based on some policy (maybe OST
>>>>> usage). I wonder if this can also apply to overstriping, i.e. using
>>>>> the lfs command options '-c' and '-C' without the '-o' option.
>>>>>
>>>>> I am also wondering how this can be achieved using the Lustre user
>>>>> C APIs, when making calls to llapi_layout_ost_index_set().
>>>>>
>>>>>
>>>>> Wei-keng
>>>>>
>>>>> _______________________________________________
>>>>> lustre-discuss mailing list
>>>>> lustre-discuss at lists.lustre.org
>>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>
>>
>> Andreas Dilger
>> Principal Lustre Architect
>> adilger at thelustrecollective.com
>>
>>
>>
>
Andreas Dilger
Principal Lustre Architect
adilger at thelustrecollective.com