[Lustre-discuss] stripe offset and hot-spots

Wed Nov 25 09:53:39 PST 2009

On 2009-11-25, at 04:12, rishi pathak wrote:
> We are running Lustre 1.6 with 12TB of space. We set both stripe
> offset and stripe count to '-1', with a stripe size of 2MB.

Using stripe_count = -1 means "always stripe over all OSTs".

> Data on the filesystem consists of very small to very large files.
> Some days back we observed write failures on the fs in spite of having
> 1.2TB of space available (as given by df). The problem was that 2 of
> the OSTs were 100% full.

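That is because you are requesting that every file be striped over
every OST.  If the OSTs are not all the same size, the smaller ones
will fill up first, and writes to files with objects on a full OST
will fail even though df still shows free space overall.  Also, if the
OSTs are "small" then it is more likely that one or the other will be
filled before the others.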

As a general rule, it is better to have a few large OSTs than many
small ones.
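
As a rough illustration of why striping every file over every OST lets
the smallest OSTs run out first, here is a toy model in Python.  It is
only a sketch: the function name, OST sizes and file size are made up,
and it assumes every file is large enough to put an equal share on
each OST, which is not how small files behave in practice.

```python
def fill_until_full(ost_capacity_gb, file_size_gb):
    """Write equal-sized files striped over all OSTs until one OST is full.

    Toy model (assumption): each file places file_size / num_osts on every OST.
    Returns (files_written, free_space_per_ost).
    """
    free = list(ost_capacity_gb)
    per_ost = file_size_gb / len(free)  # equal share of each file per OST
    files = 0
    # Stop as soon as any single OST cannot hold its share of the next file.
    while min(free) >= per_ost:
        free = [f - per_ost for f in free]
        files += 1
    return files, free


# Hypothetical layout: six OSTs, two of them smaller than the others.
capacities = [2000, 2000, 2000, 2000, 1500, 1500]
files, free = fill_until_full(capacities, file_size_gb=60)
print("files written:", files)
print("free per OST (GB):", free)
print("total free (GB):", sum(free))
```

In this model the two smaller OSTs reach zero free space while the
other four still have room, so the total free space looks healthy even
though no further fully striped file can be written.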

> So can we conclude that, more often than not, the said 2 OSTs were
> chosen as the start offset for files that were small (<= 2MB)?

At the time the files are created, Lustre cannot know what size they  
will be.  When there is unbalanced space usage like this it usually  
means that there are a small number of very large files that are  
causing the OSTs to be filled.

There is also ongoing work to improve the allocator so that it does a
better job of continually balancing space usage across all OSTs,
rather than only starting to rebalance once the space usage is very
different between OSTs.

> On Wed, Nov 25, 2009 at 2:59 AM, Andreas Dilger <adilger at sun.com>  
> wrote:
> On 2009-11-24, at 12:17, John White wrote:
> > So I'm trying to get a theoretical understanding of stripe offsets
> > in lustre.  As I understand it, the default offset set to 0 results
> > in all writes beginning at OSS0-OST0.  With a default stripe of 4,
> > doesn't this lead to massive hotspots on OSS0-OST[0-3] (unless *all*
> > writes are consistently large)?
>
> As previously mentioned, the default is NOT to always start files with
> OST0, but rather to have a "round-robin with precession" (not random
> as is commonly mentioned) so that the OST used for stripe 0 of each
> file is evenly distributed among OSTs, regardless of the stripe count.
>
> >       With our setup, we have 4 OSTs per OSS (well, the last OSS
> > has 3, but that's not important right now).  This would appear, in
> > theory, to put OSS0 in a very hot situation.
> >
> >       That said, I wonder how efficient a solution setting the stripe
> > offset of the root of the file system to -1 ("random") is to solving
> > this theoretical situation (given my understanding of striping under
> > lustre).
>
> Well, that is already the default, unless it has been changed at some
> time in the past by someone at your site.  We generally recommend
> against ever changing the starting index of files, since there are
> rarely good reasons to change this.  The man page says:
>
>         A start-ost of -1 allows the MDS to choose the starting
>         index and it is strongly recommended, as this allows
>         space and load balancing to be done by the MDS as needed.
>
> >       In reality, we have a quite varied workload on our file systems
> > with codes ranging from bio to astrophys and, as such, writes
> > ranging from very small to very large.  Any real-world experience
> > with these situations?  Are there strange inefficiencies or
> > administrative difficulties that should be known previous to
> > enabling "random" offsets?  Any info would be greatly appreciated.
>
>
> It isn't random, specifically to avoid the case of non-uniform
> distribution when many clients are creating files at one time.  With
> random stripe-0 OST selection, it is inevitable that some OSTs get one
> or two more objects, and some OSTs get one or two fewer objects, and
> this can cause dramatic performance impacts.
>
> For example, if the average objects per OST is 2, but some OSTs get 4
> objects and others get no objects then the application may see an
> aggregate performance drop of 50% or more, if it were using random
> object distribution.  With round-robin distribution, every OST will
> get 2 objects (assuming objects / OSTs is a whole number).
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> -- 
> Regards--
> Rishi Pathak
> National PARAM Supercomputing Facility
> Center for Development of Advanced Computing(C-DAC)
> Pune University Campus,Ganesh Khind Road
> Pune-Maharastra
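
The round-robin versus random comparison in the quoted message above
can be sketched with a small Python simulation.  This is an
illustration only, not the MDS allocator: it uses plain round-robin
(no precession), the function names are invented, and it assumes the
job runs at the speed of the busiest OST.

```python
import random


def round_robin(num_objects, num_osts, start=0):
    """Place objects on OSTs in strict rotation, beginning at `start`."""
    counts = [0] * num_osts
    for i in range(num_objects):
        counts[(start + i) % num_osts] += 1
    return counts


def random_placement(num_objects, num_osts, rng):
    """Place each object on an OST chosen uniformly at random."""
    counts = [0] * num_osts
    for _ in range(num_objects):
        counts[rng.randrange(num_osts)] += 1
    return counts


def relative_throughput(counts):
    """Mean load divided by the load on the busiest OST.

    Assumption: all OSTs are equally fast and the job finishes when the
    busiest OST finishes, so 1.0 means perfectly balanced.
    """
    mean = sum(counts) / len(counts)
    return mean / max(counts)


rr = round_robin(16, 8)                        # 2 objects on every OST
rnd = random_placement(16, 8, random.Random(0))
print("round-robin:", rr, relative_throughput(rr))
print("random:     ", rnd, relative_throughput(rnd))
# The case from the text: average of 2 objects, but one OST holds 4.
print("4-on-one-OST case:", relative_throughput([4, 0, 2, 2]))
```

With an average of 2 objects per OST, any OST that ends up with 4
halves the relative throughput in this model, which matches the 50%
drop described above; round-robin placement keeps it at 1.0 whenever
the object count is a multiple of the OST count.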

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.