We are running Lustre 1.6 with 12TB of total space. We use a stripe offset and stripe count of '-1' and a stripe size of 2MB.<br>Data on the filesystem ranges from very small to very large files. A few days ago we observed write failures on the filesystem in spite of having

1.2TB of space available (as reported by df). The problem was that 2 of the OSTs were 100% full.<br>So can we conclude that, more often than not, those 2 OSTs were chosen as the start offset for small files (2MB or smaller)?<br>
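<br>To reason about where the bytes of a single file land, here is a minimal sketch. It assumes plain RAID-0 style striping, where chunk i of a file goes to OST (start + i) mod N. The function names and the 4-OST example are hypothetical and are not taken from our system, and the sketch ignores any space-based balancing the MDS may do.<br>

```python
MB = 1024 * 1024

def bytes_per_ost(file_size, start_ost, n_osts, stripe_size=2 * MB):
    """Bytes stored on each OST for one file striped across all OSTs.

    Assumes RAID-0 style layout: chunk i goes to OST (start_ost + i) % n_osts.
    """
    usage = [0] * n_osts
    full_chunks, tail = divmod(file_size, stripe_size)
    # Full chunks are dealt out in whole rounds plus one partial round.
    rounds, extra = divmod(full_chunks, n_osts)
    for i in range(n_osts):
        ost = (start_ost + i) % n_osts
        usage[ost] = (rounds + (1 if i < extra else 0)) * stripe_size
    if tail:
        # The final partial chunk lands on the next OST in sequence.
        usage[(start_ost + extra) % n_osts] += tail
    return usage

def total_usage(files, n_osts, stripe_size=2 * MB):
    """Sum per-OST usage over (file_size, start_ost) pairs."""
    totals = [0] * n_osts
    for size, start in files:
        for ost, used in enumerate(bytes_per_ost(size, start, n_osts, stripe_size)):
            totals[ost] += used
    return totals

if __name__ == "__main__":
    # A 1MB file touches only its starting OST; a 5MB file touches three.
    print(bytes_per_ost(1 * MB, start_ost=2, n_osts=4))
    print(bytes_per_ost(5 * MB, start_ost=1, n_osts=4))
```

Under these assumptions, a file no larger than one stripe occupies space only on its starting OST, so the per-OST totals depend on both the starting index and the file sizes.<br>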

<br><div class="gmail_quote">On Wed, Nov 25, 2009 at 2:59 AM, Andreas Dilger <span dir="ltr"><<a href="mailto:adilger@sun.com">adilger@sun.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<div class="im">On 2009-11-24, at 12:17, John White wrote:<br>

>       So I'm trying to get a theoretical understanding of stripe offsets<br>

> in lustre.  As I understand it, the default offset set to 0 results<br>

> in all writes beginning at OSS0-OST0.  With a default stripe of 4,<br>

> doesn't this lead to massive hotspots on OSS0-OST[0-3] (unless *all*<br>

> writes are consistently large)?<br>

<br>

</div>As previously mentioned, the default is NOT to always start files with<br>

OST0, but rather to have a "round-robin with precession" (not random<br>

as is commonly mentioned) so that the OST used for stripe 0 of each<br>

file is evenly distributed among OSTs, regardless of the stripe count.<br>

<div class="im"><br>

>       With our setup, we have 4 OSTs per OSS (well, the last OSS has 3,<br>

> but that's not important right now).  This would appear, in theory,<br>

> to put OSS0 in a very hot situation.<br>

><br>

>       That said, I wonder how efficient a solution setting the stripe<br>

> offset of the root of the file system to -1 ("random") is to solving<br>

> this theoretical situation (given my understanding of striping under<br>

> lustre).<br>

<br>

</div>Well, that is already the default, unless it has been changed at some<br>

time in the past by someone at your site.  We generally recommend<br>

against ever changing the starting index of files, since there are<br>

rarely good reasons to change this.  The man page says:<br>

<br>

         A start-ost of -1 allows the MDS to choose the starting<br>

         index and it is strongly recommended, as this allows<br>

         space and load balancing to be done by the MDS as needed.<br>

<div class="im"><br>

>       In reality, we have a quite varied workload on our file systems<br>

> with codes ranging from bio to astrophys and, as such, writes<br>

> ranging from very small to very large.  Any real-world experience<br>

> with these situations?  Are there strange inefficiencies or<br>

> administrative difficulties that should be known previous to<br>

> enabling "random" offsets?  Any info would be greatly appreciated.<br>

<br>

<br>

</div>It isn't random, specifically to avoid the case of non-uniform<br>

distribution when many clients are creating files at one time.  With<br>

random stripe-0 OST selection, it is inevitable that some OSTs get one<br>

or two more objects, and some OSTs get one or two fewer objects, and<br>

this can cause dramatic performance impacts.<br>

<br>

For example, if the average objects per OST is 2, but some OSTs get 4<br>

objects and others get no objects then the application may see an<br>

aggregate performance drop of 50% or more, if it were using random<br>

object distribution.  With round-robin distribution, every OST will<br>

get 2 objects (assuming objects / OSTs is a whole number).<br>
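<br>The comparison above can be sketched as a small simulation. This is an illustration only, under stated assumptions: the function names are hypothetical, the round-robin here is a plain one without precession, and the throughput model (each OST shares its bandwidth equally among its objects, and the job finishes when the busiest OST does) is a simplification rather than Lustre's actual behaviour.<br>

```python
import random

def round_robin_counts(n_objects, n_osts, start=0):
    """Objects per OST when each new object goes to the next OST in turn."""
    counts = [0] * n_osts
    for i in range(n_objects):
        counts[(start + i) % n_osts] += 1
    return counts

def random_counts(n_objects, n_osts, seed=0):
    """Objects per OST when each object picks an OST uniformly at random."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    counts = [0] * n_osts
    for _ in range(n_objects):
        counts[rng.randrange(n_osts)] += 1
    return counts

def relative_throughput(counts):
    """Aggregate throughput relative to a perfectly even spread.

    Assumed model: the busiest OST sets the pace, so the ratio is
    (average objects per OST) / (objects on the busiest OST).
    """
    average = sum(counts) / len(counts)
    return average / max(counts)

if __name__ == "__main__":
    n_osts, n_objects = 8, 16  # average of 2 objects per OST
    rr = round_robin_counts(n_objects, n_osts)
    rnd = random_counts(n_objects, n_osts)
    print("round-robin:", rr, relative_throughput(rr))
    print("random:     ", rnd, relative_throughput(rnd))
```

In this model, an OST holding 4 objects when the average is 2 gives a relative throughput of 0.5, which corresponds to the 50% drop described above.<br>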

<div class="im"><br>

Cheers, Andreas<br>

--<br>

Andreas Dilger<br>

Sr. Staff Engineer, Lustre Group<br>

Sun Microsystems of Canada, Inc.<br>

<br>

_______________________________________________<br>

</div><div><div></div><div class="h5">Lustre-discuss mailing list<br>

<a href="mailto:Lustre-discuss@lists.lustre.org">Lustre-discuss@lists.lustre.org</a><br>

<a href="http://lists.lustre.org/mailman/listinfo/lustre-discuss" target="_blank">http://lists.lustre.org/mailman/listinfo/lustre-discuss</a><br>

</div></div></blockquote></div><br><br clear="all"><br>-- <br>Regards--<br>Rishi Pathak<br>National PARAM Supercomputing Facility<br>Center for Development of Advanced Computing(C-DAC)<br>Pune University Campus,Ganesh Khind Road<br>

Pune-Maharastra<br>