[Lustre-discuss] stripe offset and hot-spots

Guy Coates gmpc at sanger.ac.uk
Tue Nov 24 11:53:27 PST 2009


John White wrote:
> Hello Folks,
> 	So I'm trying to get a theoretical understanding of stripe offsets in Lustre.  As I understand it, a default offset of 0 results in all writes beginning at OSS0-OST0.  With a default stripe count of 4, doesn't this lead to massive hotspots on OSS0-OST[0-3] (unless *all* writes are consistently large)?
> 
> 	With our setup, we have 4 OSTs per OSS (well, the last OSS has 3, but that's not important right now).  This would appear, in theory, to put OSS0 in a very hot situation.
> 
> 	That said, I wonder how effective setting the stripe offset of the root of the file system to -1 ("random") would be at solving this theoretical situation (given my understanding of striping under Lustre).
> 
> 	In reality, we have a quite varied workload on our file systems, with codes ranging from bio to astrophys and, as such, writes ranging from very small to very large.  Any real-world experience with these situations?  Are there strange inefficiencies or administrative difficulties that should be known before enabling "random" offsets?  Any info would be greatly appreciated.
> ----------------
> John White

Hi John,

AFAIK, -1 is the default, so objects are allocated randomly, avoiding
the hot-spot situation you describe. Setting the offset to anything
other than random is probably a bad idea.
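
You can check and set the layout on a directory with lfs; something
like the following (option-style syntax; the exact flags have varied
between Lustre releases, so check lfs help on your version):

    # show the current layout of a directory
    lfs getstripe /lustre/scratch

    # leave the stripe count alone, but let the MDS pick the
    # starting OST for each new file (-i/--index is the offset)
    lfs setstripe -i -1 /lustre/scratch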

Setting the stripe count is a more complicated question.

We set the filesystem not to stripe by default, but then set striping
on datasets that we know are going to be hot. Many bio apps like to read
common data sets in parallel, and striping really helps performance.
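
Concretely, that looks something like this (the paths here are just
illustrative):

    # filesystem default: no striping, one object per file
    lfs setstripe -c 1 /lustre

    # hot reference dataset: stripe new files over every OST
    lfs setstripe -c -1 /lustre/data/refsets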

The only problem we have historically seen has been OSTs becoming
unbalanced over time. If an OST fills up, users get sporadic "filesystem
full" errors even though df shows free space.
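
A plain df on a client only shows the aggregate; it is lfs df you want
for spotting this:

    # per-OST usage; a single full OST shows up here even when
    # df still reports plenty of free space overall
    lfs df -h /lustre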

We found that this was typically due to code being left in debug mode
and writing out multi-TByte log files with no striping, which led to
single OSTs filling up. Quotas are your friend here.
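
Something along these lines (user name and limits are made up, and the
setquota syntax has changed between releases, so check the manual for
your version):

    # cap a user's block usage so one runaway log can't fill an OST
    lfs setquota -u jbloggs -B 5T /lustre

    # see what they are actually using
    lfs quota -u jbloggs /lustre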

Lustre 1.6 and later will switch to a weighted allocator if the OSTs
start to become unbalanced (see section 24.4.4 in the 1.6 manual).
Making the OST size as large as possible helps too, so that the average
file size << OST size.
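
The switch-over point is tunable; from memory the knob is the lov
qos_threshold_rr parameter, but verify the name against your version:

    # percentage of imbalance at which round-robin allocation
    # gives way to the weighted (free-space) allocator
    lctl get_param lov.*.qos_threshold_rr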

Ultimately you need to spend time educating your users about the pros
and cons of striping, especially if you have a very mixed application set.

Having good stats is important too; we have historical load graphs for
the OSSs (ganglia) and for our OSTs (rrdtool data collection straight
from the disk controllers). That helps us to identify times when users
have got things wrong and a single OST/OSS is being hammered due to
incorrect striping.
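
If you don't have controller-level stats, the OSSs themselves export
per-OST counters in /proc that are easy to feed into rrdtool (the exact
paths vary a little by version):

    # read/write operation counts and byte totals, per OST, on an OSS
    cat /proc/fs/lustre/obdfilter/*/stats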

Cheers,

Guy

-- 
Dr Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 ex 6925
Fax: +44 (0)1223 496802

