[Lustre-devel] create performance

Sun Mar 8 23:05:34 PDT 2009

On Mar 06, 2009  13:25 +0000, Eric Barton wrote:
> OST object placement is a hard problem with conflicting requirements
> including...
> 
> 1. Even server space balance
> 2. Even server load balance
> 3. Minimal network congestion
> 4. Scalable ultra-wide file layout descriptor
> 5. Scalable placement algorithm
> 
> Implementing a placement algorithm with a centralized server clearly
> isn't scalable and will have to be reworked for CMD.  A starting
> point might be to explore how to ensure CROW goes some way to
> satisfy requirements 1-3 above.

While CROW can help avoid latency for precreating objects (which can
avoid some of the object allocation imbalances hit today when OSTs
are slow precreating objects), it doesn't fundamentally help to
balance space and performance across the OSTs.  On any filesystem
with more than a handful of OSTs there shouldn't be any reason why
OST precreation can't keep up with the MDS create rate.  Johann
and I were discussing this problem and I suspect it is only a defect
in the object precreation code and not a fundamental problem in the
design.

I definitely agree that for CMD we will have distributed object
allocation, but so far it isn't clear whether having more than the
MDSes and/or WBC clients doing the allocation will improve the
situation or make it worse.

> BTW, I've long believed that it's a mistake not to give Lustre any
> inkling that all the creates done by a FPP parallel application are
> somehow related - e.g. via a cluster-wide job identifier.  Surely
> file-per-process placement is very close to shared file placement
> (minus extent locking conflicts :)?

Yes, I agree.  In theory it should be possible to extract this kind
of information from the client processes themselves, either by
examining the process environment (some MPI job launchers store the
MPI rank there for pre-launch shell scripts) or by comparing the
filenames being created by the clients.  Any file-per-process job
will invariably create filenames with the rank in the filename.
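
As a rough illustration of both approaches, here is a minimal Python
sketch.  It is not Lustre code.  The environment variable list (Open MPI,
MPICH/PMI and SLURM rank variables) is my assumption about which
launchers matter, and the helper names rank_from_env(), job_key() and
group_creates() are hypothetical.  The filename heuristic simply
collapses every run of digits, so it would also merge names that differ
only by a timestamp or step number.

```python
import re
from collections import defaultdict

# Rank variables set by some common launchers (assumed list, not exhaustive).
RANK_ENV_VARS = ("OMPI_COMM_WORLD_RANK", "PMI_RANK", "SLURM_PROCID")

_DIGIT_RUN = re.compile(r"\d+")


def rank_from_env(env):
    """Return the MPI rank found in a process environment mapping, or None."""
    for var in RANK_ENV_VARS:
        value = env.get(var)
        if value is not None and value.isdigit():
            return int(value)
    return None


def job_key(filename):
    """Collapse every run of digits to '#', so names differing only by rank match."""
    return _DIGIT_RUN.sub("#", filename)


def group_creates(filenames):
    """Group created filenames by their digit-collapsed template."""
    groups = defaultdict(list)
    for name in filenames:
        groups[job_key(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    names = ["ckpt.0000.dat", "ckpt.0001.dat", "ckpt.0002.dat", "run.log"]
    for key, members in group_creates(names).items():
        print(key, len(members))
```

Creates that land in the same group within a short time window could
then be treated as one job for placement purposes; how that hint would
reach the MDS is left open here.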

> I recognize that fixing this
> still leaves the problem of how to get best F/S utilization when
> different applications share a cluster - but I don't think they are
> necessarily the same problem and trying to address them both with the
> same solution seems wrong.
> 
>     Cheers,
>               Eric
> 
> > -----Original Message-----
> > From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM] On Behalf Of Andreas Dilger
> > Sent: 05 March 2009 9:46 PM
> > To: Nathaniel Rutman
> > Cc: lustre-tech-leads at sun.com
> > Subject: Re: create performance
> > 
> > On Mar 05, 2009  09:39 -0800, Nathaniel Rutman wrote:
> > > Alex Zhuravlav wrote:
> > >>>>>>> Nathaniel Rutman (NR) writes:
> > >>  NR> What about preallocating objects per client, on the clients?
> > >>  NR> Client still needs to get a namespace entry from MDT, but could then
> > >>  NR> hold a write layout lock
> > >>  NR> and do its own round-robin allocation.  For clients with subtree
> > >>  NR> locks this could avoid any need to talk to the MDT and wouldn't need
> > >>  NR> the writeback cache.
> > >>
> > >> I thought "avoid any need to talk to MDT" implies "writeback cache"
> > >
> > > Hmm, well, maybe you consider this a limited version of writeback cache?
> > > It would be kind of a notification of "here is the layout/objects of my
> > > new file, with my new fid."  Fid ranges and object numbers would be
> > > granted to clients for their own use, and the MDT would only have to do
> > > the namespace entry, asynchronously.  I suppose there are recovery issues
> > > we have to worry about then.
> > >
> > > What I was really trying to get at was to avoid the two step process of
> > > client -> MDT -> OST stripe allocation, which includes an extra network
> > > hop in some precreation starvation cases, and always includes some (a
> > > little?) cpu on the MDT:
> > > 1. clients get object grants for every OST.
> > > 2. clients assign objects to new files and send in reqs to MDT, which
> > > just records the objects in the LOV EA
> > > 3. MDT batches up the assigned objects and sends to OSTs for orphan
> > > cleanup llog.
> > 
> > The main problem with having many clients do precreation themselves is
> > that this will invariably cause load imbalance on the OSTs, which will
> > cause long-term file IO performance problems (much in excess of the
> > performance problems hit during precreate).
> > 
> > Cray recently filed a bug on the read performance of files being noticeably
> > hurt by QOS object allocation due to space imbalance, even though the MDS
> > is trying to balance across OSTs locally.  It is using random numbers to
> > do this and so is not selecting OSTs evenly.
> > 
> > In a file-per-process checkpoint (say 100 processes/files on 100 OSTs)
> > the MDS round-robin will allocate 1 object per OST evenly across all
> > OSTs (excluding the case where an OSC is out of preallocated objects).
> > If clients are doing the allocation (or in the past when the MDS did
> > "random" OST selection) then the chance of all 100 clients allocating
> > on 100 OSTs is vanishingly small.  Instead it is likely that some OSTs
> > will have no objects used, and some will have 2 or 3 or 4, and the
> > aggregate write performance will FOREVER be 50% or 33% or 25% of the
> > MDS-round-robin allocated objects for that set of files.  That is far
> > worse than waiting 1s for the MDS to allocate the objects.
> > 
> > IMHO, if we are doing WBC on the client, then there is no _requirement_
> > that the client has to allocate objects for the files at all, and any
> > write data could just be in the client page cache.  Until the new file
> > is visible on the MDS to another client nobody can even try to access
> > the data.  Once the WBC cache is flushed to the MDS then objects can
> > be allocated by the MDS evenly (granting an exclusive layout lock to the
> > client in the process) until the cached client data is either flushed to
> > disk or at least protected by extent locks and can be partially flushed
> > as needed.
> > 
> > Note that I don't totally object to WBC clients doing object allocations
> > if they are creating a large number of files, in essence becoming an
> > MDS that is tracking the load on the OSTs and balancing object creation
> > appropriately.  What I object to is the more common case where each
> > client is creating a single file for a large FPP checkpoint, and the
> > clients are all selecting the OSTs separately.
> > 
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Sr. Staff Engineer, Lustre Group
> > Sun Microsystems of Canada, Inc.
> 
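
The imbalance argument in the quoted checkpoint example (100 files on
100 OSTs) is easy to check numerically.  The following is a minimal
Python sketch, not Lustre code: it assumes each client picks an OST
uniformly at random with no knowledge of the others, and the function
names are made up for illustration.

```python
import math
import random
from collections import Counter


def round_robin(num_files, num_osts):
    """MDS-style allocation: file i goes to OST i mod num_osts."""
    return [i % num_osts for i in range(num_files)]


def independent_random(num_files, num_osts, rng):
    """Each client picks an OST uniformly at random, unaware of the others."""
    return [rng.randrange(num_osts) for _ in range(num_files)]


def summarize(placement, num_osts):
    """Return (idle OST count, objects on the busiest OST)."""
    load = Counter(placement)
    return num_osts - len(load), max(load.values())


def prob_perfect_spread(n):
    """Probability that n independent uniform picks over n OSTs are all distinct."""
    return math.factorial(n) / n ** n


if __name__ == "__main__":
    n = 100
    rng = random.Random(2009)
    print("round-robin (idle, busiest):", summarize(round_robin(n, n), n))
    print("random      (idle, busiest):", summarize(independent_random(n, n, rng), n))
    print("expected idle OSTs:", n * (1 - 1 / n) ** n)
    print("P(perfectly even):", prob_perfect_spread(n))
```

If the busiest OST holds k objects and all files are written at once,
the job finishes only when that OST does, so the aggregate rate is
roughly 1/k of the round-robin case.  That is where the 50%, 33% and 25%
figures come from.  Under the uniform assumption the expected number of
idle OSTs is 100 * 0.99^100, or roughly 37.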

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.