[Lustre-discuss] Regarding redundancy

Tue Apr 7 19:37:41 PDT 2009

On Tue, Apr 07, 2009 at 11:43:19AM -0400, Brian J. Murrell wrote:
> On Tue, 2009-04-07 at 08:34 -0700, Jim Garlick wrote:
> > 
> > Discarding all transactions
> 
> Only transactions subsequent to a missing transaction.
> 
> >  causes a lot of collateral damage in a
> > multi-cluster, mixed parallel job environment where "file-per-process"
> > style I/O predominates.
> 
> Indeed, depending where the AWOL client's transaction sits in the replay
> stream.  So if it was the last transaction, the loss is absolutely
> minimal but if it was the first transaction, the loss is absolutely
> maximal.
> 
> > Could somebody remind me of the use cases protected by this behavior?
> 
> Simply transactional dependency.
> 
> If you don't know what the AWOL client did to a given file, you cannot
> reliably process any further updates to that file, and if you don't have
> the AWOL client to ask what files it has transactions for, everything
> subsequent to that client's transaction has to be suspect.  While I
> don't have any examples off-hand, I am sure one of the devs that
> constantly have their fingers in replay can cite many actual scenarios
> where this is a problem.

For us, error handling is at best:  abort the parallel job on EIO,
throw away the output, and restart from the last checkpoint.  I am
virtually certain that nobody around here tries to recover from an EIO.

Also, if the AWOL node has actually rebooted, it will cause our resource
manager to terminate the whole parallel job.  This is good in the sense that
it gives codes with poor I/O error handling a second chance of noticing
the error before bad physics data has to be analyzed and explained.
Not so with the collateral evictions.

So, it would be pretty easy for us to patch our 1.6.6-based Lustre
to allow those transactions after the missed one to be committed and
avoid the collateral evictions.  We suspect this is a bad idea but we
are having a hard time imagining why.
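
To make the trade-off concrete, here is a toy Python model of the
single-stream behaviour described above.  It is not Lustre code: the
function name and the (transno, client) tuple layout are invented for
illustration.  Replay stops at the first transaction belonging to the
client that never reconnects, and every client with a later transaction
becomes a collateral eviction.

```python
def serial_replay(stream, awol):
    """Toy model of single-stream replay (hypothetical, not Lustre's code).

    stream: list of (transno, client) tuples in transno order.
    awol:   name of the client that never reconnects.
    Returns (replayed transnos, discarded transnos, collaterally evicted clients).
    """
    replayed, discarded, evicted = [], [], set()
    gap = False
    for transno, client in stream:
        if client == awol:
            # This transaction can never be replayed; it opens the gap.
            gap = True
            continue
        if gap:
            # Everything after the gap is treated as suspect.
            discarded.append(transno)
            evicted.add(client)
        else:
            replayed.append(transno)
    return replayed, discarded, evicted
```

With the stream [(1, "a"), (2, "awol"), (3, "b"), (4, "c")], clients b and
c lose their transactions even if they never touched the same file as the
missing client.  If the AWOL client's transaction is last in the stream,
nothing else is discarded, which is the minimal/maximal point Brian makes
above.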

Any insight would be appreciated.

> > In the case of I/O to a shared file, aren't Lustre's error handling
> > obligations met by evicting the single offending client?
> 
> No.  All clients subsequently have to be evicted, per the above.
> 
> > Perhaps I am
> > thinking too provincially because in our environment, I/O to shared
> > files generally (always?) takes place in the context of a parallel job,
> > and the single client eviction and EIO (or reboot of client) should
> > be sufficient to terminate the whole job with an error.
> 
> Yours is probably a scenario where VBR will do really well then given
> that VBR only serializes replay on truly dependent transactions rather
> than the single serial stream (of assumed dependent transactions) that
> replay currently operates with.
> 
> b.
> 
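
For comparison, a rough sketch of dependency-aware replay in the spirit of
what Brian describes for VBR.  This assumes each transaction records the
version of the object it expected to find, which is my reading and not
taken from the Lustre source; the function name, tuple layout, and the use
of the transno as the new version are all hypothetical.

```python
def versioned_replay(stream, awol, versions=None):
    """Toy dependency-aware replay (hypothetical, not Lustre's VBR code).

    stream:   list of (transno, client, obj, pre_version) in transno order,
              where pre_version is the object version the client saw.
    awol:     name of the client that never reconnects.
    versions: committed version per object before replay (default: all 0).
    Returns (replayed transnos, evicted clients).
    """
    versions = dict(versions or {})
    replayed, evicted = [], set()
    for transno, client, obj, pre_version in stream:
        if client == awol or client in evicted:
            continue  # nothing to replay from a missing or evicted client
        if versions.get(obj, 0) == pre_version:
            # No missing update in between: safe to apply.
            versions[obj] = transno
            replayed.append(transno)
        else:
            # The object was changed by a transaction we never saw.
            evicted.add(client)
    return replayed, evicted
```

In this model a file-per-process job loses only the clients that actually
shared an object with the AWOL client; writers to their own files replay
normally regardless of where they sit in the stream.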
