[Lustre-devel] async write and abort_recov

Andreas Dilger andreas.dilger at oracle.com
Mon Jul 12 13:16:32 PDT 2010


On 2010-07-12, at 04:10, Aurelien Degremont wrote:
> I'm wondering how the Lustre client handles recovery when an OST restarts with the abort_recov flag set.
> 
> Let's say a client has pages to flush to an OST, but the OST is stopped and then restarts with -o abort_recov.  There is no recovery, so either:
> 1- the client retakes its extent locks and then retries flushing its pages
> or
> 2- the client can no longer flush, drops the I/O, and returns an error to the caller.

When the client is evicted, it drops all of its locks for that OST, and any unwritten pages for those files are discarded.  While I know Lustre will save errors from async write RPCs into the file descriptor (for later write calls or fsync), I don't know if we save any IO error into the file descriptor if we discard pages due to eviction.  I think the only errors returned are those from in-flight RPCs that are aborted because of the client eviction.

This is the same whether "-o abort_recov" is used or the client is evicted for other reasons (failed lock callbacks, or failed recovery even if abort_recovery is not used).

> If #2, what if the process has already closed the file?
> What if the file is still open and the process tries to do another I/O; will it get an error for the earlier failed I/O?

If the file is not closed yet, then fsync or a later write will return an earlier error.  If the file descriptor is closed then there is no way to return that error.  That is true for local filesystems as well.
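
As a rough illustration of that window, here is a minimal Python sketch of an application that checks for a deferred write error while the descriptor is still open.  This is generic POSIX usage, not Lustre-specific code: the helper name write_and_sync is made up for this example, and whether pages discarded on eviction actually surface as an error at this point is the open question mentioned above.

```python
import os


def write_and_sync(path, data):
    """Write data to path and return only after fsync() and close() succeed.

    Hypothetical helper for illustration.  Raises OSError if open(),
    write(), fsync(), or close() reports a failure, which includes an
    error saved from earlier asynchronous writeback on this descriptor.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # write() may accept fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Last chance to learn about a failed async write: once the
        # descriptor is closed, there is nowhere left to report it.
        os.fsync(fd)
    finally:
        os.close(fd)
```

An application that skips the fsync and ignores the result of close has no way to learn later that its data never reached stable storage.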

> Is abort_recov used only at first start, or does the OST keep applying this flag to any other recovery-like mechanisms until it is stopped?

The "abort_recov" mount option is equivalent to:

	lctl --device {ost dev} abort_recovery

It affects only the initial startup recovery and is ignored afterward.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



