[Lustre-discuss] Regarding redundancy

Tue Apr 7 08:34:28 PDT 2009

On Tue, Apr 07, 2009 at 08:14:55AM -0400, Brian J. Murrell wrote:
[snip]
> If the lost client has a transaction that needs to be replayed, all of
> the transactions up to that missing transaction are replayed but all
> subsequent transactions are discarded and when the recovery timer
> expires, recovery is aborted.
[snip]

Discarding all subsequent transactions causes a lot of collateral
damage in a multi-cluster, mixed parallel job environment where
"file-per-process" style I/O predominates.
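
To make the collateral damage concrete, here is a toy model of the
behavior as Brian describes it above. It is only a sketch and not Lustre
code: the function name replay, the (transno, client) tuples, and the
client names are all made up for illustration, and it assumes that
transactions are replayed strictly in transaction-number order.

```python
def replay(log, reconnected):
    """Toy model of ordered replay with one or more missing clients.

    log         -- iterable of (transno, client) pairs (hypothetical format)
    reconnected -- set of clients that came back for recovery

    Returns (replayed, collateral): what was replayed before the first
    gap, and what was thrown away from clients that DID reconnect.
    """
    replayed, collateral = [], []
    gap_seen = False
    for transno, client in sorted(log):
        if client not in reconnected:
            # This transaction can never be replayed: it marks the gap.
            gap_seen = True
            continue
        if gap_seen:
            # Healthy client, but its transaction follows the gap.
            collateral.append((transno, client))
        else:
            replayed.append((transno, client))
    return replayed, collateral


if __name__ == "__main__":
    # Clients A and B each write their own file; client C never reconnects.
    log = [(1, "A"), (2, "B"), (3, "C"), (4, "A"), (5, "B")]
    replayed, collateral = replay(log, reconnected={"A", "B"})
    print("replayed:  ", replayed)
    print("collateral:", collateral)
```

In this model, transactions 4 and 5 are lost even though A and B
reconnected and, with file-per-process I/O, never touched anything that
C was writing.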

Could somebody remind me of the use cases protected by this behavior?

In the case of I/O to a shared file, aren't Lustre's error handling
obligations met by evicting the single offending client?  Perhaps I am
thinking too provincially because in our environment, I/O to shared
files generally (always?) takes place in the context of a parallel job,
and the single client eviction and EIO (or reboot of client) should
be sufficient to terminate the whole job with an error.

Thanks,

Jim