[Lustre-discuss] Regarding redundancy

Jim Garlick garlick at llnl.gov
Tue Apr 7 08:34:28 PDT 2009


On Tue, Apr 07, 2009 at 08:14:55AM -0400, Brian J. Murrell wrote:
[snip]
> If the lost client has a transaction that needs to be replayed, all of
> the transactions up to that missing transaction are replayed but all
> subsequent transactions are discarded and when the recovery timer
> expires, recovery is aborted.
[snip]

Discarding all subsequent transactions causes a lot of collateral damage in a
multi-cluster, mixed parallel job environment where "file-per-process"
style I/O predominates.
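
To make the collateral damage concrete, here is a toy Python model of the
behavior as described in the quote above. It is only a sketch under my own
assumptions: `Txn`, `replay`, and `collateral_clients` are made-up names for
illustration, not Lustre internals, and the real recovery path is certainly
more involved.

```python
# Toy model only: names and structure are hypothetical, not Lustre internals.
from typing import NamedTuple


class Txn(NamedTuple):
    transno: int  # server-assigned transaction number
    client: str   # client that has to resend this transaction


def replay(txns, reconnected):
    """Replay in transno order until the first transaction whose client is missing.

    Returns (replayed, discarded). Everything from the gap onward is discarded,
    including transactions from clients that did reconnect.
    """
    ordered = sorted(txns, key=lambda t: t.transno)
    for i, txn in enumerate(ordered):
        if txn.client not in reconnected:
            return ordered[:i], ordered[i:]
    return ordered, []


def collateral_clients(txns, reconnected):
    """Clients that reconnected in time but still lost at least one transaction."""
    _, discarded = replay(txns, reconnected)
    return sorted({t.client for t in discarded if t.client in reconnected})
```

With file-per-process I/O, the clients reported by `collateral_clients` were
writing to files the lost client never touched, yet they lose their
transactions anyway.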

Could somebody remind me of the use cases protected by this behavior?

In the case of I/O to a shared file, aren't Lustre's error handling
obligations met by evicting the single offending client?  Perhaps I am
thinking too provincially because in our environment, I/O to shared
files generally (always?) takes place in the context of a parallel job,
and the single client eviction and EIO (or reboot of client) should
be sufficient to terminate the whole job with an error.

Thanks,

Jim


