[Lustre-discuss] OST redundancy between nodes?

Fri Jun 26 11:09:20 PDT 2009

On Fri, 2009-06-26 at 11:51 -0600, Kevin Van Maren wrote:
> If an OST "fails", meaning that the underlying HW has failed (or the 
> connection to the storage has failed -- one reason to use multipath IO), 
> then Lustre will return IO errors to the application (although there is 
> an RFE to not do that).

This is not entirely true.  It is only true when an OST is configured as
"failout".  When an OST is configured as "failover", however (which is the
typical case), the application just blocks until the OST is put back
into service on any of the failover nodes defined for that OST and
the client can reconnect.  At that point, pending operations are resumed
and the application continues.
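
To make the difference concrete, here is a toy model of the two modes in
Python.  It is only a sketch of the semantics described above, not Lustre
code: ToyOst, client_write and the wait_for_recovery callback are all
hypothetical names made up for illustration.

```python
import errno


class ToyOst:
    """Toy stand-in for an OST (hypothetical, not a Lustre API)."""

    def __init__(self, mode):
        if mode not in ("failout", "failover"):
            raise ValueError("mode must be 'failout' or 'failover'")
        self.mode = mode
        self.up = True
        self.data = []


def client_write(ost, payload, wait_for_recovery):
    """Write payload to the toy OST, following the semantics above.

    wait_for_recovery is a hypothetical callback that stands in for the
    time the client spends blocked until the OST is back in service.
    """
    while not ost.up:
        if ost.mode == "failout":
            # failout: the application sees an I/O error right away
            raise OSError(errno.EIO, "OST unavailable (failout)")
        # failover: block until the OST is back, then retry the operation
        wait_for_recovery(ost)
    ost.data.append(payload)
    return len(payload)
```

With mode "failover" and the OST down, client_write() only returns once
the callback has brought the OST back up, and the pending write then
completes.  With mode "failout" it raises OSError with errno set to EIO
and never waits.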

> Normally what happens is the OSS _node_ fails, 
> and the other node mounts the OST (typically done by using 
> Linux-HA/Heartbeat).

Right.  And no applications see any errors while this happens.

And it is worth noting that defining an OST for failover does not
require that more than one OSS be defined for it.  You can provide
"failover service" (i.e. no EIOs to clients) using a single OSS.  If it
dies, then clients just block until it can be repaired.

b.
