[Lustre-discuss] OST redundancy between nodes?

Carlos Santana neubyr at gmail.com
Mon Jul 13 12:06:55 PDT 2009


Comments in-line.
-
CS.


On Fri, Jun 26, 2009 at 1:09 PM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> On Fri, 2009-06-26 at 11:51 -0600, Kevin Van Maren wrote:
>> If an OST "fails", meaning that the underlying HW has failed (or the
>> connection to the storage has failed -- one reason to use multipath IO),
>> then Lustre will return IO errors to the application (although there is
>> an RFE to not do that).
>
> This is not entirely true.  It is only true when an OST is configured as
> "failout".  When an OST is configured as failover however (which is the
> typical case), the application just blocks until the OST can be put back
> into service again on any of the defined failover nodes for that OST and
> the client can reconnect.  At that time, pending operations are resumed
> and the application continues.
>
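
For reference, the mode Brian is describing is a per-OST setting; a
rough sketch of how it is selected with 1.6/1.8-era tools (the NID and
device below are purely illustrative):

  # Default is failover; failout (return EIO instead of blocking) has
  # to be requested explicitly when the OST is formatted.
  mkfs.lustre --fsname=lustre --ost --mgsnode=10.0.0.1@tcp0 \
      --param="failover.mode=failout" /dev/sdb

  # On an already-formatted (and unmounted) OST the same parameter can
  # be changed later with tunefs.lustre:
  tunefs.lustre --param="failover.mode=failout" /dev/sdb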

The application does not block for all commands. For example, lfs df
still works, and so does creating new files (provided another OST is
running). However, commands that need the failed OST, such as df or ls
on affected files, will fail, and they fail even after deactivating the
OST on the MDS.
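
To make that concrete, the sequence in question looks roughly like the
following (1.8-era commands; the device number, target and file names
are purely illustrative):

  # On the MDS: find the OSC device pointing at the dead OST and
  # deactivate it, so the MDS stops creating new objects there.
  lctl dl | grep osc
  lctl --device 7 deactivate      # e.g. 7 = lustre-OST0002-osc

  # On a client: lfs df still responds (the dead OST shows up as
  # inactive/unavailable) and new files land on the remaining OSTs.
  lfs df -h

  # But stat/statfs on data that lives on the dead OST still fails or
  # blocks:
  df /mnt/lustre
  ls -l /mnt/lustre/file_with_objects_on_OST0002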


>> Normally what happens is the OSS _node_ fails,
>> and the other node mounts the OST (typically done by using
>> Linux-HA/Heartbeat).
>
> Right.  And no applications see any errors while this happens.
>
> And it is worth noting that defining an OST for failover does not
> require that more than one OSS be defined for it.  You can provide
> "failover service" (i.e. no EIOs to clients) using a single OSS.  If it
> dies, then clients just block until it can be repaired.
>
> b.
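
Putting Kevin's and Brian's points together, the usual active/passive
OSS pair looks roughly like this (hostnames, NIDs and devices are made
up; the OST has to live on storage both OSS nodes can reach):

  # Format the OST so the MGS knows both OSS NIDs:
  mkfs.lustre --fsname=lustre --ost --mgsnode=10.0.0.1@tcp0 \
      --failnode=10.0.0.12@tcp0 /dev/sdb

  # Normal operation: the primary OSS serves the OST.
  [oss1]# mount -t lustre /dev/sdb /mnt/lustre/ost0

  # If oss1 dies, Heartbeat (or an admin, by hand) mounts the same
  # device on the backup node; clients reconnect and blocked I/O
  # resumes without errors.
  [oss2]# mount -t lustre /dev/sdb /mnt/lustre/ost0

  # With a single OSS, simply remounting the OST on the same node after
  # repair has the same effect: clients block, then resume, no EIOs.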


