[Lustre-discuss] Question on lustre redundancy/failure features

Peter Grandi pg_lus at lus.for.sabi.co.UK
Mon Jun 28 01:28:34 PDT 2010


> I'm looking at using Lustre to implement a centralized storage
> for several virtualized machines.

That's such a cliche, and Lustre is very suitable for it if you
don't mind network latency :-), or if you use a very low latency
fabric.

In general I am surprised (or perhaps not :->) by how many
"clever" people choose to provide resource virtualization and
parallelization at the lower levels of abstraction (e.g. the block
device) rather than at the higher ones (the service protocol), thus
enjoying all the "benefits" of centralization. But then they probably
don't care about availability, and in particular about latency
(and sometimes not even about throughput).

> The key consideration being reliability

Data "reliability" is not a Lustre concern as such. Eventually
Lustre on ZFS will gain what ZFS offers in that regard. Also, Lustre
2.x will have object-level redundancy (sort of like RAID1), somewhat
compromising the purity of its design.

For overall service availability, choose the Lustre version and
patches carefully; that is, do extensive integration testing before
production use.

Lots of sites have reported spending a few months figuring out the
combination of firmware, OS, and Lustre versions that actually
works well together. Lustre setups tend to be demanding and to
exercise corner cases that less ambitious systems don't reach.

> and ease of increasing/replacing capacity.

Add more OSSes with more OSTs; some sites have hundreds or
thousands of them. At the same time avoid having too few MDSes,
which in practice means running more than one Lustre instance (and
that can be done in some cool ways, as nothing prevents a node from
being a frontend for more than one instance).
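
As a minimal sketch of what adding capacity looks like (the
filesystem name "testfs", the MGS NID and the device path below are
invented), a new OST is formatted against the existing MGS and
mounted on the new OSS; clients pick up the extra space without a
remount:

  # on the new OSS; "testfs" and 192.168.0.10@tcp0 (MGS NID) are placeholders
  mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.10@tcp0 /dev/sdb
  mkdir -p /mnt/testfs-ost2
  mount -t lustre /dev/sdb /mnt/testfs-ost2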

Note that, as in many other cases, in your specific application
there is no significant benefit from having a single storage pool
(that is, a single Lustre instance).

> However, I'm still quite confused and haven't read the manual
> fully because I'm tripping on this: what exactly happens if a
> piece of hardware fails?

The manual, the Wiki, a number of papers and presentations and
this mailing list have extensive discussions of various schemes.

Keep in mind that Lustre is fundamentally aimed at being an HPC
filesystem, not an HA one. That is, the primary use of multiple
hardware resources is parallelism, not redundancy.

> For example, if I have a simple 5 machine cluster, one
> MDS/MDT and one failover MDS/MDT. Three OSS/OST machines with 4
> drives each, for 2 sets of MD Raid 1 block devices [ ... ]

That's somewhat unusual, as it leaves parallelization entirely
up to Lustre-level striping, which is perhaps not wise. It is
surely wiser, though, than using parity RAID for the OSTs, which is
what the Lustre docs suggest for data (with RAID10 recommended for
metadata).
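
For what it is worth, a sketch of controlling that striping (the
directory path and the parameters are just an example): 'lfs
setstripe' sets the layout for new files in a directory, and 'lfs
getstripe' shows it:

  # stripe new files over all available OSTs with 1 MiB stripes
  lfs setstripe -c -1 -s 1m /mnt/testfs/vm-images
  lfs getstripe /mnt/testfs/vm-images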

> What happens if one of the OSS/OST dies, say motherboard
> failure?  Because the manual mentions data striping across
> multiple OST, it sounds like either networked RAID 0 or RAID 5.

Sort of like RAID0 but at the object (file or file section) level
instead of block level.
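
A rough worked example, assuming the historical 1 MiB default
stripe size and a stripe count of 3 over OSTs 0, 1 and 2:

  bytes 0-1 MiB -> object on OST0
  bytes 1-2 MiB -> object on OST1
  bytes 2-3 MiB -> object on OST2
  bytes 3-4 MiB -> back to OST0, and so on round-robin

So losing one OST takes out part of every large file striped over
it, much as losing one disk breaks a RAID0 array.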

> In the case of network RAID 0, a single machine failure means the
> whole cluster is dead. It doesn't seem to make sense for Lustre to
> fail in this manner. [ ... ]

Perhaps it does make sense to others. :-)

> Yet the manual warns that Lustre does not have redundancy and
> relies entirely on some kind of hardware RAID being used. So it
> seems to imply that the network RAID 0 is what's implemented.

The manual is pretty clear on that.

> Does this then mean that if I want redundancy on the storage,
> I would basically need to have a failover machine for every
> OSS/OST?

Depending on how much redundancy you want to achieve, you may
need both failover machines and failover drives.
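
A hedged sketch of the usual arrangement (hostnames, NIDs and
device names invented): each OST sits on storage visible to two
OSSes and is formatted so that clients know about the backup
server:

  # the device is shared between oss1 (primary) and oss2 (failover)
  mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.10@tcp0 \
      --failnode=192.168.0.12@tcp0 /dev/mapper/ost0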

> I'm also confused because the manual says an OST is a block
> device such as /dev/sda1 but OSS can be configured to provide
> failover services. [ ... ] Or does that mean this
> functionality is only available if the OST in the cluster are
> standalone SAN devices?

Whichever storage device can be shared across multiple servers
in a hot/warm setup will do; it need not be a standalone SAN box.
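
As an illustration of "hot/warm" (again with invented names): both
OSSes can see the same LUN, but only the active one mounts it, and
an HA agent (or the admin) mounts it on the standby only after the
active node has been confirmed dead:

  # on oss1 (active): serve the OST
  mount -t lustre /dev/mapper/ost0 /mnt/testfs-ost0
  # on oss2 (standby): the same device is visible but stays
  # unmounted until oss1 has been fenced/powered off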

There are detailed discussions of frontend server failover
(various HA schemes) and storage backend replication (DRBD, for
example) setups in the Lustre Wiki and several papers.
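
As a very rough sketch of the DRBD route (the resource name "ost0"
is invented and must already be defined identically in drbd.conf on
both OSSes; the exact initial-sync flag depends on the DRBD
version):

  drbdadm create-md ost0        # on both OSSes: initialise DRBD metadata
  drbdadm up ost0               # on both OSSes: bring the resource up
  drbdadm -- --overwrite-data-of-peer primary ost0   # on the active OSS only
  mkfs.lustre --fsname=testfs --ost --mgsnode=192.168.0.10@tcp0 /dev/drbd0
  mount -t lustre /dev/drbd0 /mnt/testfs-ost0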


