[Lustre-discuss] Question on lustre redundancy/failure features

Brian J. Murrell Brian.Murrell at Oracle.COM
Mon Jun 28 08:10:36 PDT 2010


On Sun, 2010-06-27 at 05:13 +0800, Emmanuel Noobadmin wrote: 
> 
> However, I'm still quite confused and haven't read the manual fully
> because I'm tripping on this: what exactly happens if a piece of
> hardware fails?

What happens depends on which piece of hardware fails.  If it's an OSS
configured for failover, the backup OSS takes over serving the OSTs.

Ditto for an MDS.

If it's a disk in a RAID LUN, well, you replace the disk and let RAID
rebuild the LUN.
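
With Linux MD RAID, for instance, the replacement looks something like
this (device names here are illustrative; hardware RAID controllers do
the equivalent through their own tools):

  mdadm /dev/md0 --fail /dev/sdb1      # mark the dead member failed
  mdadm /dev/md0 --remove /dev/sdb1    # pull it out of the array
  # ...swap the physical disk, partition it to match, then:
  mdadm /dev/md0 --add /dev/sdb1       # rebuild starts automatically
  cat /proc/mdstat                     # watch the resync progress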

> For example, if I have a simple 5 machine cluster, one MDS/MDT and one
> failover MDS/MDT.

We should get you started out correctly with nomenclature and concepts.

For any given filesystem there can be only one MDT.  The MDT is the
actual device/disk (and the associated processes) that stores and
serves the metadata.  You can have one or more MDSes configured to
provide service for it.  Of course, if you have more than one, then
somehow, usually through shared storage, all of those machines must be
able to see the MDT (the disk).

An MDS is a physical machine that hosts (can provide) MDT service.  You
can only have one active MDS at a time -- that is, only one MDS can have
the MDT mounted.  This is paramount: no more than one machine may mount
the MDT at any given time.
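
As a sketch (the fsname, NIDs and device path below are made up; it
assumes a LUN that both mds1 and mds2 can see):

  # format once, from either MDS, naming the backup MDS's NID:
  mkfs.lustre --fsname=testfs --mgs --mdt \
      --failnode=10.0.0.2@tcp0 /dev/mapper/mdt_lun

  # mount on the active MDS only -- never on both at once:
  [root@mds1]# mount -t lustre /dev/mapper/mdt_lun /mnt/mdt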

> Three OSS/OST machines

They are usually just called OSSes.

> with 4 drives each, for 2
> sets of MD RAID 1 block devices and so a total of 6 OSTs if I didn't
> understand the term wrongly.
> 
> What happens if one of the OSS/OST dies, say motherboard failure?

In order to survive such a failure, the OST must be visible to another
OSS, which can then mount it and provide service for it.
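
Done by hand (a failover framework like Heartbeat would normally do
this for you), the takeover is just a mount on the surviving node,
e.g. (device path illustrative):

  # on oss2, only after being certain oss1 is really dead
  # (STONITH it if in doubt):
  [root@oss2]# mount -t lustre /dev/mapper/ost3_lun /mnt/ost3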

> Because the manual mentions data striping across multiple OST, it
> sounds like either networked RAID 0 or RAID 5.

Lustre does not provide any form of data redundancy and expects the
storage below it to provide that, so yes, if you value your data, you
put your OSTs on RAID disk.
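
A minimal sketch of that layering (names illustrative; note that local
MD RAID protects you against disk failure, but you still need shared
storage if you also want OSS failover):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
  mkfs.lustre --fsname=testfs --ost --mgsnode=10.0.0.1@tcp0 /dev/md0
  mount -t lustre /dev/md0 /mnt/ost0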

> In the case of network RAID 0, a single machine failure means the
> whole cluster is dead.

No.  Even if you didn't configure failover (so that another machine can
provide service for the OST(s)), the filesystem is still available for
access to any data that is not on the OSTs of the failed,
non-failover-configured OSS.  Any access to data on the failed OSS's
OSTs will either block (i.e. hang) the client's request until the OSS
is brought back into service, or return an EIO to the client.  Which of
the two happens is configurable by the administrator.
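
For example, to get the EIO behaviour for a dead OST rather than having
clients block, you can deactivate the corresponding import on the
clients (the device number below is illustrative; take it from lctl dl):

  lctl dl                      # list devices, find the OSC for the dead OST
  lctl --device 7 deactivate   # I/O to that OST now errors instead of hanging
  lctl --device 7 activate     # undo, once the OST is back in service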

> It doesn't seem to make sense for Lustre to
> fail in this manner. Where as if Lustre implements network RAID 5, the
> cluster would continue to serve all data despite the dead machine.

I think you are missing the point of failover (with shared disk).  A
failure of an OSS is survivable in that case.

> Yet the manual warns that Lustre does not have redundancy and relies
> entirely on some kind of hardware RAID being used. So it seems to
> imply that the network RAID 0 is what's implemented.

No.  Lustre provides no RAID at all.
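
Striping is laid out like RAID 0, but it is purely a bandwidth/capacity
feature and is set per file or per directory, e.g. (paths illustrative;
-s sets the stripe size, -c the stripe count):

  lfs setstripe -s 1M -c 4 /mnt/testfs/scratch    # new files: 4 OSTs, 1MB stripes
  lfs getstripe /mnt/testfs/scratch/somefile      # show where the objects landed

Lose one of those 4 OSTs without failover and that part of the file is
unreachable -- the redundancy has to come from the RAID underneath each
OST.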

> Does this then mean that if I want redundancy on the storage, I would
> basically need to have a failover machine for every OSS/OST?

Yes.  Typically people configure active/active failover for OSTs.  That
is, if they have enough disk for 12 OSTs, they configure two OSSes and
put 6 OSTs on each, with each OSS also configured to provide service
for the other's 6.  So normally each OSS actively serves 6 OSTs, but if
one of the OSSes fails, the survivor takes over and provides service
for all 12.
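
As a sketch, the pairing is expressed at format time (NIDs and device
paths are made up; say 10.0.0.11/10.0.0.12 are oss1/oss2 and 10.0.0.1
is the MGS):

  # OSTs normally served by oss1, failing over to oss2:
  [root@oss1]# mkfs.lustre --fsname=testfs --ost \
      --mgsnode=10.0.0.1@tcp0 --failnode=10.0.0.12@tcp0 /dev/mapper/ost0

  # OSTs normally served by oss2, failing over to oss1:
  [root@oss2]# mkfs.lustre --fsname=testfs --ost \
      --mgsnode=10.0.0.1@tcp0 --failnode=10.0.0.11@tcp0 /dev/mapper/ost6

...and a framework such as Heartbeat decides who actually mounts what.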

> I'm also confused because the manual says an OST is a block device
> such as /dev/sda1 but OSS can be configured to provide failover
> services. But if the OSS machine which houses the OST dies, how would
> another OSS take over anyway since it would not be able to access the
> other set of data?

You need to be using some sort of shared storage where two computers
can both see the same disk.  This is typically achieved with FC SCSI
type configurations, but it can be done at the lower end with FireWire
(which supports shared access, to the extent that the various hardware
and software implementations allow).  Others here are also using DRBD,
but we (Oracle) don't really have any experience with the robustness of
such a solution, so you will need to test it yourself to your level of
satisfaction.
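
With DRBD the OST sits on the replicated device instead of a shared
LUN; very roughly (resource name and NID are made up, and the
initial-sync details are omitted):

  # after defining resource "ost0" in /etc/drbd.conf on both OSSes:
  drbdadm create-md ost0
  drbdadm up ost0
  # on whichever node will actively serve the OST:
  drbdadm primary ost0
  mkfs.lustre --fsname=testfs --ost --mgsnode=10.0.0.1@tcp0 /dev/drbd0

Again, only the current primary may mount it.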

> Or does that mean this functionality is only available if the OST in
> the cluster are standalone SAN devices?

Well, not necessarily an actual SAN appliance -- and note that a SAN
exports block devices; it's a NAS that provides a filesystem service --
but yes, you are typically talking about disks that are physically
outside of the OSSes and connected via some sharable medium such as FC
SCSI, InfiniBand, etc.

b.
