[Lustre-discuss] Question on lustre redundancy/failure features

William Olson lustre_admin at reachone.com
Mon Jun 28 06:44:15 PDT 2010


Hello, being a newbie myself, I've just recently worked through all of 
these questions. Here's what I've learned..

On 6/26/2010 2:13 PM, Emmanuel Noobadmin wrote:
> I'm looking at using Lustre to implement a centralized storage for
> several virtualized machines. The key consideration being reliability
> and ease of increasing/replacing capacity.
Increasing capacity is easy; replacing it will take some practice and 
careful reading of the manual and the mailing list archives.
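Growing is usually just a matter of formatting a new OST and mounting 
it; the MGS notices it and clients start using it on their own.  A 
rough sketch, with the fsname, the MGS nid, and the device/mount paths 
all invented for illustration:

     mkfs.lustre --fsname=temp --ost --mgsnode=mds1@tcp0 /dev/sdc1
     mount -t lustre /dev/sdc1 /mnt/ost-new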
> However, I'm still quite confused and haven't read the manual fully
> because I'm tripping on this: what exactly happens if a piece of
> hardware fails?
> Perhaps it's because I haven't yet tried to setup Lustre so the terms
> used don't quite translate for me yet. So I'll appreciate some newbie
> hand holding here :)
>
> For example, if I have a simple 5 machine cluster, one MDS/MDT and one
> failover MDS/MDT. Three OSS/OST machines with 4 drives each, for 2
> sets of MD RAID 1 block devices and so a total of 6 OSTs, if I didn't
> understand the term wrongly.
I think you understood it correctly there.
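To make that concrete: each MD RAID 1 device is formatted as its own 
OST, so every OSS exports two of them.  Roughly, on each OSS (the 
fsname, nid, and device names here are just examples, not your actual 
layout):

     # pair the 4 drives into 2 RAID 1 sets
     mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
     mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
     # format each RAID 1 set as one OST of filesystem "temp"
     mkfs.lustre --fsname=temp --ost --mgsnode=mds1@tcp0 /dev/md0
     mkfs.lustre --fsname=temp --ost --mgsnode=mds1@tcp0 /dev/md1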
> What happens if one of the OSS/OST dies, say motherboard failure?
> Because the manual mentions data striping across multiple OST, it
> sounds like either networked RAID 0 or RAID 5.
Networked RAID 0 is the closest analogy.
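Striping is per-file and controlled with lfs setstripe, which is where 
the RAID 0 feel comes from.  For example (1.8-era option syntax, path 
invented):

     # files created under /lustre/scratch get striped across every
     # OST in 1MB chunks; the default stripe count is 1, i.e. each
     # file lives on a single OST
     lfs setstripe -c -1 -s 1M /lustre/scratch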
> In the case of network RAID 0, a single machine failure means the
> whole cluster is dead. It doesn't seem to make sense for Lustre to
> fail in this manner. Where as if Lustre implements network RAID 5, the
> cluster would continue to serve all data despite the dead machine.
This is why the manual points out that it's important to have reliable 
hardware on the back-end.  I would strongly suggest a SAN/NAS solution 
or at least a well-tested and executed backup strategy.
> Yet the manual warns that Lustre does not have redundancy and relies
> entirely on some kind of hardware RAID being used. So it seems to
> imply that the network RAID 0 is what's implemented.
>
Yup.
> This appears to be the case given the example in the manual of a
> simple combined MGS/MDT with two OSS/OST which uses the same fsname
> "temp" for the OSTs, which then combines the two 16MB OST into a
> single 30MB block device mounted as /lustre on the client.
>
> Does this then mean that if I want redundancy on the storage, I would
> basically need to have a failover machine for every OSS/OST?
>
Correct. However, if you are using a 5-node cluster (2 MGS/MDS and 3 
OSS), then the 3 OSS servers could be configured to back each other up 
in the event of a failure, assuming you were using a SAN/NAS solution 
for the storage (see the sketch below).  If not, then I would recommend 
extra drives in each machine that a backup of the failed OST could be 
restored to.
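With shared storage, the cross-backup is declared at format time via 
--failnode, so a surviving OSS can mount its dead neighbor's OST.  A 
sketch, with the nids and the LUN path invented:

     # OST normally served by oss1, with oss2 as its failover partner
     mkfs.lustre --fsname=temp --ost --mgsnode=mds1@tcp0 \
         --failnode=oss2@tcp0 /dev/mapper/lun0
     # clients list both MGS nids so they can follow an MDS failover
     mount -t lustre mds1@tcp0:mds2@tcp0:/temp /lustre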
> I'm also confused because the manual says an OST is a block device
> such as /dev/sda1 but OSS can be configured to provide failover
> services. But if the OSS machine which houses the OST dies, how would
> another OSS take over anyway since it would not be able to access the
> other set of data?
>
> Or does that mean this functionality is only available if the OST in
> the cluster are standalone SAN devices?
This would be the most advisable hardware configuration in my 
experience.  If, on the other hand, you have spare hardware for the 
production servers (such as a replacement mobo, drives, etc.), then you 
can be fairly safe as long as you ensure that you have a proper RAID 
configuration on your Lustre partitions.  You will experience downtime 
while you replace failed core components (mobo, proc, RAM, etc.), but 
if it's just a RAID member HD, then Lustre can keep on truckin'.  
Downtime should only be as long as it takes to replace the part.  We 
make it a point to always have a hot spare of any core production 
machine that we have in the rack.  So if you only have 5 machines to 
work with (and no NAS/SAN), I would suggest moving to a 4-node Lustre 
environment and keeping the 5th server as a hot spare.
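For the record, the "keep on truckin'" case is just a normal MD rebuild 
underneath a live OST, something like (device names invented):

     mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
     # physically swap the disk, then re-add and let it resync
     mdadm /dev/md0 --add /dev/sdb1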

Good Luck!
-Billy Olson


