[Lustre-discuss] Failover / reliability using SAS direct-attached storage
Kevin Van Maren
kevin.van.maren at oracle.com
Fri Jul 22 08:00:55 PDT 2011
Tyler Hawes wrote:
> Apologies if this is a bit newbie, but I'm just getting started,
> really. I'm still in design / testing stage and looking to wrap my
> head around a few things.
> I'm most familiar with Fibre Channel storage. As I understand it, you
> configure a pair of OSS per OST, one actively serving it, the other
> passively waiting in case the primary OSS fails. Please correct me if
> I'm wrong...
No, that's basically it. Lustre works well with FC storage, although a
full SAN configuration (redundant switch fabrics) is not often used:
with only 2 servers needing access to each LUN, and bandwidth to storage
being key, servers are most often directly attached to the FC storage,
with multiple paths to handle controller/path failure and improve BW.
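As a rough illustration of that dual-path setup (device names and the WWID below are hypothetical; use your array vendor's recommended settings), a minimal dm-multipath configuration might look like:

```conf
# /etc/multipath.conf -- illustrative sketch only; vendor settings vary
defaults {
    user_friendly_names yes
}
multipaths {
    multipath {
        wwid   3600a0b80001234560000abcd12345678   # hypothetical LUN WWID
        alias  ost0
    }
}
```

With multipathd running, the OST would then be formatted and mounted via /dev/mapper/ost0, so the failure of one HBA, cable, or controller port is transparent to Lustre.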
But to clarify one point, Lustre is not waiting passively on the backup
server. Lustre can only be active on one server for a given OST at a
time. Some high-availability package, external to Lustre, is
responsible for ensuring Lustre is active on one server (the OST is
mounted on one server). Heartbeat was quite popular, but many sites have
been moving to more modern packages like Pacemaker. It is left
to the HA package to perform failover as necessary, even though most HA
packages do not perform failover by default if the network or back-end
storage link goes down (which is where bonded networks and storage
multipath could come in).
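As a sketch of what that looks like in practice (the resource, node, and device names here are made up), a two-node Pacemaker cluster might manage an OST with an ocf:heartbeat:Filesystem resource, so the backing device is only ever mounted on one server at a time:

```conf
# crm configure -- illustrative two-node OST failover (hypothetical names)
primitive ost0 ocf:heartbeat:Filesystem \
    params device="/dev/mapper/ost0" directory="/mnt/ost0" fstype="lustre" \
    op monitor interval="120s" timeout="60s"
location ost0-primary ost0 50: oss1   # prefer oss1; fail over to oss2
```

Note that working fencing (STONITH) is essential in such a setup: the HA package must be able to guarantee the failed server is really down before mounting the OST elsewhere, or you risk the double-mount scenario described below.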
> With SAS/SATA direct-attached storage (DAS), though, it's a little
> less clear to me. With SATA, I imagine that if an OSS goes down, all
> its OSTs go down with it (whether they be internal or external
> mounted drives), since there is no multipathing. Also, I suppose I'd
> want a hardware RAID controller PCIe card, which would also preclude
> failover since it's not going to have cache and configuration mirrored
> in another OSS's RAID card.
Normally, yes. Sun shipped quite a bit of Lustre storage with failover
using SATA in external enclosures (J4400), but that was special in that
there were (2) SAS expanders per enclosure, and each drive was connected
to a SATA MUX to allow both servers access to the SATA drives.
I am glad you understand the hazards of connecting two servers using
internal RAID controllers with external storage. Until a RAID card is
designed specifically with that in mind (and strictly uses a
write-through cache), it is a very bad idea. [For others, please
consider what would happen to the file system if the RAID card has a
battery-backed cache with a bunch of pending writes that get replayed at
some point _after_ the other server completes recovery.]
If you are using a SAS-attached external RAID enclosure, then it is not
much different than using an FC-attached RAID. That is, the
direct-attached ST2530 (SAS) can be used in place of a direct-attached
ST2540 (FC), with the only architectural change being the use of a SAS
card/cables instead of an FC card/cables. The big difference between
SAS and FC is that
people are not (yet) building SAS-based SANs. Already many FC arrays
have moved to SAS drives on the back end.
> With SAS, there seems to be a new way of doing this that I'm just
> starting to learn about, but is a bit fuzzy still to me. I see that
> with things like Storage Bridge Bay storage servers from the likes of
> Supermicro, there is a method of putting two server motherboards in
> one enclosure, having an internal 10GigE link between them to keep
> cache coherency, some sort of software layer to manage that (?), and
> then you can use inexpensive SAS drives internally and through
> external JBOD chassis. Is anyone using something like this with Lustre?
Some people have used (or at least toyed with using) DRBD and Lustre,
but I would not say it is fast, recommended, or a mainstream Lustre
configuration. But that is one way to replicate internal storage across
servers, to allow Lustre failover.
With SAS drives in an external enclosure, it is possible to configure
shared storage for use with Lustre, although if you are using a JBOD
rather than a raid controller, there are the normal issues (Linux SW
raid/LVM layers are not "clustered", so you have to ensure they are only
active on one node at a time).
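For MD RAID on a shared JBOD, one common precaution (a sketch, not a complete HA solution; the device names and UUID are hypothetical) is to keep the array out of each server's automatic assembly path and only assemble it under the HA package's control:

```shell
# On BOTH servers, keep the shared array out of boot-time auto-assembly:
# omit its ARRAY line from /etc/mdadm.conf and disable auto-assembly, e.g.
#   AUTO -all

# On the ACTIVE node only (normally driven by the HA resource agent):
mdadm --assemble /dev/md0 /dev/sd[b-e]
mount -t lustre /dev/md0 /mnt/ost0

# On failover, the old node must be fenced (powered off) BEFORE the new
# node assembles and mounts, since MD has no cluster-wide locking.
```

The same caution applies to LVM: the volume group must not be activated on both nodes, since neither MD nor plain LVM arbitrates concurrent access.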
> Or perhaps I'm not seeing the forest through the trees and Lustre has
> software features built-in that negate the need for this (such as
> parity of objects at the server level, so you can lose N+1 OSS)?
> Bottom line, what I'm after is figuring out what architecture works
> with inexpensive internal and/or JBOD SAS storage that won't risk data
> loss with the failure of a single drive or server RAID array...
Lustre does not support redundancy in the file system. All data
availability is through RAID protection, combined with server failover.
With internal storage, you lose the failover part. Sun also delivered
quite a bit of storage without failover, based on the x4500/x4540
servers. If your servers do not crash often, and you can live with the
file system being down until it is rebooted, that is also an option
[note that in non-failover mode the file system defaults to returning
errors rather than hanging, but that can be changed].
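That error-versus-hang behavior is set per target; as an illustration (parameter names should be checked against the manual for your Lustre release, and the device path is hypothetical), it can be toggled with tunefs.lustre:

```shell
# Illustrative only -- verify against the Lustre manual for your release.
# Return errors immediately when the OST is unavailable ("failout"):
tunefs.lustre --param="failover.mode=failout" /dev/md0
# ...or make clients block until the OST comes back:
tunefs.lustre --param="failover.mode=failover" /dev/md0
```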