[Lustre-discuss] Failover / reliability using SAS direct-attached storage
Kevin Van Maren
kevin.van.maren at oracle.com
Fri Jul 22 08:00:55 PDT 2011
Tyler Hawes wrote:
> Apologies if this is a bit newbie, but I'm just getting started,
> really. I'm still in design / testing stage and looking to wrap my
> head around a few things.
> I'm most familiar with Fibre Channel storage. As I understand it, you
> configure a pair of OSS per OST, one actively serving it, the other
> passively waiting in case the primary OSS fails. Please correct me if
> I'm wrong...
No, that's basically it. Lustre works well with FC storage, although a
full SAN configuration (redundant switch fabrics) is not often used:
with only 2 servers needing access to each LUN, and bandwidth to storage
being key, servers are most often directly attached to the FC storage,
with multiple paths to handle controller/path failure and improve BW.
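As a rough illustration of that dual-path setup (device names and the WWID below are hypothetical; use your array vendor's recommended settings), a minimal dm-multipath configuration might look like:

```conf
# /etc/multipath.conf -- illustrative sketch only; vendor settings vary
defaults {
    user_friendly_names yes
}
multipaths {
    multipath {
        wwid   3600a0b80001234560000abcd12345678   # hypothetical LUN WWID
        alias  ost0
    }
}
```

With multipathd running, the OST would then be formatted and mounted via /dev/mapper/ost0, so the failure of one HBA, cable, or controller port is transparent to Lustre.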
But to clarify one point, Lustre is not waiting passively on the backup
server. Lustre can only be active on one server for a given OST at a
time. Some high-availability package, external to Lustre, is
responsible for ensuring Lustre is active on one server (the OST is
mounted on one server). Heartbeat was quite popular, but many sites have
been moving to more modern packages like Pacemaker. It is left
to the HA package to perform failover as necessary, even though most HA
packages do not perform failover by default if the network or back-end
storage link goes down (which is where bonded networks and storage
multipath could come in).
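As a sketch of what that looks like in practice (the resource, node, and device names here are made up), a two-node Pacemaker cluster might manage an OST with an ocf:heartbeat:Filesystem resource, so the backing device is only ever mounted on one server at a time:

```conf
# crm configure -- illustrative two-node OST failover (hypothetical names)
primitive ost0 ocf:heartbeat:Filesystem \
    params device="/dev/mapper/ost0" directory="/mnt/ost0" fstype="lustre" \
    op monitor interval="120s" timeout="60s"
location ost0-primary ost0 50: oss1   # prefer oss1; fail over to oss2
```

Note that working fencing (STONITH) is essential in such a setup: the HA package must be able to guarantee the failed server is really down before mounting the OST elsewhere, or you risk the double-mount scenario described below.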
> With SAS/SATA direct-attached storage (DAS), though, it's a little
> less clear to me. With SATA, I imagine that if an OSS goes down, all
> its OSTs go down with it (whether they be internal or external
> mounted drives), since there is no multipathing. Also, I suppose I'd
> want a hardware RAID controller PCIe card, which would also preclude
> failover since it's not going to have cache and configuration mirrored
> in another OSS's RAID card.
Normally, yes. Sun shipped quite a bit of Lustre storage with failover
using SATA in external enclosures (J4400), but that was special in that
there were (2) SAS expanders per enclosure, and each drive was connected
to a SATA MUX to allow both servers access to the SATA drives.
I am glad you understand the hazards of connecting two servers using
internal RAID controllers with external storage. Until a RAID card is
designed specifically with that in mind (and strictly uses a
write-through cache), it is a very bad idea. [For others, please
consider what would happen to the file system if the RAID card has a
battery-backed cache with a bunch of pending writes that get replayed at
some point _after_ the other server completes recovery.]
If you are using a SAS-attached external RAID enclosure, then it is not
much different than using an FC-attached RAID. That is, the
direct-attached ST2530 (SAS) can be used in place of a direct-attached
ST2540 (FC), with the only architectural change being the use of a SAS
card/cables instead of an FC card/cables. The big difference between
SAS and FC is that
people are not (yet) building SAS-based SANs. Already many FC arrays
have moved to SAS drives on the back end.
> With SAS, there seems to be a new way of doing this that I'm just
> starting to learn about, but is a bit fuzzy still to me. I see that
> with things like Storage Bridge Bay storage servers from the likes of
> Supermicro, there is a method of putting two server motherboards in
> one enclosure, having an internal 10GigE link between them to keep
> cache coherency, some sort of software layer to manage that (?), and
> then you can use inexpensive SAS drives internally and through
> external JBOD chassis. Is anyone using something like this with Lustre?
Some people have used (or at least toyed with using) DRBD and Lustre,
but I would not say it is fast, recommended, or a mainstream Lustre
configuration. But that is one way to replicate internal storage across
servers, to allow Lustre failover.
With SAS drives in an external enclosure, it is possible to configure
shared storage for use with Lustre, although if you are using a JBOD
rather than a raid controller, there are the normal issues (Linux SW
raid/LVM layers are not "clustered", so you have to ensure they are only
active on one node at a time).
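For MD RAID on a shared JBOD, one common precaution (a sketch, not a complete HA solution; the device names and UUID are hypothetical) is to keep the array out of each server's automatic assembly path and only assemble it under the HA package's control:

```shell
# On BOTH servers, keep the shared array out of boot-time auto-assembly:
# omit its ARRAY line from /etc/mdadm.conf and disable auto-assembly, e.g.
#   AUTO -all

# On the ACTIVE node only (normally driven by the HA resource agent):
mdadm --assemble /dev/md0 /dev/sd[b-e]
mount -t lustre /dev/md0 /mnt/ost0

# On failover, the old node must be fenced (powered off) BEFORE the new
# node assembles and mounts, since MD has no cluster-wide locking.
```

The same caution applies to LVM: the volume group must not be activated on both nodes, since neither MD nor plain LVM arbitrates concurrent access.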
> Or perhaps I'm not seeing the forest through the trees and Lustre has
> software features built-in that negate the need for this (such as
> parity of objects at the server level, so you can lose N+1 OSS)?
> Bottom line, what I'm after is figuring out what architecture works
> with inexpensive internal and/or JBOD SAS storage that won't risk data
> loss with the failure of a single drive or server RAID array...
Lustre does not support redundancy in the file system. All data
availability is through RAID protection, combined with server failover.
With internal storage, you lose the failover part. Sun also delivered
quite a bit of storage without failover, based on the x4500/x4540
servers. If your servers do not crash often, and you can live with the
file system being down until it is rebooted, that is also an option
[note that in non-failover mode the file system defaults to returning
errors rather than hanging, but that can be changed].
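That error-versus-hang behavior is set per target; as an illustration (parameter names should be checked against the manual for your Lustre release, and the device path is hypothetical), it can be toggled with tunefs.lustre:

```shell
# Illustrative only -- verify against the Lustre manual for your release.
# Return errors immediately when the OST is unavailable ("failout"):
tunefs.lustre --param="failover.mode=failout" /dev/md0
# ...or make clients block until the OST comes back:
tunefs.lustre --param="failover.mode=failover" /dev/md0
```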