[Lustre-discuss] Failover / reliability using SAS direct-attached storage

Tyler Hawes tyler at litpost.com
Sat Jul 23 11:34:35 PDT 2011


Thank you for the detailed response, Kevin. It seems an external Fibre
Channel or SAS RAID is needed, as the idea of losing the file system if
one node goes down doesn't seem good, even if temporary. If Lustre
allowed for a single downed node I'd feel differently. However, it does
have me thinking of building a second cluster for backup/replication,
and that one could use cheap SATA internal storage since it is only for
nearline use, really.




On Jul 22, 2011, at 8:01 AM, Kevin Van Maren <kevin.van.maren at oracle.com> wrote:

> Tyler Hawes wrote:
>> Apologies if this is a bit newbie, but I'm just getting started, really. I'm still in design / testing stage and looking to wrap my head around a few things.
>>
>> I'm most familiar with Fibre Channel storage. As I understand it, you configure a pair of OSS per OST, one actively serving it, the other passively waiting in case the primary OSS fails. Please correct me if I'm wrong...
> No, that's basically it.  Lustre works well with FC storage, although a full SAN configuration (redundant switch fabrics) is not often used: with only 2 servers needing access to each LUN, and bandwidth to storage being key, servers are most often directly attached to the FC storage, with multiple paths to handle controller/path failure and improve BW.
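>
> To make the multipath piece concrete, an /etc/multipath.conf entry for one such directly attached LUN might look roughly like the sketch below (the WWID and alias are hypothetical; check multipath -ll for your actual IDs):
>
>     defaults {
>         user_friendly_names yes
>     }
>     multipaths {
>         multipath {
>             wwid  3600a0b80001234560000567890abcdef   # hypothetical LUN WWID
>             alias ost0                                 # OST is then addressed as /dev/mapper/ost0
>         }
>     }
>
> Each OSS formats and mounts the OST through the /dev/mapper device, so a cable or controller-path failure is handled below Lustre.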
>
> But to clarify one point, Lustre is not waiting passively on the backup server.  Lustre can only be active on one server for a given OST at a time.  Some high-availability package, external to Lustre, is responsible for ensuring Lustre is active on one server (the OST is mounted on one server).  Heartbeat was quite popular, but more people have been moving to more modern packages like Pacemaker.  It is left to the HA package to perform failover as necessary, even though most HA packages do not perform failover by default if the network or back-end storage link goes down (which is where bonded networks and storage multipath could come in).
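>
> A minimal sketch of what that looks like with Pacemaker's crm shell, assuming a two-node cluster (oss01/oss02), a hypothetical multipath device /dev/mapper/ost0, and fencing/STONITH already configured:
>
>     # the OST is just a Filesystem resource of type lustre; Pacemaker
>     # guarantees it is mounted on at most one node at a time
>     crm configure primitive ost0 ocf:heartbeat:Filesystem \
>         params device="/dev/mapper/ost0" directory="/mnt/ost0" fstype="lustre" \
>         op monitor interval="120s" timeout="300s"
>     # prefer the "primary" OSS; fail over to the partner when it dies
>     crm configure location ost0-on-oss01 ost0 100: oss01
>     crm configure location ost0-on-oss02 ost0 50: oss02
>
> The resource names, scores, and timeouts are illustrative only; the important part is that the HA layer, not Lustre, decides where the OST is mounted.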
>
>> With SAS/SATA direct-attached storage (DAS), though, it's a little less clear to me. With SATA, I imagine that if an OSS goes down, all its OSTs go down with it (whether they be internal or externally mounted drives), since there is no multipathing. Also, I suppose I'd want a hardware RAID controller PCIe card, which would also preclude failover since it's not going to have cache and configuration mirrored in another OSS's RAID card.
>
> Normally, yes.  Sun shipped quite a bit of Lustre storage with failover using SATA in external enclosures (J4400), but that was special in that there were two SAS expanders per enclosure, and each drive was connected to a SATA MUX to allow both servers access to the SATA drives.
>
> I am glad you understand the hazards of connecting two servers with internal RAID controllers to shared external storage.  Until a RAID card is designed specifically with that use in mind (and strictly uses a write-through cache), it is a very bad idea.  [For others, please consider what would happen to the file system if the RAID card has a battery-backed cache with a bunch of pending writes that get replayed at some point _after_ the other server completes recovery.]
>
> If you are using a SAS-attached external RAID enclosure, then it is not much different from using an FC-attached RAID.  I.e., the direct-attached ST2530 (SAS) can be used in place of a direct-attached ST2540 (FC), with the only architecture change being the use of a SAS card and cables instead of an FC card and cables.  The big difference between SAS and FC is that people are not (yet) building SAS-based SANs.  Many FC arrays have already moved to SAS drives on the back end.
> http://www.oracle.com/us/products/servers-storage/storage/disk-storage/sun-storage-2500-m2-array-407918.html
>
>> With SAS, there seems to be a new way of doing this that I'm just starting to learn about, but is a bit fuzzy still to me. I see that with things like Storage Bridge Bay storage servers from the likes of Supermicro, there is a method of putting two server motherboards in one enclosure, having an internal 10GigE link between them to keep cache coherency, some sort of software layer to manage that (?), and then you can use inexpensive SAS drives internally and through external JBOD chassis. Is anyone using something like this with Lustre?
>
> Some people have used (or at least toyed with using) DRBD and Lustre, but I would not say it is fast, recommended, or a mainstream Lustre configuration.  But it is one way to replicate internal storage across servers to allow Lustre failover.
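>
> For reference, a DRBD resource for mirroring an internal disk between two OSS nodes is roughly the sketch below (hostnames, devices, and addresses are made up; the OST would be formatted on /dev/drbd0 and only ever mounted on the current DRBD primary):
>
>     # /etc/drbd.d/ost0.res
>     resource ost0 {
>         protocol C;                  # synchronous replication
>         on oss01 {
>             device    /dev/drbd0;
>             disk      /dev/sdb1;     # local backing partition
>             address   10.0.0.1:7788;
>             meta-disk internal;
>         }
>         on oss02 {
>             device    /dev/drbd0;
>             disk      /dev/sdb1;
>             address   10.0.0.2:7788;
>             meta-disk internal;
>         }
>     }
>
> Every write then has to cross the replication link before it is acknowledged, which is the main reason this setup is not fast.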
>
> With SAS drives in an external enclosure, it is possible to configure shared storage for use with Lustre, although if you are using a JBOD rather than a RAID controller, there are the normal issues (Linux SW RAID/LVM layers are not "clustered", so you have to ensure they are only active on one node at a time).
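>
> In other words, whatever performs the failover has to serialize access itself. A hand-run failover of a hypothetical md/LVM-backed OST would look roughly like this (array, VG, and mount names are made up for illustration):
>
>     # on the node giving up the OST (or after it has been fenced/powered off):
>     umount /mnt/ost0
>     vgchange -a n vg_ost0
>     mdadm --stop /dev/md0
>
>     # on the node taking over (only after the other node has released the storage):
>     mdadm --assemble /dev/md0 /dev/sd[b-e]1
>     vgchange -a y vg_ost0
>     mount -t lustre /dev/vg_ost0/ost0 /mnt/ost0
>
> An HA package automates the same sequence, but the "only one node at a time" rule is yours to enforce, typically with fencing.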
>
>> Or perhaps I'm not seeing the forest for the trees and Lustre has software features built in that negate the need for this (such as parity of objects at the server level, so you could lose one of N+1 OSSs)? Bottom line, what I'm after is figuring out what architecture works with inexpensive internal and/or JBOD SAS storage that won't risk data loss with the failure of a single drive or server RAID array...
>
> Lustre does not support redundancy in the file system.  All data availability is through RAID protection, combined with server failover.
>
> With internal storage, you lose the failover part.  Sun also delivered quite a bit of storage without failover, based on the x4500/x4540 servers.  If your servers do not crash often, and you can live with the file system being down until it is rebooted, that is also an option [note that in non-failover mode the file system defaults to returning errors rather than hanging, but that can be changed].
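>
> The failover relationship is declared when the target is formatted; a rough example (fsname, index, NIDs, and device are made up) of the difference:
>
>     # OST with a declared failover partner: clients retry the partner
>     # instead of erroring when the primary OSS is down
>     mkfs.lustre --fsname=testfs --ost --index=0 \
>         --mgsnode=mgs01@tcp --mgsnode=mgs02@tcp \
>         --failnode=oss02@tcp /dev/mapper/ost0
>
>     # clients list both MGS NIDs so either can answer
>     mount -t lustre mgs01@tcp:mgs02@tcp:/testfs /mnt/testfs
>
> Without --failnode, the same mkfs.lustre line gives the non-failover behaviour described above, where clients see errors once the timeout expires.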
>
>
> Kevin
>
>> Thanks,
>>
>> Tyler
>


