[Lustre-discuss] Redundancy with object storage?

Mustafa A. Hashmi mahashmi at gmail.com
Thu Dec 6 04:58:50 PST 2007

Dear Joe, Dante,

Apologies in advance for not replying inline to your comments.

I am getting the impression here that DRBD is being considered a
"remote" mirroring solution, which makes it seem as if the secondary
OSS housing the backup OST is sitting far, far away, rendering it
unreliable or inefficient. Side note: DRBD+ does provide for
mirroring data to a third node, which replicates asynchronously (and
is quite customizable).
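
To give a concrete (if simplified) picture, a two-node DRBD resource
backing an OST looks roughly like the sketch below. The hostnames,
devices and addresses are made up for illustration; protocol C is
DRBD's fully synchronous mode:

    resource ost1 {
      protocol C;                     # fully synchronous replication
      on oss-a {
        device    /dev/drbd0;         # block device the OST lives on
        disk      /dev/sdb1;          # local backing storage
        address   192.168.10.1:7788;  # dedicated replication link
        meta-disk internal;
      }
      on oss-b {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.10.2:7788;
        meta-disk internal;
      }
    }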

One can configure independent network routes for DRBD replication
(which is synchronous, btw), and with heartbeat in the picture and an
NPS (network power switch) accounted for, the overall deployment can
absolutely be a very reliable, highly available and robust
architecture coupling the various technologies being discussed.
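
By way of example, the heartbeat (v1) side of such a setup could look
like the following, again with the made-up names from the DRBD sketch
above. The exact STONITH line depends on your power switch model:

    # /etc/ha.d/ha.cf (sketch)
    node oss-a oss-b
    bcast eth1                 # dedicated heartbeat link
    serial /dev/ttyS0          # second, independent heartbeat path
    auto_failback off
    stonith_host * wti_nps 192.168.20.10 mypassword

    # /etc/ha.d/haresources (sketch) -- oss-a is the preferred
    # primary: promote the DRBD device, then mount the OST
    oss-a drbddisk::ost1 Filesystem::/dev/drbd0::/mnt/ost1::lustre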

Our company uses a small Lustre cluster in the above configuration,
and two of our clients (both financial houses) run similar clustered
solutions. Admittedly these are small (approximately 3 TB, serving no
more than 20 clients each), but they cater to core applications.

DRBD / local storage / HA and Lustre require a bit of know-how to put
together; however, if cost is an issue (or even sometimes when it's
not), the combination is absolutely worth looking into. We've been
running happily for months now -- with many, many fail-overs :)
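
For the curious, a takeover on the surviving node boils down to what
the resource scripts run for you (hypothetical names from the
sketches above; Lustre 1.6-style mount, where mounting an OST starts
the service):

    # after the dead peer has been fenced via the NPS:
    drbdadm primary ost1                  # promote the local replica
    mount -t lustre /dev/drbd0 /mnt/ost1  # bring the OST back online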


On Dec 6, 2007 5:19 PM, Fegan, Joe <Joe.Fegan at hp.com> wrote:
> D. Dante Lorenso wrote:
> > Is it possible to configure Lustre to write Objects to more than 1 node
> > simultaneously such that I am guaranteed that if one node goes down that
> > all files are still accessible?
> As Brian Murrell said earlier, if the data for a certain OST or MDS is visible to only one node then you will lose access to that data when that node is down. Continuous replication of the data is one approach, but commercial Lustre implementations today typically use shared storage hardware instead.
> HP's Lustre-based product (SFS) for example, places all Lustre data on shared disks and uses clustering software to nominate one node as the primary for each Lustre service and another node as the backup. We configure the server nodes in pairs for redundancy; node A is the primary server for OST1 and secondary for OST2, node B is primary for OST2 and secondary for OST1. This means that as long as either A or B is up clients will have access to both OST1 and OST2. This sounds like the sort of configuration you are looking for. To make it work you absolutely need both A and B to be able to see the data for both OST1 and OST2, though only one of them will be serving each OST at a given time of course (if both nodes try to serve the same OST at the same time the underlying ext3 filesystem will get corrupted so fast it'll make your head spin).
> > It is a delicate mounting/unmounting game to ensure that partitions are
> > monitored, mounted, and fail-over in just the right order.
> Absolutely right, this is the hard bit.
> I have no personal experience of DRBD, but from their website I see that it's remote disk mirroring software that works by sending notifications of all changes on a local disk to a remote node. The remote node makes the same changes to one of its local disks, making that disk a sort of remote mirror of the one on the original node -- like long-distance RAID1. You could also think of it as a shared-storage emulator in software, and with that in mind you can see where it would fit into the architecture I outlined above.
> Having said that, I'm not aware of anyone using DRBD in a Lustre environment, so I can't comment on how well it works. Maybe others on this list have experience with it and can comment better. I'd be a bit concerned about the timeliness of updates to the remote mirror, and whether the latency would cause problems after a failover (though DRBD does support ext3, and these are ext3 filesystems under the hood, albeit heavily modified). I'd also wonder about performance, with change notifications for every write being sent over ethernet to the other node, though I'm sure you've thought about that aspect already.
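
Incidentally, Joe's crossed-pair layout maps straight onto heartbeat
v1 haresources as well. A sketch with hypothetical node names and
shared multipath devices (SFS-style shared storage rather than DRBD):

    # each node is preferred primary for one OST and standby for
    # the other (shared-storage variant)
    node-a Filesystem::/dev/mpath/ost1::/mnt/ost1::lustre
    node-b Filesystem::/dev/mpath/ost2::/mnt/ost2::lustre

With this, as long as either node of the pair is up, both OSTs remain
served -- exactly the behavior Joe describes.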
