[Lustre-discuss] Redundancy with object storage?

Fegan, Joe Joe.Fegan at hp.com
Thu Dec 6 04:19:34 PST 2007


D. Dante Lorenso wrote:

> Is it possible to configure Lustre to write Objects to more than 1 node
> simultaneously such that I am guaranteed that if one node goes down that
> all files are still accessible?

As Brian Murrell said earlier, if the data for a certain OST or MDS is visible to only one node then you will lose access to that data when that node is down. Continuous replication of the data is one approach, but commercial Lustre implementations today typically use shared storage hardware instead.

HP's Lustre-based product (SFS), for example, places all Lustre data on shared disks and uses clustering software to nominate one node as the primary for each Lustre service and another as its backup. We configure the server nodes in pairs for redundancy: node A is the primary server for OST1 and secondary for OST2, while node B is primary for OST2 and secondary for OST1. As long as either A or B is up, clients have access to both OST1 and OST2, which sounds like the sort of configuration you are looking for. To make it work, both A and B absolutely must be able to see the data for both OST1 and OST2, though of course only one of them serves a given OST at any one time (if both nodes try to serve the same OST at the same time, the underlying ext3 filesystem will get corrupted so fast it'll make your head spin).
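
To make that concrete, here is roughly what such a pairing looks like in stock Lustre terms (hostnames, NIDs and device names are invented for illustration, and in a real deployment the clustering software drives the takeover rather than a human at a shell):

    # On node A: format OST1 on a shared disk, naming node B as its
    # failover partner
    mkfs.lustre --fsname=testfs --ost --mgsnode=mgs@tcp0 \
        --failnode=nodeB@tcp0 /dev/shared_ost1

    # On node B: format OST2, naming node A as its failover partner
    mkfs.lustre --fsname=testfs --ost --mgsnode=mgs@tcp0 \
        --failnode=nodeA@tcp0 /dev/shared_ost2

    # Normal running: each node serves (mounts) its own primary OST
    nodeA# mount -t lustre /dev/shared_ost1 /mnt/ost1
    nodeB# mount -t lustre /dev/shared_ost2 /mnt/ost2

    # If node A dies, node B takes over OST1 as well -- but only once
    # A is known to be dead; the same OST must never be mounted twice
    nodeB# mount -t lustre /dev/shared_ost1 /mnt/ost1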

> It is a delicate mounting/unmounting game to ensure that partitions are
> monitored, mounted, and fail-over in just the right order.

Absolutely right, this is the hard bit.

I have no personal experience of DRBD, but from their website I see that it's remote disk mirroring software: every change made to a local disk is sent to a remote node, which applies the same change to one of its own disks, making that disk a remote mirror of the original. Like long-distance RAID1. You could also think of it as a shared-storage emulator in software, and with that in mind you can see where it would fit into the architecture I outlined above.
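
For the archives, here is a minimal sketch of the sort of drbd.conf resource I mean, based purely on their documentation (names, addresses and devices invented; check the DRBD docs for the exact syntax of your version):

    resource ost1 {
      protocol C;                 # synchronous: a write completes only
                                  # after the peer has it on disk too
      on nodeA {
        device    /dev/drbd0;     # the mirrored device the OST lives on
        disk      /dev/sdb1;      # local backing disk
        address   192.168.1.1:7788;
        meta-disk internal;
      }
      on nodeB {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.1.2:7788;
        meta-disk internal;
      }
    }

Note that DRBD only allows one node to be Primary on a resource at a time, which incidentally gives you the same protection against the dual-mount corruption I mentioned above.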

Having said that, I'm not aware of anyone using DRBD in a Lustre environment, so I can't comment on how well it works; maybe others on this list have experience with it and can comment better. I'd be a bit concerned about the timeliness of updates to the remote mirror and whether the latency would cause problems after a failover (though DRBD does support ext3, and these are ext3 filesystems under the hood, albeit heavily modified). I'd also wonder about the performance of sending change notifications for every write over ethernet to the other node, though I'm sure you've thought about that aspect already.

Joe.


-----Original Message-----
From: Brian J. Murrell [mailto:Brian.Murrell at Sun.COM]
Sent: 05 December 2007 14:47
To: lustre-discuss
Subject: Re: [Lustre-discuss] Redundancy with object storage?

On Tue, 2007-12-04 at 17:59 -0600, D. Dante Lorenso wrote:
>
> What happens when you try to read a file from the OST that is down?

That depends on whether the OST has been configured for failout or
failover.  In failover mode, the assumption is that another node will
resume service for that OST, so I/O to objects on the failed OST will
block, waiting for the service to be resumed.  In failout mode, I/O to
the failed OST will return EIOs.
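
For reference, failout is selected at format time; something like the
following should do it, though check the operations manual for your
release (failover is the default, and the names here are made up):

    # Format an OST in failout mode: clients get EIO instead of
    # blocking when the OST is down
    mkfs.lustre --fsname=testfs --ost --mgsnode=mgs@tcp0 \
        --param="failover.mode=failout" /dev/sdb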

> I'm
> guessing that read will hang for a considerable period of time.

Forever, or until the OST is repaired, in the case of failover, yes.

> Likely
> that hanging will eventually occur for many files on a box

On a given client, yes.

> simultaneously and the whole box will lock up waiting on I/O it will
> never get

No.  Having even a lot of processes blocked on I/O to a failed OST will
not "lock up" a whole client.  The client will continue to run and
complete tasks that are not dependent on the failed OST.

>  ... essentially taking the whole shebang down.

I guess it depends on how you define shebang.

> Is the road map posted somewhere?

First (non-ad-sponsored) hit on google for "lustre roadmap":
http://www.clusterfs.com/roadmap.html

>   URL?  Any timeline I might want to
> watch and wait for?

Server Network Striping.  Looks like 2.0 in Q4 2008.

> Right, like RAID 1, but at the network level.

Which is effectively what drbd is.

> I have configured a DRBD system with heartbeat in my lab tests and it
> seems to work well enough, but I haven't tied it into Lustre just yet.

Adding Lustre should not be a big hurdle.
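
With heartbeat's v1-style configuration it's little more than one line
in haresources: promote the drbd resource, then mount its device with
-t lustre (resource and mount-point names invented here):

    # /etc/ha.d/haresources -- heartbeat starts resources left to
    # right and stops them right to left, so drbd is promoted to
    # Primary before the OST is mounted, and unmounted before demotion
    nodeA drbddisk::ost1 Filesystem::/dev/drbd0::/mnt/ost1::lustre

That ordering is exactly the delicate mount/umount game you describe
below.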

> It is a delicate mounting/unmounting game to ensure that partitions are
> monitored, mounted, and fail-over in just the right order.

Indeed.

> I'm leaning toward doing the L,D,H solution, but was really hoping for
> something easier.  Are there any online howtos that demonstrate that
> configuration?

I don't know of any HOWTO/cookbook to it.  If you implement it, perhaps
you could create the HOWTO.  :-)

b.





