[Lustre-discuss] Redundancy with object storage?

Tue Dec 4 15:59:34 PST 2007

Brian J. Murrell wrote:
> On Mon, 2007-12-03 at 20:32 -0600, D. Dante Lorenso wrote:
>> The problem that I see is that if any one 
>> piece of the 4-node system fails, the whole system will fail.
> 
> Not quite.  If an OST fails, only the objects on that OST become failed.
> The filesystem will continue to run and service requests from the
> available OSTs.  That means, depending on striping policies, that some
> or all of a file's contents might not be available.
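
To make the striping point concrete, here is a minimal sketch of which parts of a file become unreachable when an OST fails. It assumes plain round-robin (RAID-0 style) striping; the function names, the 1 MiB stripe size, and the OST indices are illustrative assumptions, not taken from Lustre itself.

```python
# Sketch only: models round-robin striping, not Lustre's actual layout code.
# Function names, stripe size, and OST indices are hypothetical.

def ost_for_offset(offset, stripe_size, osts):
    """Return the OST holding the byte at `offset` under round-robin striping."""
    return osts[(offset // stripe_size) % len(osts)]


def unavailable_ranges(file_size, stripe_size, osts, failed):
    """List the (start, end) byte ranges of a file that live on failed OSTs."""
    ranges = []
    for start in range(0, file_size, stripe_size):
        end = min(start + stripe_size, file_size)
        if ost_for_offset(start, stripe_size, osts) in failed:
            ranges.append((start, end))
    return ranges


MIB = 1024 * 1024

# A 4 MiB file striped over OSTs 0 and 1 loses every other stripe if OST 1 fails.
striped = unavailable_ranges(4 * MIB, MIB, osts=[0, 1], failed={1})

# A file stored only on OST 0 is untouched by the same failure.
untouched = unavailable_ranges(4 * MIB, MIB, osts=[0], failed={1})
```

A file whose stripes never touch the failed OST stays fully readable, while a file striped across it has holes at regular intervals.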

What happens when you try to read a file from the OST that is down?  I'm 
guessing that the read will hang for a considerable period of time.  Most 
likely, such hangs will eventually occur for many files on a box 
simultaneously, and the whole box will lock up waiting on I/O it will 
never get ... essentially taking the whole shebang down.

>> Is it possible to configure Lustre to write Objects to more than 1 node 
>> simultaneously such that I am guaranteed that if one node goes down that 
>> all files are still accessible?
> That's called RAID, and right now, no.  It's on the roadmap though.

Is the roadmap posted somewhere?  Is there a URL, or a timeline I might 
want to watch and wait for?

>> This would effectively mean that I 
>> would use 2 times the storage space for each object written and would 
>> require that every cluster have a minimum of 2 nodes.
> This is a description of mirroring.

Right, like RAID 1, but at the network level.

>> I understand the concepts of using DRBD and replicating block devices as 
>> well as creating a full separate cluster for fail-over, but I'm hoping 
>> to build redundancy into a single cluster without having to duplicate my 
>> network with a bunch of active/passive machine combinations.
> 
> Using a reliable (some form of RAID -- which drbd qualifies as) shared
> storage device is the only way to mitigate the SPOF scenario with Lustre
> currently.
> 
> As far as duplication, etc., there is no reason why you cannot mirror
> your drbd devices amongst the hardware you have currently (i.e. pair your
> machines up and create drbd-based, mirrored devices) and use some form
> of failover (i.e. HA heartbeat) to get some redundancy.
> 
> If you were already resigned to halving your storage by mirroring OSTs
> at the Lustre layer, dropping that mirroring down to drbd should not
> impose any more significant costs.  Or maybe I'm misunderstanding your
> concerns with using drbd.

I have configured a DRBD system with heartbeat in my lab tests and it 
seems to work well enough, but I haven't tied it into Lustre just yet. 
I was concerned about the frailty of a system that requires all 3 
(lustre, drbd, and heartbeat) to magically work in unison.

It is a delicate mounting/unmounting game to ensure that partitions are 
monitored, mounted, and failed over in just the right order.  Eliminating 
all the moving parts by using a single solution like Lustre was what I 
was hoping for.
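
For illustration, here is a rough sketch of the ordering I mean, assuming one DRBD resource backing one OST. The resource name, device, and mount point are hypothetical, and in a real setup heartbeat would drive these steps rather than a script like this. The sketch only builds the command lists and runs them through whatever runner is passed in, so it can be dry-run safely.

```python
# Sketch only: resource name, device path, and mount point are hypothetical.
# In practice heartbeat would run the equivalent steps on failover.

def takeover_commands(resource, device, mountpoint):
    """Commands for the node taking over: promote DRBD first, then mount."""
    return [
        ["drbdadm", "primary", resource],
        ["mount", "-t", "lustre", device, mountpoint],
    ]


def release_commands(resource, mountpoint):
    """Commands for the node giving up the OST: unmount first, then demote."""
    return [
        ["umount", mountpoint],
        ["drbdadm", "secondary", resource],
    ]


def run_in_order(commands, runner):
    """Run each command via `runner`; stop at the first non-zero exit code."""
    for cmd in commands:
        if runner(cmd) != 0:
            return False
    return True


def dry_run(cmd):
    """Print the command instead of executing it and report success."""
    print(" ".join(cmd))
    return 0


if __name__ == "__main__":
    run_in_order(takeover_commands("ost0", "/dev/drbd0", "/mnt/ost0"), dry_run)
```

To execute for real, the runner could be `subprocess.call`. Stopping at the first failure is what keeps a node from mounting a device it failed to promote.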

I'm leaning toward the Lustre + DRBD + heartbeat (L, D, H) solution, but 
was really hoping for something easier.  Are there any online howtos 
that demonstrate that configuration?

-- Dante