[Lustre-discuss] Problems with failover

Fri Jan 4 08:10:54 PST 2008

On Thu, 2008-01-03 at 17:34 -0700, Andreas Dilger wrote:

> To be clear - Lustre failover has nothing to do with data replication.
> It is meant only as a mechanism to allow high-availability of shared
> disk.  This means - more than one node can serve shared disk from a
> SAN or multi-port FC/SCSI disks.

How would one build a reliable system with 20 OSTs? Our system contains
20 compute nodes, each with two 200GB drives in a RAID0 configuration.
Each node acts as an OST and as a failover partner for another node,
i.e. 0-1, 1-2, 3-4, etc.

I can start from scratch, so I'm thinking of rebuilding the arrays as
RAID1 to protect against disk failures. But I still wonder whether, if a
node goes down or we lose another drive, we'll be back to the same
problems we've been having.
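
To put rough numbers on the RAID0 versus RAID1 trade-off, here is a
minimal Python sketch. The function name usable_capacity_gb is
hypothetical, and the figures assume vendor (decimal) GB with no
filesystem or Lustre overhead; it only models the 20-node, two-drive
layout described above.

```python
def usable_capacity_gb(nodes, drives_per_node, drive_gb, raid_level):
    """Return total usable capacity in GB across all nodes.

    Assumption: raw drive sizes only, no filesystem overhead.
    """
    if raid_level == 0:
        # Striping: every drive contributes its full size.
        per_node = drives_per_node * drive_gb
    elif raid_level == 1:
        # Mirroring: all drives hold the same data, so one drive's worth.
        per_node = drive_gb
    else:
        raise ValueError("only RAID0 and RAID1 are modelled here")
    return nodes * per_node


raid0_total = usable_capacity_gb(20, 2, 200, 0)
raid1_total = usable_capacity_gb(20, 2, 200, 1)
print("RAID0 total:", raid0_total, "GB")  # 8000 GB, no disk failure tolerated
print("RAID1 total:", raid1_total, "GB")  # 4000 GB, one disk failure per node tolerated
```

So RAID1 would halve the usable space in exchange for surviving one disk
failure per node. It does nothing for the case where a whole node goes
down, which is the part I'm still unsure about.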

-- 
Jeremy Mann
jeremy at biochem.uthscsa.edu

University of Texas Health Science Center 
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: 210-567-2672