[Lustre-discuss] Problems with failover

Thu Jan 3 16:34:43 PST 2008

On Jan 03, 2008  16:35 -0600, Jeremy Mann wrote:
> We have come across two situations where we've had to rebuild our Lustre
> filesystem. Both happened when one of the OSTs' hard drives failed. We
> did set the OSTs up for failover; however, the network was never
> interrupted, so the switch to the failover node never happened.
> 
> How exactly should failover work?

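To be clear, Lustre failover has nothing to do with data replication.
It is meant only as a mechanism to provide high availability for shared
disk.  That is, more than one node can serve the same shared disk from
a SAN or multi-port FC/SCSI disks, so that another node can take over
if the serving node goes down.  A failed disk is not a node failure, so
failover does not help in that case.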

You currently need another mechanism (hardware or software RAID) to 
provide data redundancy in case of disk failure.  We are working to
provide data replication at the Lustre level, but that is not yet
available.
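
As a rough illustration of the distinction, here is a toy Python model
(not Lustre code; the class names, node names and disk name are all
made up for the example).  It assumes one OST on a shared disk with a
primary and a failover server, and shows that failover only changes
which node serves the disk; it cannot bring back a disk that has died.

```python
from dataclasses import dataclass


@dataclass
class SharedDisk:
    """Hypothetical shared disk (e.g. a SAN LUN) visible to both servers."""
    name: str
    healthy: bool = True


@dataclass
class Server:
    """Hypothetical server node that can export the OST."""
    name: str
    up: bool = True


@dataclass
class FailoverOST:
    """Toy model: one OST on shared storage with a primary and a failover node."""
    disk: SharedDisk
    primary: Server
    failover: Server

    def serving_node(self):
        """Return the server able to serve the OST, or None if nobody can."""
        if not self.disk.healthy:
            # Failover moves the service, not the data: with the disk gone,
            # neither node has anything to serve.
            return None
        for node in (self.primary, self.failover):
            if node.up:
                return node
        return None


ost = FailoverOST(SharedDisk("ost0-lun"), Server("oss1"), Server("oss2"))
print(ost.serving_node().name)  # oss1: normal operation

ost.primary.up = False
print(ost.serving_node().name)  # oss2: node failure, failover takes over

ost.primary.up = True
ost.disk.healthy = False
print(ost.serving_node())       # None: disk failure, failover cannot help
```

The last case is the one described in the question: both nodes are up,
so nothing triggers failover, and the data on the failed disk is only
recoverable if something underneath (RAID) kept a redundant copy.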

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.