[Lustre-discuss] Failover / reliability using SAS direct-attached storage

Sun Jul 24 07:25:12 PDT 2011

Mark Hahn wrote:
>> It seems an external fibre
>> or SAS raid is needed,
>>     
>
> to be precise, a redundant-path SAN is needed.  you could do it with 
> commodity disks and Gb, or you can spend almost unlimited amounts on 
> gold-plated disks, FC switches, etc.
>   
Many deployments are done without redundant paths; redundant paths
simply offer additional insurance.

> the range of costs is really quite remarkable, I guess O(100x). 
> compare this to cars where even VERY nice production cars are only 
> a few times more expensive than the most cost-effective ones.
>   

You're comparing two mass-market cars: there is a nearly 1000x
difference in price between a cheap dune buggy and a Bugatti, but
both provide transportation for 1-2 people.

>> as the idea of losing the file system if one
>> node goes down doesn't seem good, even if temporary.
>>     

The clients should just hang on the file system until the server is 
again available.
This is not so different from using NFS with hard mounts.

Note that even with failover, the Lustre file system will be down for
several minutes: the HA package first has to detect the problem, then
safely start up Lustre on the backup server, and then Lustre recovery
has to complete.
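
As a rough illustration of how those three phases add up, here is a
minimal Python sketch. The function name and all of the durations are
made-up placeholders for illustration, not measured Lustre or HA
values; substitute whatever your own HA timeouts and recovery window
turn out to be.

```python
# Hypothetical back-of-the-envelope estimate of a failover outage.
# Every number below is an assumed placeholder, not a measured value.

def failover_downtime(detect_s, startup_s, recovery_s):
    """Return the total outage in seconds.

    The three phases happen one after another, so they simply add:
    detect_s   -- time for the HA package to notice the failure
    startup_s  -- time to safely start Lustre on the backup server
    recovery_s -- time for Lustre recovery to complete
    """
    return detect_s + startup_s + recovery_s


# Assumed example timings in seconds.
phases = {"detect_s": 60, "startup_s": 45, "recovery_s": 150}
total = failover_downtime(**phases)
print(f"estimated outage: {total} s")
```

The point is only that the phases are sequential, so shortening any
one of them still leaves the sum of the other two as a floor.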

> how often do you expect nodes to fail, and why?
>
> regards, mark hahn.
>