[Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

Mon Sep 20 08:36:54 PDT 2010

Bernd Schubert wrote:
> Hello Cory,
>
> On 09/17/2010 11:31 PM, Cory Spitz wrote:
>   
>> Hi, Bernd.
>>
>> On 09/17/2010 02:48 PM, Bernd Schubert wrote:
>>     
>>> On Friday, September 17, 2010, Andreas Dilger wrote:
>>>       
>>>> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>>>>         
>>>>> We're trying to architect a Lustre setup for our group, and want to
>>>>> leverage our available resources. In doing so, we've come to consider
>>>>> multi-purposing several hosts, so that they'll function simultaneously
>>>>> as MDS & OSS.
>>>>>           
>>>> You can't do this and expect recovery to work in a robust manner.  The
>>>> reason is that the MDS is a client of the OSS, and if they are both on the
>>>> same node that crashes, the OSS will wait for the MDS "client" to
>>>> reconnect and will time out recovery of the real clients.
>>>>         
>>> Well, that is some kind of design problem. Even on separate nodes it can 
>>> easily happen, that both MDS and OSS fail, for example power outage of the 
>>> storage rack. In my experience situations like that happen frequently...
>>>
>>>       
>> I think that just argues that the MDS should be on a separate UPS.
>>     

Or dual-redundant UPS devices driving all "critical infrastructure".
Redundant power supplies are the norm for server-class hardware, and
they should be cabled to different circuits (each of which needs to be
sized to sustain the maximum power on its own).
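
As a rough illustration of that sizing rule, here is a minimal sketch of
the arithmetic. The wattages, voltages, breaker ratings and the 80%
derating factor are all assumptions made up for the example, not
figures from any particular site:

```python
def circuit_can_carry(loads_watts, volts, breaker_amps, derate=0.8):
    """Return True if ONE circuit alone can carry the summed maximum load.

    derate is an assumed safety margin for continuous load; adjust it to
    whatever your electrical code or facility requires.
    """
    capacity_watts = volts * breaker_amps * derate
    return sum(loads_watts) <= capacity_watts


# Hypothetical rack: four servers drawing at most 750 W each.
servers = [750, 750, 750, 750]

# With redundant supplies on two circuits, either circuit may end up
# carrying everything, so each one is checked against the full load.
print(circuit_can_carry(servers, volts=208, breaker_amps=20))
print(circuit_can_carry(servers, volts=120, breaker_amps=20))
```

The point is only that each circuit is checked against the whole load,
not half of it.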

> well, there is not only a single reason. Next hardware issue is that
> maybe an IB switch fails. 

Sure, but that's also easy to address (in theory): put OSS nodes on
different leaf switches than MDS nodes, and put the failover pairs on
different switches as well.
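
Those two placement rules are simple enough to check mechanically. A
minimal sketch, assuming a hand-maintained inventory (the node names,
switch names and the table layout below are hypothetical, purely for
illustration):

```python
# Hypothetical inventory: node -> (role, leaf switch, failover partner).
NODES = {
    "mds1": ("MDS", "leaf-a", "mds2"),
    "mds2": ("MDS", "leaf-b", "mds1"),
    "oss1": ("OSS", "leaf-c", "oss2"),
    "oss2": ("OSS", "leaf-d", "oss1"),
}


def placement_problems(nodes):
    """Return a list of violations of the two placement rules."""
    problems = []
    mds_switches = {sw for role, sw, _ in nodes.values() if role == "MDS"}
    for name, (role, switch, partner) in sorted(nodes.items()):
        # Rule 1: an OSS should not share a leaf switch with any MDS.
        if role == "OSS" and switch in mds_switches:
            problems.append(f"{name} shares {switch} with an MDS")
        # Rule 2: failover partners should be on different switches.
        # Each pair is reported once (when name sorts before partner).
        if partner in nodes and name < partner and nodes[partner][1] == switch:
            problems.append(f"{name} and {partner} are both on {switch}")
    return problems


if __name__ == "__main__":
    for line in placement_problems(NODES) or ["placement looks OK"]:
        print(line)
```

This only looks at the leaf switch each node hangs off; it says nothing
about spine switches or cabling.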

In practice, IB switches probably do not fail often enough to worry
about recovery glitches, especially if they have redundant power, but I
certainly recommend that failover partners be on different switch chips
so that in case of a failure it is still possible to get the system up.

I would also recommend using bonded network interfaces to avoid
cable-failure issues (i.e., connect both OSS nodes to both of the leaf
switches, rather than one to each), but there are some outstanding
issues with Lustre on IB bonding (patches in bugzilla).  I would of
course also recommend multipath to disk (loss of connectivity to disk
was mentioned at LUG as one of the biggest causes of Lustre issues).
In general it is easier to have redundant cables than to ensure your HA
package properly monitors cable status and does a failover when
required.

> And then have also seen cascading Lustre
> failures. It starts with an LBUG on the OSS, which triggers another
> problem on the MDS...
>   
Yes, that's why bugs are fixed.  panic_on_lbug may help stop the problem
before it spreads, depending on the issue.

> Also, for us this actually will become a real problem, which cannot be
> easily solved. So this issue will become a DDN priority.
>
>
> Cheers,
> Bernd
>
> --
> Bernd Schubert
> DataDirect Networks
>
>