[Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts

Robert Read rread at whamcloud.com
Fri Sep 17 21:31:03 PDT 2010


Hi,

On Sep 17, 2010, at 14:49 , Bernd Schubert wrote:

> Hello Cory,
> 
> On 09/17/2010 11:31 PM, Cory Spitz wrote:
>> Hi, Bernd.
>> 
>> On 09/17/2010 02:48 PM, Bernd Schubert wrote:
>>> On Friday, September 17, 2010, Andreas Dilger wrote:
>>>> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>>>>> We're trying to architect a Lustre setup for our group, and want to
>>>>> leverage our available resources. In doing so, we've come to consider
>>>>> multi-purposing several hosts, so that they'll function simultaneously
>>>>> as MDS & OSS.
>>>> 
>>>> You can't do this and expect recovery to work in a robust manner.  The
>>>> reason is that the MDS is a client of the OSS, and if they are both on the
>>>> same node that crashes, the OSS will wait for the MDS "client" to
>>>> reconnect and will time out recovery of the real clients.
>>> 
>>> Well, that is some kind of design problem. Even on separate nodes it can 
>>> easily happen, that both MDS and OSS fail, for example power outage of the 
>>> storage rack. In my experience situations like that happen frequently...
>>> 
>> 
>> I think that just argues that the MDS should be on a separate UPS.
> 
> well, there is not only a single reason. Next hardware issue is that
> maybe an IB switch fails. And then have also seen cascading Lustre
> failures. It starts with an LBUG on the OSS, which triggers another
> problem on the MDS...
> Also, for us this actually will become a real problem, which cannot be
> easily solved. So this issue will become a DDN priority.

There is always a possibility that multiple failures will occur, and how far 
that possibility can be reduced depends on one's resources. The point here is 
simply that a configuration with an MDS and an OSS on the same node guarantees 
multiple failures, and an aborted OSS recovery, whenever that node fails.
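
To make the failure mode concrete, here is a toy model in Python. It is not 
Lustre code, and the function and node names are made up for illustration. It 
assumes, as Andreas described above, that OSS recovery waits for every 
previously connected client, including the MDS, and times out if any of them 
cannot reconnect:

```python
def oss_recovery(expected_clients, failed_nodes, placement):
    """Toy model: recovery completes only if every expected client survives.

    expected_clients: names of clients the OSS had connected before the crash
    failed_nodes:     set of node names that went down
    placement:        hypothetical mapping of client name -> node it runs on
    """
    # A client whose node failed cannot reconnect and replay its requests.
    missing = {c for c in expected_clients if placement[c] in failed_nodes}
    return "complete" if not missing else "timed out"


# The MDS counts as one of the OSS's clients.
clients = {"mds", "client1", "client2"}

# The OSS runs on "nodeB" in both layouts; only the MDS placement differs.
separate = {"mds": "nodeA", "client1": "c1", "client2": "c2"}
shared = {"mds": "nodeB", "client1": "c1", "client2": "c2"}

# nodeB (the OSS node) fails and restarts.
print(oss_recovery(clients, {"nodeB"}, separate))  # complete
print(oss_recovery(clients, {"nodeB"}, shared))    # timed out
```

With separate nodes a second failure is merely possible; with a shared node 
every OSS failure is also an MDS failure, so the timeout case is the only case.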

cheers,
robert



