[Lustre-discuss] Multi-Role/Tasking MDS/OSS Hosts
Robert Read
rread at whamcloud.com
Fri Sep 17 21:31:03 PDT 2010
Hi,
On Sep 17, 2010, at 14:49 , Bernd Schubert wrote:
> Hello Cory,
>
> On 09/17/2010 11:31 PM, Cory Spitz wrote:
>> Hi, Bernd.
>>
>> On 09/17/2010 02:48 PM, Bernd Schubert wrote:
>>> On Friday, September 17, 2010, Andreas Dilger wrote:
>>>> On 2010-09-17, at 12:42, Jonathan B. Horen wrote:
>>>>> We're trying to architect a Lustre setup for our group, and want to
>>>>> leverage our available resources. In doing so, we've come to consider
>>>>> multi-purposing several hosts, so that they'll function simultaneously
>>>>> as MDS & OSS.
>>>>
>>>> You can't do this and expect recovery to work in a robust manner. The
>>>> reason is that the MDS is a client of the OSS, and if they are both on the
>>>> same node that crashes, the OSS will wait for the MDS "client" to
>>>> reconnect and will time out recovery of the real clients.
>>>
>>> Well, that is some kind of design problem. Even on separate nodes it can
>>> easily happen that both MDS and OSS fail, for example in a power outage
>>> of the storage rack. In my experience situations like that happen
>>> frequently...
>>>
>>
>> I think that just argues that the MDS should be on a separate UPS.
>
> Well, there is more than a single cause. Another hardware issue is that
> an IB switch might fail. And I have also seen cascading Lustre
> failures: it starts with an LBUG on the OSS, which triggers another
> problem on the MDS...
> Also, for us this actually will become a real problem, which cannot be
> easily solved. So this issue will become a DDN priority.
There is always a possibility that multiple failures will occur, and how
far that possibility can be reduced depends on one's resources. The point
here is simply that a configuration with an MDS and OSS on the same node
guarantees multiple failures, and an aborted OSS recovery, whenever that
node fails.
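
To make the failure mode discussed above concrete, here is a toy model in
Python. It is only a sketch and not Lustre code: the names (Client,
oss_recovery, the 300-second window) are made up for illustration. It
assumes the OSS waits for every client it knew before the crash, and that
an MDS which went down with the same node can be modelled as a client that
never reconnects inside the window.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Client:
    name: str
    # Seconds after the OSS restarts at which this client reconnects.
    # None means it never reconnects (toy assumption for a co-located MDS).
    reconnect_after: Optional[float]


def oss_recovery(clients: List[Client], window: float) -> Tuple[str, float]:
    """Toy recovery window: succeed only if every known client returns in time."""
    delays = [c.reconnect_after for c in clients]
    if all(d is not None and d <= window for d in delays):
        # Everyone came back; recovery ends when the slowest client reconnects.
        return "completed", max(delays, default=0.0)
    # At least one expected client is missing, so the OSS waits out the window.
    return "timed_out", window


WINDOW = 300.0  # hypothetical recovery window in seconds

# MDS on its own node: it reconnects like any other client.
separate = [Client("mds", 5.0), Client("c1", 10.0), Client("c2", 20.0)]

# MDS on the crashed OSS node: modelled as never reconnecting in time.
colocated = [Client("mds", None), Client("c1", 10.0), Client("c2", 20.0)]

print(oss_recovery(separate, WINDOW))
print(oss_recovery(colocated, WINDOW))
```

In this model the separate-node case finishes as soon as the slowest client
is back, while the co-located case always runs to the end of the window,
however quickly the real clients reconnect.
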
cheers,
robert