[Lustre-discuss] LBUG

Wojciech Turek wjt27 at cam.ac.uk
Fri Nov 16 07:53:39 PST 2007


Hi,

Indeed, the client did disconnect from the MDS. We actually see this quite
frequently on other clients during OSS failover as well. Does that
indicate that the MDS is very busy during an OSS failure, so that
client <-> MDS connections time out? Is there a way to prevent such a
situation, perhaps with some MDS tuning?
I don't really see why clients should be unable to communicate with
the MDS when only an OSS is having problems.
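One way to put a number on "quite frequently" is to count the client-side "Connection restored to service ..." messages (like the one Oleg quotes) per client and target. The following is a minimal Python sketch. It assumes each message sits on a single syslog line in exactly the format quoted in this thread, with the NID written as address@network. The regex and the helper name find_reconnects are hypothetical and are not part of any Lustre tool.

```python
import re

# Hypothetical pattern for the "Connection restored" line quoted in this thread.
# Assumption: the whole message is on one syslog line in this exact format.
RESTORED = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+kernel:\s+Lustre:\s+"
    r"(?P<import>\S+):\s+Connection restored to service\s+(?P<target>\S+)\s+"
    r"using nid\s+(?P<nid>\S+?)\.?$"
)


def find_reconnects(lines):
    """Return (timestamp, host, target, nid) for each reconnection message."""
    events = []
    for line in lines:
        m = RESTORED.match(line.strip())
        if m:
            events.append((m["ts"], m["host"], m["target"], m["nid"]))
    return events


if __name__ == "__main__":
    sample = [
        "Nov 15 22:10:14 darwin kernel: Lustre: ddn_home-MDT0000-mdc-00000100cff22800: "
        "Connection restored to service ddn_home-MDT0000 using nid 10.143.245.201@tcp.",
        "Nov 15 22:10:15 darwin kernel: some unrelated message",
    ]
    for event in find_reconnects(sample):
        print(event)
```

Running this against syslog from several clients, and grouping by target, would show whether the MDS reconnections cluster around OSS failover times. It shows only reconnections, not why the original connection was lost.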

Cheers,

Wojciech


On 16 Nov 2007, at 15:46, Oleg Drokin wrote:

> Hello!
>
> On Nov 16, 2007, at 8:43 AM, Wojciech Turek wrote:
>>>> We saw an LBUG message today. It happened during failover of one
>>>> OSS to another.
>>> Actually, the messages suggest that there was an MDS failover as well.
>> Can you specify which messages suggest that? I am asking because,
>> as far as I can see, there was no MDS failover. We have failover
>> configured with Heartbeat, and I can see that everything stayed on the
>> same server.
>
> Nov 15 22:10:14 darwin kernel: Lustre: ddn_home-MDT0000-mdc-00000100cff22800: Connection restored to service ddn_home-MDT0000 using nid 10.143.245.201@tcp.
>
> This message means that the connection to your MDS was restored.
> I cannot tell whether it was actually a failover (sorry, I used the
> wrong word), but this message tells me that the client had disconnected
> from the MDS earlier and later reconnected to it.
> Since you were talking about failovers, I assumed that the MDS might
> have been failed over as well (given the disconnection), but that is
> not necessarily the case.
>
> Bye,
>     Oleg

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27 at cam.ac.uk
tel. +441223763517




