[Lustre-discuss] Lustre clients getting evicted

Kilian CAVALOTTI kilian at stanford.edu
Mon Feb 4 10:43:58 PST 2008


On Monday 04 February 2008 10:17:37 am Brock Palen wrote:
> The
> cluster IS too big, but there isn't a person at the university who is
> willing to pay for anything other than more cluster nodes.  Enough
> with politics.

That's the first time I've heard of a cluster being too big; people 
usually complain about the opposite. :)
The second part sounds very, very familiar, though... Anyway.

> I just had another node get evicted while running code, causing the
> code to lock up.  This time it was the MDS that evicted it.  Pinging
> works though:
>
> [root@nyx350 ~]# lctl ping 141.212.30.184@tcp
> 12345-0@lo
> 12345-141.212.30.184@tcp

Ok.

> I have attached the output of lctl dk  from the client and some
> syslog messages from the MDS.

(recover.c:188:ptlrpc_request_handle_notconn()) import 
nobackup-MDT0000-mdc-000001012bd27c00 of 
nobackup-MDT0000_UUID at 141.212.30.184@tcp abruptly disconnected: 
reconnecting
(import.c:133:ptlrpc_set_import_discon()) 
nobackup-MDT0000-mdc-000001012bd27c00: Connection to service 
nobackup-MDT0000 via nid 141.212.30.184@tcp was lost; 

I will let Lustre people comment on this, but this sure looks like a 
network problem to me.
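
To see whether these disconnects cluster around one server NID, the 
"was lost" messages in a saved debug dump could be tallied. This is only 
a rough sketch: the pattern is copied from the message quoted above (with 
the NID in its real "@tcp" form), other Lustre versions may word it 
differently, and count_lost_connections is a made-up helper name.

```python
import re
from collections import Counter

# Pattern taken from the log excerpt quoted in this thread; it is an
# assumption that other Lustre releases print the same wording.
LOST = re.compile(r"Connection to service (\S+) via nid (\S+) was lost")

def count_lost_connections(log_text):
    """Count 'connection lost' messages per (service, nid) pair."""
    # Collapse all whitespace first so messages wrapped across lines
    # (as in a mail client) still match.
    joined = " ".join(log_text.split())
    return Counter(LOST.findall(joined))
```

Feeding it the contents of the file written by lctl dk gives one count 
per (service, NID) pair.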

Is there any information you can get out of the switches (logs, dropped 
packets, retries, stats, anything)?

> Nope, both servers have 2GB of RAM, and load is almost 0.  No swapping.

Do you see dropped packets or errors in your ifconfig output, on the 
servers and/or clients?
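
If checking ifconfig by hand on every node is tedious, the same counters 
can be read from /proc/net/dev on Linux. Below is a minimal sketch, 
assuming the usual 16-column layout of that file; parse_net_dev and 
suspicious are illustrative names, not part of any Lustre tool.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev content into per-interface error/drop counters."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:  # assumed layout: 8 receive + 8 transmit columns
            continue
        counters[name.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return counters

def suspicious(counters):
    """Return the sorted names of interfaces with any non-zero counter."""
    return sorted(name for name, c in counters.items() if any(c.values()))
```

Running suspicious(parse_net_dev(open("/proc/net/dev").read())) on each 
server and client lists the interfaces worth a closer look. The counters 
are cumulative since boot, so comparing two readings taken a few minutes 
apart says more than a single snapshot.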

Cheers,
-- 
Kilian


