[Lustre-discuss] Lustre clients getting evicted
Kilian CAVALOTTI
kilian at stanford.edu
Mon Feb 4 10:43:58 PST 2008
On Monday 04 February 2008 10:17:37 am Brock Palen wrote:
> The
> cluster IS too big, but there isn't a person at the university who is
> willing to pay for anything other than more cluster nodes. Enough
> with politics.
That's the first time I've heard that a cluster is too big; people
usually complain about the opposite. :)
The second part sounds very, very familiar, though... Anyway.
> I just had another node get evicted while running code causing the
> code to lock up. This time it was the MDS that evicted it. Pinging
> works, though:
>
> [root@nyx350 ~]# lctl ping 141.212.30.184@tcp
> 12345-0@lo
> 12345-141.212.30.184@tcp
Ok.
> I have attached the output of lctl dk from the client and some
> syslog messages from the MDS.
(recover.c:188:ptlrpc_request_handle_notconn()) import
nobackup-MDT0000-mdc-000001012bd27c00 of
nobackup-MDT0000_UUID at 141.212.30.184@tcp abruptly disconnected:
reconnecting
(import.c:133:ptlrpc_set_import_discon())
nobackup-MDT0000-mdc-000001012bd27c00: Connection to service
nobackup-MDT0000 via nid 141.212.30.184@tcp was lost;
I will let Lustre people comment on this, but this sure looks like a
network problem to me.
Is there any information you can get out of the switches (logs, dropped
packets, retries, stats, anything)?
> Nope, both servers have 2 GB of RAM, and load is almost 0. No swapping.
Do you see dropped packets or errors in your ifconfig output, on the
servers and/or clients?
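If it helps to check many nodes quickly, here is a minimal sketch that reads
the same counters ifconfig reports, assuming Linux hosts where /proc/net/dev
is available. The function names parse_net_dev and suspicious are my own,
purely for illustration, and the script only reports the cumulative counters;
it does not tell you when the errors happened.

```python
#!/usr/bin/env python3
"""Report per-interface error and drop counters from /proc/net/dev (Linux)."""

# Column order of /proc/net/dev after the "iface:" prefix.
RX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop",
             "fifo", "colls", "carrier", "compressed"]


def parse_net_dev(text):
    """Return {iface: {"rx_errs": int, "tx_drop": int, ...}} parsed from text."""
    stats = {}
    for line in text.splitlines():
        # Data lines look like "  eth0: 123 45 ..."; the header lines have no colon.
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        try:
            values = [int(v) for v in rest.split()]
        except ValueError:
            continue  # not a counter line
        if len(values) != len(RX_FIELDS) + len(TX_FIELDS):
            continue  # unexpected layout, skip rather than guess
        counters = {}
        for field, value in zip(RX_FIELDS, values[:len(RX_FIELDS)]):
            counters["rx_" + field] = value
        for field, value in zip(TX_FIELDS, values[len(RX_FIELDS):]):
            counters["tx_" + field] = value
        stats[name.strip()] = counters
    return stats


def suspicious(stats):
    """Return only the interfaces with non-zero error or drop counters."""
    keys = ("rx_errs", "rx_drop", "tx_errs", "tx_drop")
    result = {}
    for iface, counters in stats.items():
        nonzero = {k: counters[k] for k in keys if counters[k]}
        if nonzero:
            result[iface] = nonzero
    return result


def main():
    try:
        with open("/proc/net/dev") as f:
            stats = parse_net_dev(f.read())
    except OSError as exc:
        print("cannot read /proc/net/dev:", exc)
        return
    bad = suspicious(stats)
    if not bad:
        print("no interface errors or drops recorded")
    for iface, counters in sorted(bad.items()):
        print(iface, counters)


if __name__ == "__main__":
    main()
```

Run it on the MDS, the OSSes and a few evicted clients, then again a while
later: counters that keep growing on one side point at that host's NIC,
cable or switch port.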
Cheers,
--
Kilian