[Lustre-discuss] Luster clients getting evicted

Mon Feb 4 09:41:04 PST 2008

Hi Brock,

On Monday 04 February 2008 07:11:11 am Brock Palen wrote:
> on our cluster that has been running lustre for about 1 month. I have
> 1 MDT/MGS and 1 OSS with 2 OST's.
>
> Our cluster uses all GigE and has about 608 nodes and 1854 cores.

This seems to be a lot of clients for only one OSS (and thus for only 
one GigE link to the OSS).

> We have a lot of jobs that die and/or go into high IO wait; strace
> shows processes stuck in fstat().
>
> The big problem (I think, and I would like some feedback on it) is that
> 209 of these 608 nodes have the following string in dmesg:
>
> "This client was evicted by"
>
> Is this normal for clients to be dropped like this?  

I'm not an expert here, but evictions typically occur when a client 
hasn't been seen for a certain period by the OSS/MDS. This is often 
related to network problems. Considering your number of clients, if 
they all do I/O operations on the filesystem concurrently, maybe your 
Ethernet switches are the bottleneck and have to drop packets. Is your 
GigE network working fine outside of Lustre?

To eliminate networking issues from the equation, you can try to lctl 
ping your MDS and OSS from a freshly evicted node, and see what you 
get. (lctl ping <your-oss-nid>)
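
As a rough sketch of that check, the function below pings each server NID
in turn and reports which ones fail. The function name ping_nids and the
NIDs in the usage line are placeholders of mine, not your real addresses;
you can get the real ones by running "lctl list_nids" on the MDS and OSS.
It also assumes that lctl ping exits non-zero when the peer cannot be
reached.

```shell
# Ping each Lustre server NID given as an argument and report the result.
# ping_nids is a hypothetical helper name; the NIDs are supplied by you.
ping_nids() {
    failed=0
    for nid in "$@"; do
        # lctl ping prints the peer's NIDs on success; this sketch assumes
        # a non-zero exit status means the peer was unreachable.
        if lctl ping "$nid" >/dev/null 2>&1; then
            echo "OK   $nid"
        else
            echo "FAIL $nid"
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every NID answered.
    [ "$failed" -eq 0 ]
}

# Usage, on a freshly evicted node (placeholder NIDs):
#   ping_nids 192.168.0.1@tcp 192.168.0.2@tcp
```

If the pings succeed from an evicted node, basic LNET connectivity is
probably fine and the problem is more likely load or dropped packets
under heavy traffic.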

> Is there some 
> tuning that needs to be done to the server to carry this many nodes
> out of the box?  We are using default lustre install with Gige.

Do your MDS or OSS show any particularly high load or memory usage? Do 
you see any Lustre-related error messages in their logs?
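
For the log side of that question, here is a minimal sketch. Lustre kernel
messages are normally prefixed with "Lustre:" or "LustreError:", so
filtering on the latter gives a quick picture. The function name
lustre_errors and the log path in the usage lines are assumptions of mine;
adjust them to wherever your syslog writes kernel messages.

```shell
# Print the LustreError lines read from stdin, followed by their count.
# lustre_errors is a hypothetical helper name.
lustre_errors() {
    awk '/LustreError/ { print; n++ } END { printf "total: %d\n", n }'
}

# Usage on the MDS or OSS:
#   dmesg | lustre_errors
#   lustre_errors < /var/log/messages   # path is an assumption
#   uptime; free -m                     # load average and memory usage
```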

Cheers,
-- 
Kilian