[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
Kevin Van Maren
kevin.van.maren at oracle.com
Tue Jul 12 09:10:23 PDT 2011
Rick Wagner wrote:
> We are seeing intermittent client evictions from a new Lustre installation that we are testing. The errors occur on writes from a parallel job running on 32 client nodes, each with 16 tasks writing to a single HDF5 file of ~40MB (512 tasks total). Occasionally, one node will be evicted from an OST, and the code running on that client will experience an IO error.
Yes, evictions are very bad. Worse than an IO error, however, is the
knowledge that a write that previously "succeeded" never made it out of
the client cache to disk (an eviction forces the client to drop any dirty
cache on the floor).
> The directory with the data has a stripe count of 1, and a comparable amount is read in at the start of the job. Sometimes the evictions occur the first time a write is attempted, sometimes after a successful write. There are about 15 minutes between the first and subsequent write attempts.
So you have 512 processes on 32 nodes writing to a single file, which
exists on a single OST.
Have you adjusted any of the network or Lustre tunables? For example,
max_dirty_mb, max_rpcs_in_flight, or the socket buffer sizes?
What are the RPC sizes and the application IO sizes?
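The client-side values are easy to dump so you can report them here. A minimal sketch, assuming the 1.8.x /proc layout (paths are illustrative; `lctl get_param osc.*.max_rpcs_in_flight` should also work):

```python
import glob
import os

def read_osc_tunables(base="/proc/fs/lustre"):
    """Collect per-OSC write tunables; assumes the 1.8.x /proc layout."""
    values = {}
    for name in ("max_rpcs_in_flight", "max_dirty_mb"):
        # One directory per OSC device, e.g. .../osc/fs-OST0000-osc/...
        for path in glob.glob(os.path.join(base, "osc", "*", name)):
            with open(path) as f:
                values[path] = f.read().strip()
    return values

if __name__ == "__main__":
    for path, value in sorted(read_osc_tunables().items()):
        print(path, "=", value)
```

With 16 writers per node hitting one OST, the defaults (8 RPCs in flight, 32MB dirty cache per OSC) can become the bottleneck, so knowing the current values matters.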
> The client and server errors are attached. In the server errors, XXX.XXX.118.141 refers to the client that gets evicted. In the client errors, here are the server names to match with the NIDS:
> lustre-oss-0-2: 172.25.33.248
> lustre-oss-2-0: 172.25.33.246
> lustre-oss-2-2: 172.25.32.118
> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem.
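Your errno guess is right on Linux; the mapping is easy to confirm from any scripting language's standard library, for example:

```python
import errno
import os

# On Linux, the negated codes in the Lustre logs match errno.h:
assert errno.EHOSTUNREACH == 113
assert errno.ENOTCONN == 107

print(os.strerror(errno.EHOSTUNREACH))  # "No route to host"
print(os.strerror(errno.ENOTCONN))      # "Transport endpoint is not connected"
```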
You are using gigabit Ethernet for Lustre?
These errors indicate issues with IP traffic. When you say you
have never seen IP traffic have a problem, do you mean "ssh" and "ping"
work, or have you stress-tested the network outside Lustre (run network
tests from 32 clients to a single server)?
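A crude many-to-one test can be run without Lustre at all. This is only a sketch (hosts, port, and transfer size are placeholders): run the sink on the server, then launch the blaster from all 32 clients at once and compare the rates to what a single client achieves.

```python
import socket
import time

def sink(sock):
    """Accept one connection and drain it, returning total bytes received."""
    conn, _ = sock.accept()
    total = 0
    while True:
        buf = conn.recv(1 << 16)
        if not buf:
            break
        total += len(buf)
    conn.close()
    return total

def blast(host, port, nbytes):
    """Send at least nbytes of zeros to host:port; return achieved MB/s."""
    payload = b"\0" * (1 << 16)
    sent = 0
    start = time.time()
    with socket.create_connection((host, port)) as s:
        while sent < nbytes:
            s.sendall(payload)
            sent += len(payload)
    return sent / (time.time() - start) / 1e6
```

If the aggregate rate collapses or connections stall when all clients fire together, the problem is in the network, not Lustre.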
> But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
> The filesystem is in testing, so there is no other load on it, and when watching the load during writes, the OSS machines hardly notice. The servers are running version 1.8.5, and the client 1.8.4.
> Any advice, or pointers to possible bugs would be appreciated.
You have provided no information about your network (NICs/drivers,
switches, MTU, settings, etc), but it sounds like you are having network
issues, which are exhibiting themselves under load. It is possible a
NIC or the switch is getting overwhelmed by the Lustre traffic, and
getting stuck long enough for TCP to time out.
Are the NICs or the switch reporting dropped packets? Any error
counters on any links? Are pause frames enabled on the NICs and the switch?
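The per-interface error and drop counters can be scraped from sysfs (`ethtool -S <if>` gives finer-grained driver counters). A small sketch, assuming the standard /sys/class/net layout:

```python
import glob
import os

def nic_error_counters(base="/sys/class/net"):
    """Return the nonzero error/drop counters for every interface."""
    bad = {}
    for stat in glob.glob(os.path.join(base, "*", "statistics", "*")):
        name = os.path.basename(stat)
        if "err" in name or "drop" in name:
            with open(stat) as f:
                count = int(f.read().strip())
            if count:
                bad[stat] = count
    return bad

if __name__ == "__main__":
    for path, count in sorted(nic_error_counters().items()):
        print(path, count)
```

Run it on the evicted client and the OSS nodes around the time of an eviction; climbing rx_dropped or tx_errors counts point at the host side, clean counters point at the switch.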
Is that some sort of socket BW test reporting _10_Mb?