[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?

Rick Wagner rpwagner at sdsc.edu
Tue Jul 12 10:44:12 PDT 2011


Hi Kevin,

Thanks very much for the reply; answers to your questions are below.

On Jul 12, 2011, at 9:10 AM, Kevin Van Maren wrote:

> Rick Wagner wrote:
>> Hi,
>> 
>> We are seeing intermittent client evictions from a new Lustre installation that we are testing. The errors occur on writes from a parallel job running on 32 client nodes, each with 16 tasks writing a single HDF5 file of ~40MB (512 tasks total). Occasionally, one node will be evicted from an OST, and the code running on the client will experience an IO error.
>>  
> 
> Yes, evictions are very bad.  Worse than an IO error, however, is the knowledge that a write that previously "succeeded" never made it out of the client cache to disk (eviction forces the client to drop any dirty cache on the floor).
> 
>> The directory with the data has a stripe count of 1, and a comparable amount is read in at the start of the job. Sometimes the evictions occur the first time a write is attempted, sometimes after a successful write. There are about 15 minutes before the first write attempt and between subsequent ones.
>>  
> 
> So you have 512 processes on 32 nodes writing to a single file, which exists on a single OST.

No, each task is writing its own HDF5 file of ~40MB; the total amount of data per write is 20GB. This avoids the need for synchronizing writes to a single file.

> Have you tuned any of the network or Lustre tunables?  For example, max_dirty_mb, max_rpcs_in_flight?  socket buffer sizes?

For network tuning, this is what we have on the clients and servers:

net.core.somaxconn = 10000
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_congestion_control = htcp
net.ipv4.ip_forward = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1

On the Lustre side, max_rpcs_in_flight = 8 and max_dirty_mb = 32. We don't have as much experience tuning Lustre, so we've tended to use the defaults.
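We haven't changed these yet, but for reference this is roughly how we check them (and could raise them temporarily for a test) on a client; the values in the set_param lines are only illustrative, not a recommendation:

  # Current per-OSC settings on a client
  lctl get_param osc.*.max_rpcs_in_flight
  lctl get_param osc.*.max_dirty_mb

  # Raising them for a test run; illustrative values only
  lctl set_param osc.*.max_rpcs_in_flight=32
  lctl set_param osc.*.max_dirty_mb=64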


> What size are the RPCs, application IO sizes?

As I mentioned above, each task is writing a single file, in 5 consecutive 8MB chunks. There are a few other files written by the master tasks, but the failure hasn't occurred on that particular node (yet).
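If it would help, we can confirm the actual RPC sizes from the client-side stats after one of the write phases; this is what we'd look at (the pages-per-RPC histogram for each OSC):

  # On a client, after a write phase
  lctl get_param osc.*.rpc_stats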

>> The client and server errors are attached. In the server errors, XXX.XXX.118.141 refers to the client that gets evicted. In the client errors, here are the server names to match with the NIDS:
>>  lustre-oss-0-2: 172.25.33.248
>>  lustre-oss-2-0: 172.25.33.246
>>  lustre-oss-2-2: 172.25.32.118
>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>> 
>> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem. 
> 
> You are using gigabit Ethernet for Lustre?

The servers are using bonded Myricom 10 Gb/s cards. On the client side, the nodes have Mellanox QDR InfiniBand HCAs, but we use a Mellanox BridgeX BX 4010, and the clients have virtual 10 Gb/s NICs; hence the use of the tcp driver. We do have a problem setting the MTU on the client side, so currently the servers are using an MTU of 9000 and the clients 1500, which means more work for the central 10 Gb/s switch and the bridge.
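For what it's worth, this is how we confirm the mismatch on each side (eth2 is just a placeholder for whichever interface carries the Lustre traffic):

  # Run on a server and on a client
  ip link show eth2 | grep mtu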

> These errors are indicating issues with IP traffic.  When you say you have never seen IP traffic have a problem, you mean "ssh" and "ping" work, or have you stress-tested the network outside Lustre (run network tests from 32 clients to a single server)?

You're right about how I defined IP functionality being available, but that's a good point about stressing the fabric. We've run simultaneous iperf tests, but only until we reached a target bandwidth. Given our goals, I think the many-to-one tests you describe will be necessary.
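For the record, something like this is what we have in mind for the many-to-one test, run from all 32 clients at once against a single OSS (iperf 2 syntax; the host name is a placeholder):

  # On the OSS under test
  iperf -s

  # On each of the 32 clients, started at the same time
  iperf -c lustre-oss-0-2 -t 60 -P 4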

>> But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
>> 
>> The filesystem is in testing, so there is no other load on it, and when watching the load during writes, the OSS machines hardly notice. The servers are running version 1.8.5, and the client 1.8.4.
>> 
>> Any advice, or pointers to possible bugs would be appreciated.
>>  
> 
> You have provided no information about your network (NICs/drivers, switches, MTU, settings, etc), but it sounds like you are having network issues, which are exhibiting themselves under load.  It is possible a NIC or the switch is getting overwhelmed by the Lustre traffic, and getting stuck long enough for TCP to time out.

I've tried to fill in the network information above.

> Are the nics or the switch reporting dropped packets?  Any error counters on any links?  Are pause frames enabled on the nics and the switch?

Looking at the servers and clients, we see no dropped packets on the servers, while the clients do have a few (~10 dropped received packets on each client, out of 1e9). There are no errors on either side, so those packets should have been resent if they were coming over LNET. I will be following up with our networking group to find out if the bridge or switch are showing dropped packets, and I'll ask about the pause frames.
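For completeness, this is what we're checking on the hosts themselves (eth2 again being a placeholder for the Lustre interface); I'll ask the networking group for the equivalent counters from the bridge and switch:

  # Per-NIC drop/error counters
  ethtool -S eth2 | grep -iE 'drop|err'

  # Pause-frame (flow control) settings on the NIC
  ethtool -a eth2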

> Is that some sort of socket BW test reporting _10_Mb?

I'm not sure what this question refers to. Below, I've posted some of the IOR results we've seen from this system, using 32 and 64 client nodes, 1 task per node. As you can see, the system should have more than enough capacity to handle the IO I describe above. We've also done an IOR run using 512 tasks, which succeeded, but I don't have the numbers. It's really the intermittency that drives us crazy.

Thanks again,
Rick


Participating tasks: 32

Summary:
	api                = POSIX
	test filename      = testFile
	access             = file-per-process
	clients            = 32 (1 per node)
	repetitions        = 1
	xfersize           = 1 MiB
	blocksize          = 32 GiB
	aggregate filesize = 1024 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   ----
write     8419       33554432   1024.00    0.051805   124.51     58.80      0   
read      10135      33554432   1024.00    0.002511   103.46     64.72      0   

Max Write: 8418.70 MiB/sec (8827.65 MB/sec)
Max Read:  10134.86 MiB/sec (10627.18 MB/sec)


Participating tasks: 64

Summary:
	api                = POSIX
	test filename      = testFile
	access             = file-per-process
	clients            = 64 (1 per node)
	repetitions        = 1
	xfersize           = 1 MiB
	blocksize          = 32 GiB
	aggregate filesize = 2048 GiB

access    bw(MiB/s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   iter
------    ---------  ---------- ---------  --------   --------   --------   ----
write     8554       33554432   1024.00    0.067592   245.12     165.60     0   
read      12284      33554432   1024.00    0.008495   170.73     109.98     0   

Max Write: 8553.74 MiB/sec (8969.25 MB/sec)
Max Read:  12283.52 MiB/sec (12880.20 MB/sec)
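For reference, the runs above were launched with roughly the following IOR options; the launcher details and the output path are from memory, so treat them as approximate:

  # 32 tasks, 1 per node; POSIX API, file-per-process, 1 MiB transfers, 32 GiB per file
  mpirun -np 32 -machinefile hosts ./IOR -a POSIX -F -t 1m -b 32g -o /lustre/scratch/testFile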

