[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?

Rick Wagner rpwagner at sdsc.edu
Mon Jul 11 15:39:34 PDT 2011


Hi,

We are seeing intermittent client evictions on a new Lustre installation that we are testing. The errors occur on writes from a parallel job running on 32 client nodes, each with 16 tasks writing a single HDF5 file of ~40MB (512 tasks total). Occasionally, one node will be evicted from an OST, and the code running on that client will experience an I/O error.

The directory with the data has a stripe count of 1, and a comparable amount of data is read in at the start of the job. Sometimes the evictions occur the first time a write is attempted, sometimes after a successful write. There are about 15 minutes between the first and subsequent write attempts.

The client and server errors are attached. In the server errors, XXX.XXX.118.141 refers to the client that gets evicted. In the client errors, here are the server names to match with the NIDs:
  lustre-oss-0-2: 172.25.33.248
  lustre-oss-2-0: 172.25.33.246
  lustre-oss-2-2: 172.25.32.118
I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
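
For reference, here is a minimal Python sketch of that lookup. It assumes, as stated above, that the logged return codes are negated Linux errno.h values. The names LINUX_ERRNO and decode_rc are hypothetical and used only for illustration; the two table entries are the standard Linux values, hard-coded because other platforms number these errors differently.

```python
# Linux errno values (asm-generic/errno.h). Other platforms number these
# differently, so the table is hard-coded rather than read from the local OS.
LINUX_ERRNO = {
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    113: ("EHOSTUNREACH", "No route to host"),
}


def decode_rc(rc):
    """Return (name, message) for a negative kernel-style return code."""
    return LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))


# Decode the two return codes mentioned in this message.
for rc in (-113, -107):
    name, message = decode_rc(rc)
    print(f"{rc}: {name} ({message})")
```

On a Linux host, Python's errno.errorcode dictionary gives the same names without a hand-written table.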

We've been experiencing similar problems for a while, although we've never seen ordinary IP traffic have a problem. Clients will begin to have trouble communicating with the Lustre servers (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.

The filesystem is in testing, so there is no other load on it, and when watching the load during writes, the OSS machines hardly notice. The servers are running version 1.8.5, and the clients 1.8.4.

Any advice or pointers to possible bugs would be appreciated.

Thanks,
Rick

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: server-errors.txt
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20110711/39358645/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: client.txt
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20110711/39358645/attachment-0001.txt>

