[Lustre-discuss] I/O error on clients

Christopher J. Morrone morrone2 at llnl.gov
Tue Jul 6 17:12:44 PDT 2010


On 07/05/2010 11:19 PM, Peter Kitchener wrote:
> Hi all,
>
> I have been troubleshooting a strange problem with our Lustre setup. Under high load, our developers are complaining that various processes they run fail with an I/O error.
>
> Our setup is small: 1 MDS, 2 OSSes (10 OSTs, 5 per OSS), and 13 clients (152 cores). The storage is all local, 60TB usable (30TB per OSS), in a RAID6 software RAID setup. All of the machines are connected via 10 Gigabit Ethernet. The clients run Rocks 5.3 (CentOS 5.4) and the servers run CentOS 5.4 with kernel 2.6.18-164.11.1.el5_lustre.1.8.2. The clients run an unpatched vanilla kernel from CentOS and Lustre 1.8.3.
>
> So far I've not been able to pinpoint where I should begin to look. I have been trawling through log files that, quite frankly, don't make much sense to me.
>
> Here is the messages output from the OSS
>
> ##############################
>
> Jul  6 14:57:11 helium kernel: Lustre: AC3-OST0005: haven't heard from client ce1a3eb7-8514-d16e-4050-0507e82f1116 (at 172.16.16.125@tcp) in 227 seconds. I think it's dead, and I am evicting it.

There is a bug in Lustre 1.8.2 and 1.8.3 that makes ptlrpcd get
stuck for long periods of time (around 10 minutes was the longest that I
saw) on Lustre clients under certain workloads.  If ptlrpcd is
stuck, the client may stop sending RPCs to the servers altogether, and
the servers evict the client because they haven't heard from it in a
while.

See bug 22897 for a description of the bug.  The fix is a simple
one-liner in bug 22786, attachment 29866, and it will first appear in
Lustre 1.8.4.  I would highly recommend that anyone using 1.8.2 or 1.8.3
apply that patch.

I don't know if that is the cause of your particular evictions, because
there can be many causes of evictions.  But the "haven't heard from
client ... in 227 seconds" message was one of the symptoms, and the
client failing with -107 (ENOTCONN) on multiple OSTs (and/or MDS,
MGS...) at the same time was another symptom.
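
If it helps to check how widespread the first symptom is, here is a rough
Python sketch that pulls eviction messages out of a server's syslog and
groups them by client.  The pattern is modeled only on the single log line
quoted above, so the exact wording on other Lustre versions is an
assumption, and find_evictions and targets_by_client are made-up helper
names, not part of any Lustre tool.

```python
import re
from collections import defaultdict

# Modeled on the one OSS log line quoted in this thread (assumption: other
# Lustre versions may word the message differently).
EVICTION_RE = re.compile(
    r"Lustre: (?P<target>\S+): haven't heard from client (?P<uuid>\S+) "
    r"\(at (?P<nid>[^)]+)\) in (?P<secs>\d+) seconds"
)

def find_evictions(lines):
    """Return (target, client_uuid, nid, seconds) for each matching line."""
    hits = []
    for line in lines:
        m = EVICTION_RE.search(line)
        if m:
            hits.append((m["target"], m["uuid"], m["nid"], int(m["secs"])))
    return hits

def targets_by_client(hits):
    """Map each client NID to the set of targets that reported it silent."""
    grouped = defaultdict(set)
    for target, _uuid, nid, _secs in hits:
        grouped[nid].add(target)
    return dict(grouped)

# Sample input: the line from the original report, with the NID written in
# the usual addr@network form.
sample = [
    "Jul  6 14:57:11 helium kernel: Lustre: AC3-OST0005: haven't heard "
    "from client ce1a3eb7-8514-d16e-4050-0507e82f1116 "
    "(at 172.16.16.125@tcp) in 227 seconds. I think it's dead, and I am "
    "evicting it.",
]

hits = find_evictions(sample)
for nid, targets in targets_by_client(hits).items():
    print(nid, sorted(targets))
```

A client that shows up against several OSTs within the same minute or so
would be consistent with the stuck-ptlrpcd pattern, though it does not
prove it.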

Chris


