[Lustre-discuss] I/O error on clients

Gabriele Paciucci paciucci at gmail.com
Wed Jul 7 01:04:04 PDT 2010


Hi,
the ptlrpc bug is a problem, but I don't find in Peter's logs any 
reference to an eviction caused by ptlrpc; what I see instead is a 
timeout during the communication between an OST and the client. Still, 
Peter could downgrade to 1.8.1.1, which does not suffer from the problem.
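
To see how often, and for which clients, the OSS reports this kind of 
timeout, the server logs can be scanned for the eviction message. This is 
only a minimal sketch: the pattern is modeled on the single OSS line quoted 
further down in this thread, and the names EVICTION_RE and find_evictions 
are hypothetical helpers, not part of Lustre. Other Lustre versions may 
word the message differently.

```python
import re

# Pattern modeled on the one OSS message quoted in this thread (an assumption).
# The NID group accepts anything up to the closing parenthesis.
EVICTION_RE = re.compile(
    r"Lustre: (?P<target>\S+): haven't heard from client (?P<uuid>\S+) "
    r"\(at (?P<nid>[^)]+)\) in (?P<seconds>\d+) seconds"
)

def find_evictions(lines):
    """Return (target, client uuid, nid, silent seconds) for each matching line."""
    hits = []
    for line in lines:
        m = EVICTION_RE.search(line)
        if m:
            hits.append((m["target"], m["uuid"], m["nid"], int(m["seconds"])))
    return hits

# Sample line copied from the OSS log in this thread.
sample = (
    "Jul  6 14:57:11 helium kernel: Lustre: AC3-OST0005: haven't heard from "
    "client ce1a3eb7-8514-d16e-4050-0507e82f1116 (at 172.16.16.125@tcp) "
    "in 227 seconds. I think it's dead, and I am evicting it."
)
for target, uuid, nid, seconds in find_evictions([sample]):
    print(target, uuid, nid, seconds)
```

Feeding it the lines of the syslog file from each OSS and grouping the 
result by client NID should show whether one client or all of them are 
being evicted.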


My action plan could be:

1. First of all, Peter should run the same Lustre version, 1.8.3, on both 
clients and servers.

2. Second, check /proc/sys/lustre/ldlm_timeout: it should be 6 sec on the 
MDS and 20 sec on the OSS!!

3. Third: do you have enough memory on the servers for all the clients' 
locks? Please refer to: 
http://wiki.lustre.org/manual/LustreManual18_HTML/LustreProc.html#50417791_pgfId-1290875 

Normally a server can start to suffer with more than 500k locks.
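
Points 2 and 3 can be checked on each server with a few lines of Python. 
This is a rough sketch under stated assumptions: the ldlm_timeout path is 
the one given above, but the per-namespace lock_count files under 
/proc/fs/lustre/ldlm/namespaces are my assumption about the 1.8 /proc 
layout, and the function names and the "mds"/"oss" role labels are 
hypothetical.

```python
from pathlib import Path

# Values from point 2 above; the role names are just labels for this sketch.
EXPECTED_LDLM_TIMEOUT = {"mds": 6, "oss": 20}

# Rough figure from point 3 above, not a hard limit.
LOCK_WARN_THRESHOLD = 500_000

def check_ldlm_timeout(role, path="/proc/sys/lustre/ldlm_timeout"):
    """Return (current value, expected value, whether they match)."""
    current = int(Path(path).read_text().strip())
    expected = EXPECTED_LDLM_TIMEOUT[role]
    return current, expected, current == expected

def total_lock_count(root="/proc/fs/lustre/ldlm/namespaces"):
    """Sum lock_count over all LDLM namespaces (path layout is an assumption)."""
    total = 0
    for counter in Path(root).glob("*/lock_count"):
        total += int(counter.read_text().strip())
    return total

def too_many_locks(count, threshold=LOCK_WARN_THRESHOLD):
    """True when the lock total exceeds the rough 500k figure."""
    return count > threshold
```

On an OSS, check_ldlm_timeout("oss") reports whether the timeout is 20, 
and too_many_locks(total_lock_count()) tells whether the server is past 
the 500k figure.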

bye

On 07/07/2010 02:12 AM, Christopher J. Morrone wrote:
> On 07/05/2010 11:19 PM, Peter Kitchener wrote:
>    
>> Hi all,
>>
>> I have been troubleshooting a strange problem that is occurring with our Lustre setup. Under high loads our developers are complaining that various processes they run will error out with I/O error.
>>
>> Our setup is small: 1 MDS and 2 OSSes (10 OSTs, 5 per OSS), and 13 clients (152 cores). The storage is all local, 60TB (30TB/OSS) usable, in a RAID6 software RAID setup.  All of the machines are connected via 10Gig Ethernet. The clients run Rocks 5.3 (CentOS 5.4) and the servers run CentOS 5.4 with kernel 2.6.18-164.11.1.el5_lustre.1.8.2.  The clients run an unpatched vanilla kernel from CentOS and Lustre 1.8.3.
>>
>> So far I've not been able to pinpoint where I should begin to look. I have been trawling through log files that, quite frankly, don't make much sense to me.
>>
>> Here is the messages output from the OSS
>>
>> ##############################
>>
>> Jul  6 14:57:11 helium kernel: Lustre: AC3-OST0005: haven't heard from client ce1a3eb7-8514-d16e-4050-0507e82f1116 (at 172.16.16.125@tcp) in 227 seconds. I think it's dead, and I am evicting it.
>>      
> There is a bug in lustre 1.8.2 and 1.8.3 that makes the ptlrpcd get
> stuck for long periods of time (around 10 minutes was the longest that I
> saw) on lustre clients under certain work loads.  If the ptlrpcd is
> dead, the client may stop sending all RPCs to the servers, and the
> servers evict the client because they haven't heard from it in a while.
>
> See bug 22897 for a description of the bug.  But the fix is a simple
> one-liner in bug 22786, attachment 29866.  The fix will first appear in
> lustre 1.8.4.  I would highly recommend to anyone using 1.8.2 or 1.8.3
> that they add that patch.
>
> I don't know if that is the cause of your particular evictions, because
> there can be many causes of evictions.  But the "haven't heard from
> client ... in 227 seconds" was one of the symptoms, and the client
> failing with -107 (ENOTCONN) with multiple OSTs (and/or MDS, MGS...) at
> the same time was another symptom.
>
> Chris
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
>    


-- 
_Gabriele Paciucci_ http://www.linkedin.com/in/paciucci

Pursuant to legislative Decree n. 196/03 you are hereby informed that this email contains confidential information intended only for use of addressee. If you are not the addressee and have received this email by mistake, please send this email to the sender. You may not copy or disseminate this message to anyone. Thank You.



