[Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.

Simon Kelley simon at thekelleys.org.uk
Tue Feb 10 09:46:01 PST 2009


Oleg Drokin wrote:
> Hello!
> 
> On Feb 10, 2009, at 12:11 PM, Simon Kelley wrote:
>>>> We are also seeing some userspace file operations fail with the error
>>>> "No locks available". These don't generate any logging on the client so
>>>> I don't have exact timing. It's possible that they are associated with
>>>> further "### lock callback timer expired" server logs.
>>> This error code typically means an application is attempting to do some
>>> i/o and Lustre no longer has a lock for the i/o area for some reason (it
>>> is normally obtained once the read or write path is entered), and that
>>> could be related to evictions too (locks are revoked at eviction time).
>> I should have mentioned that we are also seeing many errors of the form
>> "LustreError: 19842:0:(ldlm_lockd.c:1078:ldlm_handle_cancel()) received
>> cancel for unknown lock cookie." Checking back, these would seem to
>> pre-date the introduction of 1.6 clients and even after we upgraded
>> clients I can see them associated with both 1.4 and 1.6 clients. They may
>> indicate something else relevant about the filesystems or workload.
> 
> Hm, that means clients hold some locks that the server does not believe
> they have, which is pretty strange.
> Or it just does not recognize the lock released by the client and later
> releases the client.
> If you have a complete kernel log of the event, that might be useful to
> see the sequence of events.

If, by "the complete event", you mean the "received cancel for unknown 
lock cookie" messages, there's not much more to tell. Grepping through 
the last month's server logs shows bursts of typically between 3 and 7 
messages, all at the same time and from the same client. After a gap, 
the same thing happens but from a different client. The number can be as 
low as one, and up to ten. At a guess, they look to be related to client 
workload.
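
A rough sketch of that burst counting, for anyone wanting to repeat it on 
their own logs. It assumes the timestamp and client have already been 
pulled out of each matching log line as (seconds, client) pairs; the 
extraction step is left out because the exact line layout depends on the 
local syslog setup. The names group_bursts, Burst and max_gap are made up 
for illustration and are not part of Lustre.

```python
from collections import namedtuple

# Hypothetical record type for one burst of messages from a single client.
Burst = namedtuple("Burst", ["client", "start", "end", "count"])


def group_bursts(events, max_gap=1.0):
    """Group (timestamp_seconds, client) events into per-client bursts.

    A burst is a run of events from one client in which each event follows
    the previous event from that client by at most max_gap seconds.
    """
    bursts = []
    latest = {}  # client -> index in `bursts` of that client's latest burst
    for ts, client in sorted(events):
        idx = latest.get(client)
        if idx is not None and ts - bursts[idx].end <= max_gap:
            # Still inside the client's current burst: extend it.
            b = bursts[idx]
            bursts[idx] = b._replace(end=ts, count=b.count + 1)
        else:
            # First event from this client, or the gap was too long.
            latest[client] = len(bursts)
            bursts.append(Burst(client, ts, ts, 1))
    return bursts


if __name__ == "__main__":
    # Made-up sample data, not taken from real server logs.
    sample = [(0.0, "clientA"), (0.2, "clientA"), (0.4, "clientA"),
              (100.0, "clientB"), (100.1, "clientB"), (500.0, "clientA")]
    for burst in group_bursts(sample):
        print(burst.client, burst.count)
```

The max_gap threshold is a judgment call: one second suits messages that 
arrive "at the same time", but a larger value may be needed if the server 
logs them over a longer window.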

Picking a few events and looking at the client logfile for the same time 
gives absolutely nothing at all.

> I assume you do not have a flaky network and your clients do not reconnect
> all the time to the servers with messages in logs like 'changed handle from
> X to Y; copying, but this may foreshadow disaster', which would be a
> different bug no longer present in 1.6.6, too.
> 

None of that applies.


Simon.



