[Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.

Tue Feb 10 09:11:52 PST 2009

Oleg Drokin wrote:
>
> What would be useful here is if you can enable dlm tracing
> (echo +dlm_trace >/proc/sys/lnet/debug) on some of those 1.6 nodes
> (also if you are running with no debug enabled at all, also enable
> rpc_trace and info levels) and also enable "dump on eviction" feature.
> (echo 1 >/proc/sys/lustre/dump_on_eviction).
> Then when next eviction happens, there would be some useful debug data
> dumped on the client, that you can attach to a bugzilla bug along with
> server-side eviction message (processed with "lctl dl" command first).

OK, will do. The main problem is reproducing the error: our users have
unreasonably insisted that we run their jobs using known-good 1.4
clients, and even when I run their code on isolated test nodes
_most_ runs are fine.
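
For reference, the client-side setup suggested in the quoted advice could be
scripted roughly as follows. This is a sketch, not a tested procedure: the
function name enable_eviction_debug and the PROC_ROOT override are made up
for illustration, and the flag spellings dlmtrace and rpctrace are an
assumption about how the debug mask names these flags (the quoted message
writes them with underscores), so check them against the contents of
/proc/sys/lnet/debug on the client first.

```shell
#!/bin/sh
# Sketch only: turn on extra client-side debugging before trying to
# reproduce an eviction.
# PROC_ROOT is a hypothetical override so the function can be dry-run
# against a scratch directory instead of the real /proc.
PROC_ROOT="${PROC_ROOT:-/proc}"

enable_eviction_debug() {
    debug_file="$PROC_ROOT/sys/lnet/debug"
    dump_file="$PROC_ROOT/sys/lustre/dump_on_eviction"

    # Add each flag to the existing debug mask ("+" adds rather than replaces).
    # Flag spellings are an assumption; verify them on the client.
    for flag in dlmtrace rpctrace info; do
        echo "+$flag" > "$debug_file" || return 1
    done

    # Ask the client to dump its debug buffer when it gets evicted.
    echo 1 > "$dump_file" || return 1
}
```

On a real client this would be run as root, calling enable_eviction_debug
once on each 1.6 test node before starting the job.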

> 
>> We are also seeing some userspace file operations fail with the error
>> "No locks available". These don't generate any logging on the client so
>> I don't have exact timing. It's possible that they are associated with
>> further "### lock callback timer expired" server logs.
> 
> This error code typically means an application attempting to do some i/o
> and Lustre has no lock for the i/o area for some reason anymore (it is
> normally obtained once read or write path is entered), and that could be
> related to evictions too (locks are revoked at eviction time).

I should have mentioned that we are also seeing many errors of the form
"LustreError: 19842:0:(ldlm_lockd.c:1078:ldlm_handle_cancel()) received
cancel for unknown lock cookie." Checking back, these appear to
pre-date the introduction of 1.6 clients, and since the client upgrade I
can see them associated with both 1.4 and 1.6 clients. They may indicate
something else relevant about the filesystems or workload.
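
To put rough numbers on the two server-side messages mentioned in this
thread, something like the following could be run against a server log. It
is only a sketch: count_lock_messages is a hypothetical helper, the log file
path is whatever the caller passes in, and it assumes each message sits on a
single line in the log.

```shell
#!/bin/sh
# Sketch only: count the two lock-related messages in one log file.
# count_lock_messages is a hypothetical helper; the caller supplies the path.
count_lock_messages() {
    logfile="$1"
    # grep -c prints the number of matching lines; -F matches the text literally.
    # "|| true" stops a zero count (grep exit status 1) from looking like a failure.
    expired=$(grep -c -F 'lock callback timer expired' "$logfile" || true)
    unknown=$(grep -c -F 'cancel for unknown lock cookie' "$logfile" || true)
    echo "callback_timer_expired=$expired"
    echo "unknown_lock_cookie=$unknown"
}
```

Comparing the counts from logs taken before and after the 1.6 clients were
introduced would show whether the rate of either message changed.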

Cheers,

Simon.