[Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.
Simon Kelley
simon at thekelleys.org.uk
Tue Feb 10 09:11:52 PST 2009
Oleg Drokin wrote:
>
> What would be useful here is if you can enable dlm tracing
> (echo +dlm_trace >/proc/sys/lnet/debug) on some of those 1.6 nodes
> (also if you are running with no debug enabled at all, also enable
> rpc_trace and info levels) and also enable "dump on eviction" feature.
> (echo 1 >/proc/sys/lustre/dump_on_eviction).
> Then when next eviction happens, there would be some useful debug data
> dumped on the client, that you can attach to a bugzilla bug along with
> server-side eviction message (processed with "lctl dl" command first).
OK, will do. The main problem is reproducing the error: our users have
unreasonably insisted that we run their jobs using known-good 1.4
clients, and even if I grab their code to run on isolated test nodes,
_most_ runs are fine.
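
For reference, the steps Oleg describes could be wrapped in a small
script along these lines. This is only a sketch: the function name
enable_eviction_debug and the PROC_ROOT override are hypothetical (the
override exists just so the function can be dry-run against a scratch
directory), and the flag names are copied verbatim from the quoted
message. Their exact spelling has not been checked against a running 1.6
client, so it is worth reading the current contents of the debug file
before relying on them.

```shell
#!/bin/sh
# Sketch: enable the client-side tracing described in the quoted message.
# PROC_ROOT is a hypothetical override for dry runs; on a real client it
# stays at /proc and the function must be run as root.
PROC_ROOT="${PROC_ROOT:-/proc}"

enable_eviction_debug() {
    debug_file="$PROC_ROOT/sys/lnet/debug"
    dump_file="$PROC_ROOT/sys/lustre/dump_on_eviction"

    # Flag names as written in the quoted message (spelling unverified).
    # A leading "+" asks for the flag to be added to the current mask.
    for flag in dlm_trace rpc_trace info; do
        echo "+$flag" > "$debug_file" || return 1
    done

    # Ask the client to dump its debug buffer when it gets evicted.
    echo 1 > "$dump_file" || return 1
}
```

On a 1.6 client this would be sourced and then called as
enable_eviction_debug; a non-zero return means one of the writes was
rejected, which would point at a misspelled flag or a missing /proc entry.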
>
>> We are also seeing some userspace file operations fail with the error
>> "No locks available". These don't generate any logging on the client
>> so I don't have exact timing. It's possible that they are associated
>> with further "### lock callback timer expired" server logs.
>
> This error code typically means an application attempting to do some
> i/o and Lustre has no lock for the i/o area for some reason anymore
> (it is normally obtained once read or write path is entered), and that
> could be related to evictions too (locks are revoked at eviction time).
I should have mentioned that we are also seeing many errors of the form
"LustreError: 19842:0:(ldlm_lockd.c:1078:ldlm_handle_cancel()) received
cancel for unknown lock cookie." Checking back, these would seem to
pre-date the introduction of 1.6 clients and even after we upgraded
clients I can see them associated with both 1.4 and 1.6 clients. They
may indicate something else relevant about the filesystems or workload.
Cheers,
Simon.