[Lustre-devel] question about ldlm_server_glimpse_ast
jfilizetti at sms-fed.com
Thu Apr 29 19:59:42 PDT 2010
In our Lustre WAN environment we've had a link drop for an extended period
a few times, which causes problems on systems accessing data in the same
directory as the remote system that becomes unavailable. Our OSSes seem to
be stuck in a loop of ptlrpc_queue_wait called from
ldlm_server_glimpse_ast. The remote site is accessed through an LNet router
which is still available. The OSS resends requests every 7 seconds; each
send to the router succeeds, but the request subsequently times out, which
causes it to loop in ptlrpc_queue_wait.
Looking over the ldlm_server_blocking_ast and ldlm_server_completion_ast
functions, I see they set rq_no_resend = 1, but ldlm_server_glimpse_ast does
not. I'm not familiar with the locking in Lustre; is there a reason that
ldlm_server_glimpse_ast doesn't set rq_no_resend = 1? Setting it would get
rid of the loop that ptlrpc_queue_wait is stuck in until the client comes
back, but I'm not sure whether it would have other unexpected consequences.
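
To make the behaviour I'm describing concrete, here is a toy model of the
resend loop. It is not Lustre code: struct toy_request, toy_send and
toy_queue_wait are hypothetical names made up for illustration, the
no_resend field only stands in for rq_no_resend, and max_sends exists only
so the simulation terminates. It is a sketch of how I understand the loop,
assuming a timed-out request is resent unless the flag is set.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the request state; NOT the real
 * struct ptlrpc_request. */
struct toy_request {
    bool no_resend;  /* models rq_no_resend */
    int  sends;      /* number of times the request has been sent */
};

/* Models a client that is unreachable for the first `down_for` sends.
 * Returns 0 when a reply arrives, -1 when the send times out. */
static int toy_send(struct toy_request *req, int down_for)
{
    req->sends++;
    return req->sends > down_for ? 0 : -1;
}

/* Models the wait loop: on timeout, resend unless no_resend is set.
 * Returns 0 on reply, -1 if it gave up because of no_resend, and -2 if
 * the simulation hit max_sends (the "stuck in a loop" case). */
static int toy_queue_wait(struct toy_request *req, int down_for, int max_sends)
{
    for (;;) {
        if (toy_send(req, down_for) == 0)
            return 0;
        if (req->no_resend)
            return -1;
        if (req->sends >= max_sends)
            return -2;
    }
}
```

In this model, a request with no_resend set returns after a single timed-out
send, while one without it keeps resending for as long as the client stays
unreachable. Whether giving up early is actually safe for a glimpse AST is
exactly the part I can't answer.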