[lustre-devel] Regression with delayed_work in sec_gc

Sebastien Buisson sbuisson at ddn.com
Tue May 22 09:05:17 PDT 2018


> Le 21 mai 2018 à 03:22, NeilBrown <neilb at suse.com> a écrit :
> 
> On Thu, May 17 2018, Sebastien Buisson wrote:
> 
>>> Le 16 mai 2018 à 00:22, NeilBrown <neilb at suse.com> a écrit :
>>> 
>>> On Tue, May 15 2018, James Simmons wrote:
>>> 
>>>> Dmitry has backported your upstream commit
>>>> 
>>>> be3c64da248b1ada2bbcfc3855f07536d8087acb ( ptlrpc: use workqueue for
>>>> pinger)
>>>> 
>>>> and Sebastien tried out the change which can be seen at
>>>> https://review.whamcloud.com/#/c/31724. Sebastien is having issues with
>>>> unloading osc, mdc, ost or mdt kernel modules. He writes that he activated
>>>> Kerberos for my file system, did some IOs, then tried to unmount the
>>>> client and the  targets. Unfortunately, on some occasions, while
>>>> unmounting, the mdc module for instance has extra references (as shown by
>>>> 'lsmod'), and cannot be unloaded. After a while, the extra references are
>>>> dropped, and the  module can be unloaded. It also happens on occasions on
>>>> server side, for the mdt or ost modules. I tried destroy_workqueue but
>>>> that made things worst :-( Do you know what is going on?
>>> 
>>> My guess is that ptlrpc_pinger_main() is looping, so
>>> cancel_delayed_work_sync() is waiting an unusually long time for it.
>>> This patch might fix the problem - it certainly makes the usage
>>> of "this_ping" closer to the original code.
>>> 
>>> Thanks,
>>> NeilBrown
>>> 
>>> diff --git a/drivers/staging/lustre/lustre/ptlrpc/pinger.c b/drivers/staging/lustre/lustre/ptlrpc/pinger.c
>>> index 89eef8ec7df4..131b201541e8 100644
>>> --- a/drivers/staging/lustre/lustre/ptlrpc/pinger.c
>>> +++ b/drivers/staging/lustre/lustre/ptlrpc/pinger.c
>>> @@ -222,12 +222,13 @@ static DECLARE_DELAYED_WORK(ping_work, ptlrpc_pinger_main);
>>> 
>>> static void ptlrpc_pinger_main(struct work_struct *ws)
>>> {
>>> -	unsigned long this_ping = jiffies;
>>> 	long time_to_next_wake;
>>> 	struct timeout_item *item;
>>> 	struct obd_import *imp;
>>> 
>>> 	do {
>>> +		unsigned long this_ping = jiffies;
>>> +
>>> 		mutex_lock(&pinger_mutex);
>>> 		list_for_each_entry(item, &timeout_list, ti_chain) {
>>> 			item->ti_cb(item, item->ti_cb_data);
>> 
>> Still same problem when testing "ptlrpc: use delayed_work in sec_gc" with patch "ptlrpc: use workqueue for pinger" and modification as suggested above. Kernel module ptlrpc_gss holds extra references, and cannot be unloaded.
>> 
>> However, patch "ptlrpc: use workqueue for pinger" alone does not have the issue.
> 
> Thanks for the clear explanation - helps a lot.  I see the problem - I think.
> 
> When (e.g.) ctx_release_kr() is called as part of shutting down a security
> context, it calls sptlrpc_gc_add_ctx() which signals the gc-thread to
> re-run.
> None of this code is in mainline (why isn't the GSS code in
> mainline????)
> So I removed the SVC_SIGNAL flag as it was never being set.
> So I didn't add matching functionality when porting to delayed-work.
> 
> To port  "ptlrpc: use delayed_work in sec_gc" to upstream lustre you
> need to get sptlrpc_gc_add_ctx() to call
>  	mod_delayed_work(system_wq, &sec_gc_work, 0);
> 
> NeilBrown

Excellent, no more pending extra references on kernel modules when calling mod_delayed_work() from sptlrpc_gc_add_ctx(), thanks!

I have updated patch at https://review.whamcloud.com/31724 to include this modification.

Cheers,
Sebastien.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: Message signed with OpenPGP
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20180522/11717952/attachment.sig>


More information about the lustre-devel mailing list