[lustre-devel] Regression with delayed_work in sec_gc
Sebastien Buisson
sbuisson at ddn.com
Tue May 22 09:05:17 PDT 2018
> Le 21 mai 2018 à 03:22, NeilBrown <neilb at suse.com> a écrit :
>
> On Thu, May 17 2018, Sebastien Buisson wrote:
>
>>> Le 16 mai 2018 à 00:22, NeilBrown <neilb at suse.com> a écrit :
>>>
>>> On Tue, May 15 2018, James Simmons wrote:
>>>
>>>> Dmitry has backported your upstream commit
>>>>
>>>> be3c64da248b1ada2bbcfc3855f07536d8087acb ( ptlrpc: use workqueue for
>>>> pinger)
>>>>
>>>> and Sebastien tried out the change which can be seen at
>>>> https://review.whamcloud.com/#/c/31724. Sebastien is having issues with
>>>> unloading osc, mdc, ost or mdt kernel modules. He writes that he activated
>>>> Kerberos for my file system, did some IOs, then tried to unmount the
>>>> client and the targets. Unfortunately, on some occasions, while
>>>> unmounting, the mdc module for instance has extra references (as shown by
>>>> 'lsmod'), and cannot be unloaded. After a while, the extra references are
>>>> dropped, and the module can be unloaded. It also happens on occasions on
>>>> server side, for the mdt or ost modules. I tried destroy_workqueue but
>>>> that made things worst :-( Do you know what is going on?
>>>
>>> My guess is that ptlrpc_pinger_main() is looping, so
>>> cancel_delayed_work_sync() is waiting an unusually long time for it.
>>> This patch might fix the problem - it certainly makes the usage
>>> of "this_ping" closer to the original code.
>>>
>>> Thanks,
>>> NeilBrown
>>>
>>> diff --git a/drivers/staging/lustre/lustre/ptlrpc/pinger.c b/drivers/staging/lustre/lustre/ptlrpc/pinger.c
>>> index 89eef8ec7df4..131b201541e8 100644
>>> --- a/drivers/staging/lustre/lustre/ptlrpc/pinger.c
>>> +++ b/drivers/staging/lustre/lustre/ptlrpc/pinger.c
>>> @@ -222,12 +222,13 @@ static DECLARE_DELAYED_WORK(ping_work, ptlrpc_pinger_main);
>>>
>>> static void ptlrpc_pinger_main(struct work_struct *ws)
>>> {
>>> - unsigned long this_ping = jiffies;
>>> long time_to_next_wake;
>>> struct timeout_item *item;
>>> struct obd_import *imp;
>>>
>>> do {
>>> + unsigned long this_ping = jiffies;
>>> +
>>> mutex_lock(&pinger_mutex);
>>> list_for_each_entry(item, &timeout_list, ti_chain) {
>>> item->ti_cb(item, item->ti_cb_data);
>>
>> Still same problem when testing "ptlrpc: use delayed_work in sec_gc" with patch "ptlrpc: use workqueue for pinger" and modification as suggested above. Kernel module ptlrpc_gss holds extra references, and cannot be unloaded.
>>
>> However, patch "ptlrpc: use workqueue for pinger" alone does not have the issue.
>
> Thanks for the clear explanation - helps a lot. I see the problem - I think.
>
> When (e.g.) ctx_release_kr() is called as part of shutting down a security
> context, it calls sptlrpc_gc_add_ctx() which signals the gc-thread to
> re-run.
> None of this code is in mainline (why isn't the GSS code in
> mainline????)
> So I removed the SVC_SIGNAL flag as it was never being set.
> So I didn't add matching functionality when porting to delayed-work.
>
> To port "ptlrpc: use delayed_work in sec_gc" to upstream lustre you
> need to get sptlrpc_gc_add_ctx() to call
> mod_delayed_work(system_wq, &sec_gc_work, 0);
>
> NeilBrown
Excellent, no more pending extra references on kernel modules when calling mod_delayed_work() from sptlrpc_gc_add_ctx(), thanks!
I have updated patch at https://review.whamcloud.com/31724 to include this modification.
Cheers,
Sebastien.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: Message signed with OpenPGP
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20180522/11717952/attachment.sig>
More information about the lustre-devel
mailing list