[lustre-devel] Regression with delayed_work in sec_gc
NeilBrown
neilb at suse.com
Sun May 20 18:22:44 PDT 2018
On Thu, May 17 2018, Sebastien Buisson wrote:
>> Le 16 mai 2018 à 00:22, NeilBrown <neilb at suse.com> a écrit :
>>
>> On Tue, May 15 2018, James Simmons wrote:
>>
>>> Dmitry has backported your upstream commit
>>>
>>> be3c64da248b1ada2bbcfc3855f07536d8087acb ( ptlrpc: use workqueue for
>>> pinger)
>>>
>>> and Sebastien tried out the change which can be seen at
>>> https://review.whamcloud.com/#/c/31724. Sebastien is having issues with
>>> unloading osc, mdc, ost or mdt kernel modules. He writes that he activated
>>> Kerberos for my file system, did some IOs, then tried to unmount the
>>> client and the targets. Unfortunately, on some occasions, while
>>> unmounting, the mdc module for instance has extra references (as shown by
>>> 'lsmod'), and cannot be unloaded. After a while, the extra references are
>>> dropped, and the module can be unloaded. It also happens on occasions on
>>> server side, for the mdt or ost modules. I tried destroy_workqueue but
>>> that made things worst :-( Do you know what is going on?
>>
>> My guess is that ptlrpc_pinger_main() is looping, so
>> cancel_delayed_work_sync() is waiting an unusually long time for it.
>> This patch might fix the problem - it certainly makes the usage
>> of "this_ping" closer to the original code.
>>
>> Thanks,
>> NeilBrown
>>
>> diff --git a/drivers/staging/lustre/lustre/ptlrpc/pinger.c b/drivers/staging/lustre/lustre/ptlrpc/pinger.c
>> index 89eef8ec7df4..131b201541e8 100644
>> --- a/drivers/staging/lustre/lustre/ptlrpc/pinger.c
>> +++ b/drivers/staging/lustre/lustre/ptlrpc/pinger.c
>> @@ -222,12 +222,13 @@ static DECLARE_DELAYED_WORK(ping_work, ptlrpc_pinger_main);
>>
>> static void ptlrpc_pinger_main(struct work_struct *ws)
>> {
>> - unsigned long this_ping = jiffies;
>> long time_to_next_wake;
>> struct timeout_item *item;
>> struct obd_import *imp;
>>
>> do {
>> + unsigned long this_ping = jiffies;
>> +
>> mutex_lock(&pinger_mutex);
>> list_for_each_entry(item, &timeout_list, ti_chain) {
>> item->ti_cb(item, item->ti_cb_data);
>
> Still same problem when testing "ptlrpc: use delayed_work in sec_gc" with patch "ptlrpc: use workqueue for pinger" and modification as suggested above. Kernel module ptlrpc_gss holds extra references, and cannot be unloaded.
>
> However, patch "ptlrpc: use workqueue for pinger" alone does not have the issue.
Thanks for the clear explanation - helps a lot. I see the problem - I think.
When (e.g.) ctx_release_kr() is called as part of shutting down a security
context, it calls sptlrpc_gc_add_ctx() which signals the gc-thread to
re-run.
None of this code is in mainline (why isn't the GSS code in
mainline????)
So I removed the SVC_SIGNAL flag as it was never being set.
So I didn't add matching functionality when porting to delayed-work.
To port "ptlrpc: use delayed_work in sec_gc" to upstream lustre you
need to get sptlrpc_gc_add_ctx() to call
mod_delayed_work(system_wq, &sec_gc_work, 0);
NeilBrown
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20180521/a1104c70/attachment.sig>
More information about the lustre-devel
mailing list