[lustre-devel] [PATCH v2] lustre: mdc: fix possible deadlock in chlg_open()

NeilBrown neilb at suse.com
Tue Nov 6 16:43:02 PST 2018


On Wed, Nov 07 2018, NeilBrown wrote:

> On Tue, Nov 06 2018, quentin.bouget at cea.fr wrote:
>
>> Le 06/11/2018 à 04:11, NeilBrown a écrit :
>>> On Mon, Nov 05 2018, quentin.bouget at cea.fr wrote:
>>>
>>>> Le 04/11/2018 à 22:34, James Simmons a écrit :
>>>>>> Lockdep reports a possible deadlock between chlg_open() and
>>>>>> mdc_changelog_cdev_init()
>>>>>>
>>>>>> mdc_changelog_cdev_init() takes chlg_registered_dev_lock and then
>>>>>> calls misc_register() which takes misc_mtx.
>>>>>> chlg_open() is called while misc_mtx is held, and tries to take
>>>>>> chlg_registered_dev_lock.
>>>>>> If these two functions race, a deadlock can occur as each thread will
>>>>>> hold one of the locks while trying to take the other.
>>>>>>
>>>>>> chlg_open() does not need to take a lock.  It only uses the
>>>>>> lock to stablize a list while looking for the matching
>>>>>> chlg_registered_dev, and this can be found directly by examining
>>>>>> file->private_data.
>>>>>>
>>>>>> So remove chlg_obd_get(), and use file->private_data to find the
>>>>>> obd_device.
>>>>>> Also ensure the device is fully initialized before calling
>>>>>> misc_register().  This means setting up some list linkage before the
>>>>>> call, and tearing it down if there is an error.
>>>>> I have been testing this but I'm no HSM expert. I pushed this patch
>>>>> to OpenSFS branch as well.
>>>>>
>>>>> https://jira.whamcloud.com/browse/LU-11617
>>>>> https://review.whamcloud.com/#/c/33572/
>>>>>
>>>>> Reviewed-by: James Simmons <jsimmons at infradead.org>
>>>> Reviewed-by: Quentin Bouget <quentin.bouget at cea.fr>
>>>>
>>> Thanks to you both for the review.
>>>
>>> NeilBrown
>>>
>> Wait! I just realised there might be another issue!
>> I think there is now a race between chlg_open() and 
>> mdc_changelog_cdev_finish().
>
> Hmmm.. yes.  Also another deadlock possibility as the locks can be taken
> in the wrong order here too.

Sorry, I got that wrong - there is no other deadlock.
mdc_changelog_cdev_finish() can call misc_deregister() while holding
chlg_registered_dev_lock, and misc_deregister() will take misc_mtx, but
that is the right way around, not deadlock-causing.

NeilBrown

>
>>
>> Wait! I just realised there might be another bigger issue!
>> The whole "take the first obd you can find" is broken! I opened a ticket 
>> <https://jira.whamcloud.com/browse/LU-11626> on whamcloud's JIRA about it.
>
> Yep...
> My guess is that chlg_load(), chlg_clear() and (possibly)
> chlg_read_cat_process_cb() should take the mutex, choose an obd, used
> it, and release the mutex.
>
> I might post an RFC patch.
>
> NeilBrown
>
>
>>
>> Quentin Bouget
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 832 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20181107/fe44f28c/attachment.sig>


More information about the lustre-devel mailing list