[lustre-discuss] MDT deadlocks LU-10697

Thomas Roth t.roth at gsi.de
Wed Nov 13 13:06:38 PST 2019


Indeed, that's the one I was looking for.
Interesting: why would Google give me different Jira hits when searching for the same term from two different but close-by internet entry points in
Germany, from two Firefoxes of the same version, running on Debian or Ubuntu, no JavaScript, no cookies in both browsers ... ?
Because here at home I find LU-12136 immediately ;-)

Anyhow, thanks Nathan.
So, LU-12018 could be covered by our planned upgrade to 2.12, very good.

Regards,
Thomas

On 13.11.19 17:24, Nathan Dauchy - NOAA Affiliate wrote:
> On Wed, Nov 13, 2019 at 4:28 AM Thomas Roth <t.roth at gsi.de> wrote:
> 
>> Hi all,
>>
>> we keep hitting LU-10697, which makes the users' experience quite painful.
>> There was a related issue in Lustre 2.12/2.13 which is also unresolved -
>> can't find the LU- at the moment.
>>
>>
> Thomas,
> 
> Perhaps you are looking for LU-12136?  Also in that ticket, LU-12018 is
> referenced which has a patch that _may_ reduce the likelihood of hitting
> the problem.
> 
> Regards,
> Nathan
> 
> 
>> In any case, it always looks like
>>
>> Nov 13 10:23:58 lxmds19.gsi.de kernel: Pid: 6449, comm: mdt00_095 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
>> Nov 13 10:23:58 lxmds19.gsi.de kernel: Call Trace:
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffff87786cf7>] call_rwsem_down_write_failed+0x17/0x30
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc1716f04>] lod_qos_prep_create+0xaa4/0x17f0 [lod]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc171818d>] lod_prepare_create+0x25d/0x360 [lod]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc170c9ae>] lod_declare_striped_create+0x1ee/0x970 [lod]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc170ee24>] lod_declare_create+0x1e4/0x540 [lod]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc177ab22>] mdd_declare_create_object_internal+0xe2/0x2f0 [mdd]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc176c1a3>] mdd_declare_create+0x53/0xe30 [mdd]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc1770059>] mdd_create+0x879/0x1400 [mdd]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc166acc5>] mdt_reint_open+0x2175/0x3190 [mdt]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc165fb43>] mdt_reint_rec+0x83/0x210 [mdt]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc164137b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc16418a2>] mdt_intent_reint+0x162/0x430 [mdt]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc164c681>] mdt_intent_policy+0x441/0xc70 [mdt]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc0f5d2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc0f86b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc100c4f2>] tgt_enqueue+0x62/0x210 [ptlrpc]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc101042a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc0fb8e5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc0fbc5a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffff874c1c31>] kthread+0xd1/0xe0
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffff87b74c37>] ret_from_fork_nospec_end+0x0/0x39
>> Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
>>
>>
>> and at some point the MDS gives up
>>
>> Nov 13 11:34:34 lxmds19.gsi.de kernel: LustreError: 6433:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1573640974, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-hebe-MDT0000_UUID lock: ffff996f423ad800/0xd20e202c72a0f5f4 lrc: 3/1,0 mode: --/PR res: [0x20000c8f0:0x3ad6:0x0].0x0 bits 0x13 rrc: 24 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 6433 timeout: 0 lvb_type: 0
>>
>>
>>
>> Is there a chance that these issues are fixed in 2.12/2.13?
>> There seems to be no activity at the moment in LU-10697, which is anyhow
>> from last year.
>> The Jira ticket that I can't find anymore, which reports similar issues in
>> 2.12, is at least from 2019.
>>
>>
>> Regards,
>> Thomas
>>
>>
>> --
>> --------------------------------------------------------------------
>> Thomas Roth
>> Department: Informationstechnologie
>> Location: SB3 2.291
>> Phone: +49-6159-71 1453  Fax: +49-6159-71 2986
>>
>>
>> GSI Helmholtzzentrum für Schwerionenforschung GmbH
>> Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
>>
>> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
>> Managing Directors / Geschäftsführung:
>> Professor Dr. Paolo Giubellino, Ursula Weyrich, Jörg Blaurock
>> Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
>> State Secretary / Staatssekretär Dr. Georg Schütte
>>
> 

-- 
--------------------------------------------------------------------
Thomas Roth
Department: HPC
Location: SB3 2.291
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschäftsführung: Professor Dr. Paolo Giubellino
Ursula Weyrich
Jörg Blaurock

Vorsitzender des Aufsichtsrates: St Dr. Georg Schütte
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt

