[lustre-discuss] MDT deadlocks LU-10697

Thomas Roth t.roth at gsi.de
Wed Nov 13 03:28:42 PST 2019


Hi all,

We keep hitting LU-10697, which makes the users' experience quite painful.
There was a related issue in Lustre 2.12/2.13 which is also unresolved; I can't find the LU number at the moment.

In any case, the stack trace always looks like this:

Nov 13 10:23:58 lxmds19.gsi.de kernel: Pid: 6449, comm: mdt00_095 3.10.0-957.el7_lustre.x86_64 #1 SMP Wed Dec 12 15:03:08 UTC 2018
Nov 13 10:23:58 lxmds19.gsi.de kernel: Call Trace:
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffff87786cf7>] call_rwsem_down_write_failed+0x17/0x30
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc1716f04>] lod_qos_prep_create+0xaa4/0x17f0 [lod]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc171818d>] lod_prepare_create+0x25d/0x360 [lod]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc170c9ae>] lod_declare_striped_create+0x1ee/0x970 [lod]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc170ee24>] lod_declare_create+0x1e4/0x540 [lod]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc177ab22>] mdd_declare_create_object_internal+0xe2/0x2f0 [mdd]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc176c1a3>] mdd_declare_create+0x53/0xe30 [mdd]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc1770059>] mdd_create+0x879/0x1400 [mdd]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc166acc5>] mdt_reint_open+0x2175/0x3190 [mdt]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc165fb43>] mdt_reint_rec+0x83/0x210 [mdt]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc164137b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc16418a2>] mdt_intent_reint+0x162/0x430 [mdt]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc164c681>] mdt_intent_policy+0x441/0xc70 [mdt]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc0f5d2ba>] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc0f86b53>] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc100c4f2>] tgt_enqueue+0x62/0x210 [ptlrpc]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc101042a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc0fb8e5b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffc0fbc5a2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffff874c1c31>] kthread+0xd1/0xe0
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffff87b74c37>] ret_from_fork_nospec_end+0x0/0x39
Nov 13 10:23:58 lxmds19.gsi.de kernel:  [<ffffffffffffffff>] 0xffffffffffffffff


and at some point the MDS gives up:

Nov 13 11:34:34 lxmds19.gsi.de kernel: LustreError: 6433:0:(ldlm_request.c:130:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1573640974, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-hebe-MDT0000_UUID lock: ffff996f423ad800/0xd20e202c72a0f5f4 lrc: 3/1,0 mode: --/PR res: [0x20000c8f0:0x3ad6:0x0].0x0 bits 0x13 rrc: 24 type: IBT flags: 0x40210000000000 nid: local remote: 0x0 expref: -99 pid: 6433 timeout: 0 lvb_type: 0
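
For anyone who wants to check their own MDS logs for the same pattern, here is a rough Python sketch. It is only an illustration under assumptions: the function names and regular expressions are made up for this example and are not part of Lustre, and it assumes that a thread whose two topmost frames are call_rwsem_down_write_failed and lod_qos_prep_create is showing the symptom above. It also tolerates log lines that the mailer has wrapped.

```python
import re

# Header of a dumped thread, e.g. "Pid: 6449, comm: mdt00_095 ..."
PID_RE = re.compile(r"Pid: (\d+), comm: (\S+)")
# A stack frame, e.g. "lod_qos_prep_create+0xaa4/0x17f0"
FRAME_RE = re.compile(r"(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# The ldlm_expired_completion_wait() message; field order as in the log above
TIMEOUT_RE = re.compile(
    r"lock timed out \(enqueued at (\d+), (\d+)s ago\).*?ns: (\S+).*?pid: (\d+)"
)


def blocked_threads(lines, top="call_rwsem_down_write_failed",
                    caller="lod_qos_prep_create"):
    """Return sorted (pid, comm) pairs whose two topmost frames are top, caller."""
    traces = {}
    current = None
    for line in lines:
        header = PID_RE.search(line)
        if header:
            current = (int(header.group(1)), header.group(2))
            traces[current] = []
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            traces[current].append(frame.group(1))
    return sorted(key for key, frames in traces.items()
                  if frames[:2] == [top, caller])


def parse_lock_timeout(message):
    """Extract enqueue time, wait time, namespace and pid, or None if no match."""
    text = " ".join(message.split())  # undo line wrapping
    match = TIMEOUT_RE.search(text)
    if match is None:
        return None
    return {
        "enqueued_at": int(match.group(1)),
        "waited_s": int(match.group(2)),
        "namespace": match.group(3),
        "pid": int(match.group(4)),
    }
```

Feeding the trace above to blocked_threads() should list the affected mdt service threads, and parse_lock_timeout() gives the wait time per timeout message, which makes it easier to count how often this happens.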



Is there a chance that these issues are fixed in 2.12/2.13?
LU-10697, which dates from last year anyway, seems to have no activity at the moment.
The Jira ticket reporting similar issues in 2.12, which I can no longer find, is from 2019 at least.


Regards,
Thomas


-- 
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 2.291
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986


GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Ursula Weyrich, Jörg Blaurock
Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
State Secretary / Staatssekretär Dr. Georg Schütte

