[lustre-discuss] Journaling deadlock pattern on MDS for Lustre 2.15.5, 2.15.6 and 2.15.7 with RHEL 8.10 (ldiskfs setup)
Georgios Magklaras
georgiosm at met.no
Tue Aug 5 03:34:07 PDT 2025
Hi,
We would like to report (what we believe is) a journaling deadlock
pattern in our MDS setup that is crippling our production. We have seen
it on different hardware (Dell and Supermicro servers) with different
storage controllers, all of which have been thoroughly checked at the
hardware and firmware level. We have reproduced it consistently with the
following software stack combinations:
Kernels:
4.18.0-553.5.1.el8_lustre.x86_64, 4.18.0-553.27.1.el8_lustre.x86_64
and 4.18.0-553.53.1.el8_lustre.x86_64
Lustre versions: 2.15.5, 2.15.6 and 2.15.7 (the bug exists in all versions)
DRBD kmod: 9.1.23
What we are hoping to achieve with this message is:
-To find out whether anyone has seen this bug with a similar software
stack.
-To gauge whether our plan to upgrade the MDS server to RHEL 9 and Lustre
2.16.1 (5.14.0-427.31.1_lustre.el9.x86_64 with the latest kmod-drbd,
9.2.14) is going to fix the issue.
We have also reported the issue to RedHat (case number 04215952).
A bit more about our setup and the problem with some vmcore analysis that
we did with the RedHat storage and kernel engineers:
Our system is ldiskfs based and we are using DRBD to replicate the metadata
targets. We have done that successfully for a number of years. Here is what
our DRBD setup looks like:
[root@lustrea-mds-nx10077836-an ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 446.6G 0 disk
├─sda1 8:1 0 600M 0 part /boot/efi
├─sda2 8:2 0 4G 0 part /boot
└─sda3 8:3 0 249G 0 part
├─rhel-root 253:0 0 150G 0 lvm /
├─rhel-swap 253:1 0 20G 0 lvm [SWAP]
├─rhel-backup 253:2 0 47G 0 lvm /backup
└─rhel-var 253:3 0 32G 0 lvm /var
sdb 8:16 0 17.5T 0 disk
├─sdb1 8:17 0 1.8T 0 part
│ └─drbd0 147:0 0 1.8T 0 disk /lustre/mgs
├─sdb2 8:18 0 3.7T 0 part
│ └─drbd1 147:1 0 3.7T 0 disk /lustre/mdt0
└─sdb3 8:19 0 3.7T 0 part
└─drbd2 147:2 0 3.7T 0 disk /lustre/mdt1
The problem is that I/O blocks on either drbd1 (mdt0) or drbd2 (mdt1),
with messages like these:
Jul 17 07:24:19 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
ldlm_cn19_000:17761 blocked for more than 120 seconds.
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
mdt21_011:19591 blocked for more than 120 seconds.
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
mdt21_010:19584 blocked for more than 120 seconds.
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
mdt21_009:19574 blocked for more than 120 seconds.
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
mdt21_008:19564 blocked for more than 120 seconds.
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
jbd2/drbd1-8:17843 blocked for more than 120 seconds.
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
ldlm_cn19_002:17763 blocked for more than 120 seconds.
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task
ldlm_cn19_000:17761 blocked for more than 120 seconds.
At that point the DRBD processes remain blocked until we power-cycle the
server. It is not possible to cleanly unmount the metadata target on the
blocked drbd device; we can unmount the MGS (drbd0) and the other drbd
device fine (only one blocks at a time). Having to power-cycle the server
causes production outages and fsck operations that create serious problems
for our operational environment. The frequency of the problem varies: it
can be anything from 12 hours to several days for some filesystems, and
over a month for other filesystems (with the same software stack).
We (together with RedHat kernel engineers) do not believe DRBD is to blame
here. In fact, we have gone as far as to run the active MDS in standalone
mode (no DRBD replication) and we still triggered the deadlock. We
suspect something happens in:
qsd_reserve_or_free_quota
↳ osd_reserve_or_free_quota
↳ ldiskfs_sync_fs
↳ jbd2_transaction_committed
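For illustration, the shape we believe we are seeing (and which the vmcore
backtraces further down seem to show) is a journal commit callback that
synchronously waits for the very commit it is running inside of, so the
commit sequence can never advance. Here is a minimal userspace model of
that pattern; it is purely our sketch with made-up names, not the kernel
code, and it deliberately hangs:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static unsigned int commit_sequence;      /* like journal->j_commit_sequence */

/* like jbd2_log_wait_commit(): sleep until 'tid' has committed */
static void log_wait_commit(unsigned int tid)
{
	pthread_mutex_lock(&lock);
	while ((int)(tid - commit_sequence) > 0)
		pthread_cond_wait(&done, &lock);  /* our j_wait_done_commit */
	pthread_mutex_unlock(&lock);
}

/* like tgt_cb_last_committed -> ... -> osd_sync/ldiskfs_sync_fs:
 * a commit-time callback that synchronously waits for a commit */
static void commit_callback(unsigned int tid)
{
	printf("callback: waiting for tid %u to commit\n", tid);
	log_wait_commit(tid);             /* waits for the commit we are inside of */
}

/* like kjournald2/jbd2_journal_commit_transaction */
static void *journal_thread(void *arg)
{
	unsigned int tid = commit_sequence + 1;  /* the transaction being committed */

	(void)arg;
	commit_callback(tid);             /* deadlock: never returns ...            */

	pthread_mutex_lock(&lock);        /* ... so this point is never reached     */
	commit_sequence = tid;
	pthread_cond_broadcast(&done);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, journal_thread, NULL);
	pthread_join(&t, NULL);           /* hangs, like our MDT/LDLM threads do */
	return 0;
}

In the real system the other waiters on j_wait_done_commit are the MDT/LDLM
service threads, and the "journal thread" is kjournald2 for drbd1/drbd2.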
Thanks for any hints you can provide!
A little bit of kdump/vmcore analysis reveals a deadlock between jbd2 and
the way Lustre is using it. What we see in the ldiskfs layer is a class of
threads (MDT and LDLM?) that are stuck. Every one of them ends up blocking
here when ldiskfs_sync_fs is called. See below:
--
crash> ps -m | grep UN
[0 00:19:19.605] [UN] PID: 464849 TASK: ff4103d58655c000 CPU: 38
COMMAND: "ldlm_cn19_006"
[0 00:19:56.741] [UN] PID: 55781 TASK: ff4103b341b68000 CPU: 10
COMMAND: "mdt05_001"
[0 00:21:32.753] [UN] PID: 125286 TASK: ff4103c86207c000 CPU: 74
COMMAND: "mdt13_009"
[0 00:22:56.640] [UN] PID: 55720 TASK: ff4103b2fd31c000 CPU: 38
COMMAND: "ldlm_cn19_000"
[0 00:26:43.200] [UN] PID: 57585 TASK: ff4103b373fcc000 CPU: 40
COMMAND: "ldlm_cn20_004"
[0 00:27:22.124] [UN] PID: 56529 TASK: ff4103b427844000 CPU: 4
COMMAND: "jbd2/drbd2-8"
[0 00:27:23.085] [UN] PID: 57687 TASK: ff4103b29eae0000 CPU: 37
COMMAND: "mdt18_012"
[0 00:27:25.832] [UN] PID: 57639 TASK: ff4103b3d2318000 CPU: 77
COMMAND: "mdt14_013"
[0 00:27:26.014] [UN] PID: 57703 TASK: ff4103b2ddb58000 CPU: 84
COMMAND: "mdt18_013"
crash> set ff4103b2ddb58000
PID: 57703
COMMAND: "mdt18_013"
TASK: ff4103b2ddb58000 [THREAD_INFO: ff4103b2ddb58000]
CPU: 84
STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 57703 TASK: ff4103b2ddb58000 CPU: 84 COMMAND: "mdt18_013"
#0 [ff75fb8e88fe7bc0] __schedule at ffffffff8881dd01
#1 [ff75fb8e88fe7c18] schedule at ffffffff8881e2f5
#2 [ff75fb8e88fe7c30] __pfx_jbd2_transaction_committed at ffffffffc02f3c3c
[jbd2]
#3 [ff75fb8e88fe7c98] ldiskfs_sync_file at ffffffffc16174f5 [ldiskfs]
#4 [ff75fb8e88fe7cd8] osd_object_sync at ffffffffc1698378 [osd_ldiskfs]
#5 [ff75fb8e88fe7d00] mdt_object_sync at ffffffffc18537aa [mdt]
#6 [ff75fb8e88fe7d20] mdt_sync at ffffffffc1863c63 [mdt]
#7 [ff75fb8e88fe7d50] tgt_request_handle at ffffffffc11f82dd [ptlrpc]
#8 [ff75fb8e88fe7dd0] ptlrpc_server_handle_request at ffffffffc11a3723
[ptlrpc]
#9 [ff75fb8e88fe7e38] ptlrpc_main at ffffffffc11a521c [ptlrpc]
#10 [ff75fb8e88fe7f10] kthread at ffffffff87f20e24
#11 [ff75fb8e88fe7f50] ret_from_fork at ffffffff88a0028f
crash> whatis __pfx_jbd2_transaction_committed
<text variable, no debug info> __pfx_jbd2_transaction_committed;
crash> bt
PID: 57703 TASK: ff4103b2ddb58000 CPU: 84 COMMAND: "mdt18_013"
#0 [ff75fb8e88fe7bc0] __schedule at ffffffff8881dd01
#1 [ff75fb8e88fe7c18] schedule at ffffffff8881e2f5
#2 [ff75fb8e88fe7c30] jbd2_log_wait_commit at ffffffffc02f3c3c [jbd2]
#3 [ff75fb8e88fe7c98] ldiskfs_sync_file at ffffffffc16174f5 [ldiskfs]
#4 [ff75fb8e88fe7cd8] osd_object_sync at ffffffffc1698378 [osd_ldiskfs]
#5 [ff75fb8e88fe7d00] mdt_object_sync at ffffffffc18537aa [mdt]
#6 [ff75fb8e88fe7d20] mdt_sync at ffffffffc1863c63 [mdt]
#7 [ff75fb8e88fe7d50] tgt_request_handle at ffffffffc11f82dd [ptlrpc]
#8 [ff75fb8e88fe7dd0] ptlrpc_server_handle_request at ffffffffc11a3723
[ptlrpc]
#9 [ff75fb8e88fe7e38] ptlrpc_main at ffffffffc11a521c [ptlrpc]
#10 [ff75fb8e88fe7f10] kthread at ffffffff87f20e24
#11 [ff75fb8e88fe7f50] ret_from_fork at ffffffff88a0028f
crash> sym jbd2_log_wait_commit
ffffffffc02f3b90 (T) jbd2_log_wait_commit [jbd2]
/usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/fs/jbd2/journal.c:
681
Now the backtrace looks much better:
crash> dis -r ffffffffc02f3c3c | tail -10
0xffffffffc02f3c1e <jbd2_log_wait_commit+0x8e>: mov %r12d,%eax
0xffffffffc02f3c21 <jbd2_log_wait_commit+0x91>: sub 0x3c0(%rbx),%eax
0xffffffffc02f3c27 <jbd2_log_wait_commit+0x97>: test %eax,%eax
0xffffffffc02f3c29 <jbd2_log_wait_commit+0x99>: jle 0xffffffffc02f3be8
<jbd2_log_wait_commit+0x58>
0xffffffffc02f3c2b <jbd2_log_wait_commit+0x9b>: xor %esi,%esi
0xffffffffc02f3c2d <jbd2_log_wait_commit+0x9d>: mov %rsp,%rdi
0xffffffffc02f3c30 <jbd2_log_wait_commit+0xa0>: call 0xffffffff87f4e510
<init_wait_entry>
0xffffffffc02f3c35 <jbd2_log_wait_commit+0xa5>: jmp 0xffffffffc02f3c3c
<jbd2_log_wait_commit+0xac>
0xffffffffc02f3c37 <jbd2_log_wait_commit+0xa7>: call 0xffffffff8881e2a0
<schedule>
0xffffffffc02f3c3c <jbd2_log_wait_commit+0xac>: mov $0x2,%edx
crash>
crash> dis -r ffffffffc16174f5 | tail
0xffffffffc16174d6 <ldiskfs_sync_file+0x346>: jmp 0xffffffffc16173f3
<ldiskfs_sync_file+0x263>
0xffffffffc16174db <ldiskfs_sync_file+0x34b>: mov %r13d,%esi
0xffffffffc16174de <ldiskfs_sync_file+0x34e>: mov %r15,%rdi
0xffffffffc16174e1 <ldiskfs_sync_file+0x351>: call 0xffffffffc02f41a0
<jbd2_trans_will_send_data_barrier>
0xffffffffc16174e6 <ldiskfs_sync_file+0x356>: mov %r13d,%esi
0xffffffffc16174e9 <ldiskfs_sync_file+0x359>: mov %r15,%rdi
0xffffffffc16174ec <ldiskfs_sync_file+0x35c>: test %eax,%eax
0xffffffffc16174ee <ldiskfs_sync_file+0x35e>: je 0xffffffffc16174fd
<ldiskfs_sync_file+0x36d>
0xffffffffc16174f0 <ldiskfs_sync_file+0x360>: call 0xffffffffc02f5a40
<jbd2_complete_transaction>
0xffffffffc16174f5 <ldiskfs_sync_file+0x365>: mov %eax,%r12d
crash> dis -r ffffffffc02f3c3c | head
0xffffffffc02f3b90 <jbd2_log_wait_commit>: nopl 0x0(%rax,%rax,1) [FTRACE
NOP]
0xffffffffc02f3b95 <jbd2_log_wait_commit+0x5>: push %r15
0xffffffffc02f3b97 <jbd2_log_wait_commit+0x7>: push %r14
0xffffffffc02f3b99 <jbd2_log_wait_commit+0x9>: mov $0xfffffe00,%r14d
0xffffffffc02f3b9f <jbd2_log_wait_commit+0xf>: push %r13
0xffffffffc02f3ba1 <jbd2_log_wait_commit+0x11>: push %r12
0xffffffffc02f3ba3 <jbd2_log_wait_commit+0x13>: mov %esi,%r12d
0xffffffffc02f3ba6 <jbd2_log_wait_commit+0x16>: push %rbp
0xffffffffc02f3ba7 <jbd2_log_wait_commit+0x17>: lea 0x44(%rdi),%rbp
0xffffffffc02f3bab <jbd2_log_wait_commit+0x1b>: push %rbx
crash> journal_t.j_wait_commit ff4103b2f19c7800
j_wait_commit = {
lock = {
{
rlock = {
raw_lock = {
{
val = {
counter = 0x0
},
{
locked = 0x0,
pending = 0x0
},
{
locked_pending = 0x0,
tail = 0x0
}
}
}
}
}
},
head = {
next = 0xff4103b2f19c78c0,
prev = 0xff4103b2f19c78c0
}
},
So ldiskfs is using jbd2 for journaling.
TID: 000000001c0d1446
crash> journal_t.j_commit_sequence ff4103b2f19c7800
j_commit_sequence = 0x1c0d1445,
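Note that the blocked TID is exactly one ahead of j_commit_sequence
(0x1c0d1446 vs 0x1c0d1445): everything is waiting for the transaction that
is currently being committed. jbd2 compares TIDs with wrap-safe signed
arithmetic; a small sketch of that check (our own helper for illustration,
not a verbatim copy of the kernel's tid_gt()):

/* Wrap-safe "tid has not committed yet" check, in the spirit of jbd2's
 * tid_gt(); our own illustration, not copied from the kernel source. */
static inline int tid_still_pending(unsigned int tid, unsigned int commit_seq)
{
	return (int)(tid - commit_seq) > 0;
}

/* In this vmcore: tid_still_pending(0x1c0d1446, 0x1c0d1445) == 1, so every
 * waiter on j_wait_done_commit stays asleep until commit 0x1c0d1446
 * completes -- which it never does here. */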
crash> waitq journal_t.j_wait_done_commit ff4103b2f19c7800
PID: 464849 TASK: ff4103d58655c000 CPU: 38 COMMAND: "ldlm_cn19_006"
PID: 55781 TASK: ff4103b341b68000 CPU: 10 COMMAND: "mdt05_001"
PID: 125286 TASK: ff4103c86207c000 CPU: 74 COMMAND: "mdt13_009"
PID: 55720 TASK: ff4103b2fd31c000 CPU: 38 COMMAND: "ldlm_cn19_000"
PID: 57585 TASK: ff4103b373fcc000 CPU: 40 COMMAND: "ldlm_cn20_004"
PID: 57687 TASK: ff4103b29eae0000 CPU: 37 COMMAND: "mdt18_012"
PID: 57639 TASK: ff4103b3d2318000 CPU: 77 COMMAND: "mdt14_013"
PID: 57703 TASK: ff4103b2ddb58000 CPU: 84 COMMAND: "mdt18_013"
PID: 56529 TASK: ff4103b427844000 CPU: 4 COMMAND: "jbd2/drbd2-8"
---
crash> bt ff4103b427844000
PID: 56529 TASK: ff4103b427844000 CPU: 4 COMMAND: "jbd2/drbd2-8"
#0 [ff75fb8e7c333968] __schedule at ffffffff8881dd01
#1 [ff75fb8e7c3339c0] schedule at ffffffff8881e2f5
#2 [ff75fb8e7c3339d8] jbd2_log_wait_commit at ffffffffc02f3c3c [jbd2]
#3 [ff75fb8e7c333a40] ldiskfs_sync_fs at ffffffffc1606840 [ldiskfs]
#4 [ff75fb8e7c333a78] osd_sync at ffffffffc16972df [osd_ldiskfs]
#5 [ff75fb8e7c333a98] qsd_acquire at ffffffffc156d56c [lquota]
#6 [ff75fb8e7c333b18] qsd_op_begin0 at ffffffffc156e045 [lquota]
#7 [ff75fb8e7c333b98] qsd_reserve_or_free_quota at ffffffffc156ffd5
[lquota]
#8 [ff75fb8e7c333bc8] osd_reserve_or_free_quota at ffffffffc1696c86
[osd_ldiskfs]
#9 [ff75fb8e7c333c10] tgt_cb_last_committed at ffffffffc11e3512 [ptlrpc]
#10 [ff75fb8e7c333c78] osd_trans_commit_cb at ffffffffc1699376 [osd_ldiskfs]
#11 [ff75fb8e7c333cc0] ldiskfs_journal_commit_callback at ffffffffc1607b85
[ldiskfs]
#12 [ff75fb8e7c333cf8] jbd2_journal_commit_transaction at ffffffffc02ef960
[jbd2]
#13 [ff75fb8e7c333ea0] kjournald2 at ffffffffc02f3fed [jbd2]
#14 [ff75fb8e7c333f10] kthread at ffffffff87f20e24
#15 [ff75fb8e7c333f50] ret_from_fork at ffffffff88a0028f
[293564.548588] Lustre: Skipped 1 previous similar message
[293564.548588] Pid: 57703, comm: mdt18_013
4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP Fri Jun 28 18:44:24 UTC 2024
[293564.548591] Call Trace TBD:
[293564.548607] [<0>] jbd2_log_wait_commit+0xac/0x120 [jbd2]
[293564.548615] [<0>] ldiskfs_sync_file+0x365/0x3b0 [ldiskfs]
[293564.548634] [<0>] osd_object_sync+0x188/0x1e0 [osd_ldiskfs]
[293564.548645] [<0>] mdt_object_sync.isra.69+0x7a/0x3b0 [mdt]
[293564.548664] [<0>] mdt_sync+0xf3/0x320 [mdt]
[293564.548681] [<0>] tgt_request_handle+0xccd/0x1a20 [ptlrpc]
[293564.548733] [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[293564.548780] [<0>] ptlrpc_main+0xbec/0x1530 [ptlrpc]
[293564.548823] [<0>] kthread+0x134/0x150
[293564.548826] [<0>] ret_from_fork+0x1f/0x40
[293564.548830] Pid: 57639, comm: mdt14_013
4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP Fri Jun 28 18:44:24 UTC 2024
[293564.548835] Call Trace TBD:
[293564.548856] [<0>] jbd2_log_wait_commit+0xac/0x120 [jbd2]
[293564.548864] [<0>] ldiskfs_sync_fs+0x170/0x1d0 [ldiskfs]
[293564.548881] [<0>] osd_sync+0xdf/0x180 [osd_ldiskfs]
[293564.548888] INFO: task jbd2/drbd2-8:56529 blocked for more than 120
seconds.
[293564.548894] Tainted: G OE -------- - -
4.18.0-553.5.1.el8_lustre.x86_64 #1
[293564.549271] [<0>] qsd_acquire+0x46c/0xde0 [lquota]
[293564.550413] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[293564.550412] [<0>] qsd_op_begin0+0x165/0x8a0 [lquota]
[293564.550642] task:jbd2/drbd2-8 state:D stack:0 pid:56529 ppid:2
flags:0x80004080
[293564.550641] [<0>] qsd_reserve_or_free_quota+0x2e5/0x7d0 [lquota]
[293564.550648] Call Trace:
[293564.550652] __schedule+0x2d1/0x870
[293564.550652] [<0>] osd_reserve_or_free_quota+0xe6/0x1c0 [osd_ldiskfs]
[293564.550659] ? __wake_up_common_lock+0x89/0xc0
[293564.550663] schedule+0x55/0xf0
[293564.550663] [<0>] dt_reserve_or_free_quota.constprop.45+0x2c/0xd0 [mdd]
[293564.550667] jbd2_log_wait_commit+0xac/0x120 [jbd2]
[293564.550675] ? finish_wait+0x80/0x80
[293564.550675] [<0>] mdd_attr_set+0xa70/0x1020 [mdd]
[293564.550679] ldiskfs_sync_fs+0x170/0x1d0 [ldiskfs]
[293564.550685] [<0>] mdt_reint_setattr+0xea8/0x1720 [mdt]
[293564.550695] osd_sync+0xdf/0x180 [osd_ldiskfs]
[293564.550708] qsd_acquire+0x46c/0xde0 [lquota]
[293564.550707] [<0>] mdt_reint_rec+0x11f/0x270 [mdt]
[293564.550723] qsd_op_begin0+0x165/0x8a0 [lquota]
[293564.550725] [<0>] mdt_reint_internal+0x4d0/0x7e0 [mdt]
[293564.550733] ? lfsck_key_init+0x1c/0x130 [lfsck]
[293564.550747] ? srso_alias_return_thunk+0x5/0xfcdfd
[293564.550743] [<0>] mdt_reint+0x5d/0x110 [mdt]
[293564.550749] ? kmem_cache_alloc_trace+0x142/0x280
[293564.550754] qsd_reserve_or_free_quota+0x2e5/0x7d0 [lquota]
[293564.550764] osd_reserve_or_free_quota+0xe6/0x1c0 [osd_ldiskfs]
[293564.550775] ? lu_context_init+0xa5/0x1b0 [obdclass]
[293564.550760] [<0>] tgt_request_handle+0xccd/0x1a20 [ptlrpc]
[293564.550807] tgt_cb_last_committed+0xc2/0x5c0 [ptlrpc]
[293564.550817] [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[293564.550863] osd_trans_commit_cb+0xe6/0x300 [osd_ldiskfs]
[293564.550875] ldiskfs_journal_commit_callback+0xa5/0xe0 [ldiskfs]
[293564.550865] [<0>] ptlrpc_main+0xbec/0x1530 [ptlrpc]
[293564.550888] jbd2_journal_commit_transaction+0x1690/0x19f0 [jbd2]
[293564.550897] ? __try_to_del_timer_sync+0x5a/0xa0
[293564.550900] kjournald2+0xbd/0x270 [jbd2]
[293564.550904] ? finish_wait+0x80/0x80
[293564.550907] ? commit_timeout+0x10/0x10 [jbd2]
[293564.550908] [<0>] kthread+0x134/0x150
[293564.550911] kthread+0x134/0x150
[293564.550912] [<0>] ret_from_fork+0x1f/0x40
[293564.550915] ? set_kthread_struct+0x50/0x50
[293564.550918] ret_from_fork+0x1f/0x40
So __pfx_jbd2_transaction_committed should map to jbd2_transaction_committed:
PID: 56529 TASK: ff4103b427844000 CPU: 4 COMMAND: "jbd2/drbd2-8"
#0 [ff75fb8e7c333968] __schedule at ffffffff8881dd01
#1 [ff75fb8e7c3339c0] schedule at ffffffff8881e2f5
#2 [ff75fb8e7c3339d8] __pfx_jbd2_transaction_committed at ffffffffc02f3c3c
[jbd2]
#3 [ff75fb8e7c333a40] ldiskfs_sync_fs at ffffffffc1606840 [ldiskfs]
#4 [ff75fb8e7c333a78] osd_sync at ffffffffc16972df [osd_ldiskfs]
#5 [ff75fb8e7c333a98] qsd_acquire at ffffffffc156d56c [lquota]
#6 [ff75fb8e7c333b18] qsd_op_begin0 at ffffffffc156e045 [lquota]
#7 [ff75fb8e7c333b98] qsd_reserve_or_free_quota at ffffffffc156ffd5
[lquota]
#8 [ff75fb8e7c333bc8] osd_reserve_or_free_quota at ffffffffc1696c86
[osd_ldiskfs]
#9 [ff75fb8e7c333c10] tgt_cb_last_committed at ffffffffc11e3512 [ptlrpc]
#10 [ff75fb8e7c333c78] osd_trans_commit_cb at ffffffffc1699376 [osd_ldiskfs]
#11 [ff75fb8e7c333cc0] ldiskfs_journal_commit_callback at ffffffffc1607b85
[ldiskfs]
#12 [ff75fb8e7c333cf8] __pfx_jbd2_journal_commit_transaction at
ffffffffc02ef960 [jbd2]
#13 [ff75fb8e7c333ea0] kjournald2 at ffffffffc02f3fed [jbd2]
#14 [ff75fb8e7c333f10] kthread at ffffffff87f20e24
#15 [ff75fb8e7c333f50] ret_from_fork at ffffffff88a0028f
Focusing on osd_sync -> ldiskfs_sync_fs -> __pfx_jbd2_transaction_committed
to see why we stalled
/* Return 1 when transaction with given tid has already committed. */
int jbd2_transaction_committed(journal_t *journal, tid_t tid)
{
	int ret = 1;

	read_lock(&journal->j_state_lock);   /* <-- seems to be a rwlock_t j_state_lock */
	if (journal->j_running_transaction &&
	    journal->j_running_transaction->t_tid == tid)
		ret = 0;
	if (journal->j_committing_transaction &&
	    journal->j_committing_transaction->t_tid == tid)
		ret = 0;
	read_unlock(&journal->j_state_lock);
	return ret;
}
EXPORT_SYMBOL(jbd2_transaction_committed);
What does not make sense is how we get to schedule() in each case if it
really is jbd2_transaction_committed; then maybe here:
0xffffffffc16174a5 <ldiskfs_sync_file+789>: mov %r12,%rdi
0xffffffffc16174a8 <ldiskfs_sync_file+792>: mov %r14,%r15
0xffffffffc16174ab <ldiskfs_sync_file+795>: call 0xffffffff881a4170
<sync_inode_metadata>
0xffffffffc16174b0 <ldiskfs_sync_file+800>: mov %eax,%r13d
0xffffffffc16174b3 <ldiskfs_sync_file+803>: test %eax,%eax
0xffffffffc16174b5 <ldiskfs_sync_file+805>: jne 0xffffffffc161752e
<ldiskfs_sync_file+926>
0xffffffffc16174b7 <ldiskfs_sync_file+807>: mov -0x98(%r12),%rax
0xffffffffc16174bf <ldiskfs_sync_file+815>: movabs $0x10000000000,%rsi
0xffffffffc16174c9 <ldiskfs_sync_file+825>: test %rsi,%rax
0xffffffffc16174cc <ldiskfs_sync_file+828>: jne 0xffffffffc161746a
<ldiskfs_sync_file+730>
0xffffffffc16174ce <ldiskfs_sync_file+830>: mov %r15,%rdi
0xffffffffc16174d1 <ldiskfs_sync_file+833>: call 0xffffffff8818a140 <dput>
0xffffffffc16174d6 <ldiskfs_sync_file+838>: jmp 0xffffffffc16173f3
<ldiskfs_sync_file+611>
0xffffffffc16174db <ldiskfs_sync_file+843>: mov %r13d,%esi
0xffffffffc16174de <ldiskfs_sync_file+846>: mov %r15,%rdi
0xffffffffc16174e1 <ldiskfs_sync_file+849>: call 0xffffffffc02f41a0
<jbd2_trans_will_send_data_barrier>
0xffffffffc16174e6 <ldiskfs_sync_file+854>: mov %r13d,%esi
0xffffffffc16174e9 <ldiskfs_sync_file+857>: mov %r15,%rdi
0xffffffffc16174ec <ldiskfs_sync_file+860>: test %eax,%eax
0xffffffffc16174ee <ldiskfs_sync_file+862>: je 0xffffffffc16174fd
<ldiskfs_sync_file+877>
0xffffffffc16174f0 <ldiskfs_sync_file+864>: call 0xffffffffc02f5a40
<__pfx_jbd2_journal_start_commit>
0xffffffffc16174f5 <ldiskfs_sync_file+869>: mov %eax,%r12d
called here
0xffffffffc16174f0 <ldiskfs_sync_file+864>: call 0xffffffffc02f5a40
<__pfx_jbd2_journal_start_commit>
0xffffffffc16174f5 <ldiskfs_sync_file+869>: mov %eax,%r12d
0xffffffffc16174f8 <ldiskfs_sync_file+872>: jmp 0xffffffffc16172e7
<ldiskfs_sync_file+343>
0xffffffffc16174fd <ldiskfs_sync_file+877>: call 0xffffffffc02f5a40
<__pfx_jbd2_journal_start_commit>
0xffffffffc1617502 <ldiskfs_sync_file+882>: mov %eax,%r13d
So stuck in this loop:
0xffffffffc02f3c20 <__pfx_jbd2_transaction_committed>: loopne
0xffffffffc02f3c4d <__pfx_jbd2_transaction_committed+45>
0xffffffffc02f3c22 <__pfx_jbd2_transaction_committed+2>: add $0x3,%eax
0xffffffffc02f3c25 <__pfx_jbd2_transaction_committed+5>: add %al,(%rax)
0xffffffffc02f3c27 <__pfx_jbd2_transaction_committed+7>: test %eax,%eax
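If the address actually belongs to jbd2_log_wait_commit+0xac, as the sym
output earlier indicates, then the loop we are stuck in would be its
commit-wait loop. Roughly paraphrased from our reading of
fs/jbd2/journal.c (not a verbatim quote), it matches the sub/jle and
init_wait_entry/schedule sequence in the disassembly:

/* Paraphrased sketch of the wait loop in jbd2_log_wait_commit();
 * not a verbatim quote of fs/jbd2/journal.c. */
while (tid_gt(tid, journal->j_commit_sequence)) {
	read_unlock(&journal->j_state_lock);
	wake_up(&journal->j_wait_commit);
	wait_event(journal->j_wait_done_commit,
		   !tid_gt(tid, journal->j_commit_sequence));
	read_lock(&journal->j_state_lock);
}

If that is right, it would also explain why nothing ever wakes up: the
wakeup on j_wait_done_commit only happens once jbd2_journal_commit_transaction
finishes, and here that thread is itself stuck inside its commit callback.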
rest of the data
----------------
crash> bt -l
PID: 56529 TASK: ff4103b427844000 CPU: 4 COMMAND: "jbd2/drbd2-8"
#0 [ff75fb8e7c333968] __schedule at ffffffff8881dd01
/usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/kernel/sched/core.c:
3804
#1 [ff75fb8e7c3339c0] schedule at ffffffff8881e2f5
/usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/./arch/x86/include/asm/current.h:
15
#2 [ff75fb8e7c3339d8] __pfx_jbd2_transaction_committed at ffffffffc02f3c3c
[jbd2]
/usr/src/debug/kernel-5.14.0-427.31.1.el9_4/linux-5.14.0-427.31.1_lustre.el9.x86_64/./include/trace/events/jbd2.h:
462
#3 [ff75fb8e7c333a40] ldiskfs_sync_fs at ffffffffc1606840 [ldiskfs]
/usr/src/debug/lustre-2.15.5-1.el8.x86_64/ldiskfs/super.c: 5310
#4 [ff75fb8e7c333a78] osd_sync at ffffffffc16972df [osd_ldiskfs]
/usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/osd-ldiskfs/osd_handler.c:
2497
#5 [ff75fb8e7c333a98] qsd_acquire at ffffffffc156d56c [lquota]
/usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/quota/qsd_internal.h:
217
#6 [ff75fb8e7c333b18] qsd_op_begin0 at ffffffffc156e045 [lquota]
/usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/quota/qsd_handler.c:
751
#7 [ff75fb8e7c333b98] qsd_reserve_or_free_quota at ffffffffc156ffd5
[lquota]
/usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/quota/qsd_handler.c:
1358
#8 [ff75fb8e7c333bc8] osd_reserve_or_free_quota at ffffffffc1696c86
[osd_ldiskfs]
/usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/include/lustre_quota.h: 276
#9 [ff75fb8e7c333c10] tgt_cb_last_committed at ffffffffc11e3512 [ptlrpc]
#10 [ff75fb8e7c333c78] osd_trans_commit_cb at ffffffffc1699376 [osd_ldiskfs]
/usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/osd-ldiskfs/osd_handler.c:
1795
#11 [ff75fb8e7c333cc0] ldiskfs_journal_commit_callback at ffffffffc1607b85
[ldiskfs]
/var/lib/jenkins/workspace/lustre-b2_15/arch/x86_64/build_type/server/distro/el8.10/ib_stack/inkernel/BUILD/reused/usr/src/kernels/4.18.0-553.5.1.el8_lustre.x86_64/include/linux/spinlock.h:
350
#12 [ff75fb8e7c333cf8] __pfx_jbd2_journal_commit_transaction at
ffffffffc02ef960 [jbd2]
/usr/src/debug/kernel-5.14.0-427.31.1.el9_4/linux-5.14.0-427.31.1_lustre.el9.x86_64/fs/jbd2/commit.c:
926
#13 [ff75fb8e7c333ea0] kjournald2 at ffffffffc02f3fed [jbd2]
/usr/src/debug/kernel-5.14.0-427.31.1.el9_4/linux-5.14.0-427.31.1_lustre.el9.x86_64/./include/trace/events/jbd2.h:
14
#14 [ff75fb8e7c333f10] kthread at ffffffff87f20e24
/usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/kernel/kthread.c:
364
#15 [ff75fb8e7c333f50] ret_from_fork at ffffffff88a0028f
/usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/arch/x86/entry/entry_64.S:
322
Trying to sync to drbd2
crash> super_block ff4103b252bc8000
struct super_block {
s_list = {
next = 0xff4103b218029800,
prev = 0xff4103b252bcc000
},
s_dev = 154140674,
s_blocksize_bits = 12 '\f',
s_blocksize = 4096,
s_maxbytes = 17592186040320,
s_type = 0xffffffffc16615e0 <ldiskfs_fs_type>,
s_op = 0xffffffffc1632d60 <ldiskfs_sops>,
dq_op = 0xffffffffc1632ee0 <ldiskfs_quota_operations>,
s_qcop = 0xffffffff88e47160 <dquot_quotactl_sysfile_ops>,
s_export_op = 0xffffffffc1632d00 <ldiskfs_export_ops>,
s_flags = 1879113728,
s_iflags = 9,
s_magic = 61267,
s_root = 0xff4103b48587d380,
s_umount = {
count = {
counter = 1536
},
..
..
s_id = "drbd2\0"
There are no pending requests, so we are not waiting in the block layer:
147 ff4103b215d2e000 drbd2 ff4103b27ddddf58 0 0 0
The first PID to hang was this one:
PID: 57703 TASK: ff4103b2ddb58000 CPU: 84 COMMAND: "mdt18_013"
#0 [ff75fb8e88fe7bc0] __schedule at ffffffff8881dd01
/usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/kernel/sched/core.c:
3804
#1 [ff75fb8e88fe7c18] schedule at ffffffff8881e2f5
/usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/./arch/x86/include/asm/current.h:
15
#2 [ff75fb8e88fe7c30] __pfx_jbd2_transaction_committed at ffffffffc02f3c3c
[jbd2]
/usr/src/debug/kernel-5.14.0-427.31.1.el9_4/linux-5.14.0-427.31.1_lustre.el9.x86_64/./include/trace/events/jbd2.h:
462
#3 [ff75fb8e88fe7c98] ldiskfs_sync_file at ffffffffc16174f5 [ldiskfs]
/usr/src/debug/lustre-2.15.5-1.el8.x86_64/ldiskfs/fsync.c: 152
#4 [ff75fb8e88fe7cd8] osd_object_sync at ffffffffc1698378 [osd_ldiskfs]
/usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/osd-ldiskfs/osd_handler.c:
5108
#5 [ff75fb8e88fe7d00] mdt_object_sync at ffffffffc18537aa [mdt]
/usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/include/md_object.h:
564
#6 [ff75fb8e88fe7d20] mdt_sync at ffffffffc1863c63 [mdt]
/usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/mdt/mdt_handler.c: 3121
#7 [ff75fb8e88fe7d50] tgt_request_handle at ffffffffc11f82dd [ptlrpc]
#8 [ff75fb8e88fe7dd0] ptlrpc_server_handle_request at ffffffffc11a3723
[ptlrpc]
#9 [ff75fb8e88fe7e38] ptlrpc_main at ffffffffc11a521c [ptlrpc]
#10 [ff75fb8e88fe7f10] kthread at ffffffff87f20e24
/usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/kernel/kthread.c:
364
#11 [ff75fb8e88fe7f50] ret_from_fork at ffffffff88a0028f
/usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/arch/x86/entry/entry_64.S:
322
crash> file ff4103cd3f0e9900
struct file {
f_u = {
fu_llist = {
next = 0x0
},
fu_rcuhead = {
next = 0x0,
func = 0x0
}
},
f_path = {
mnt = 0xff4103b427901ca0,
dentry = 0xff4103d887835140
},
f_inode = 0xff4103cddbaf3e58,
f_op = 0xffffffffc1631540 <ldiskfs_file_operations>,
f_lock = {
crash> super_block 0xff4103b252bc8000 | grep s_id
s_id =
"drbd2\000s\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
Best regards,
GM
--
*Georgios Magklaras PhD*
Chief Engineer
IT Infrastructure/HPC
The Norwegian Meteorological Institute
https://www.met.no/
https://www.steelcyber.com/georgioshome/