<div dir="ltr"><div>Hi,</div><div><br></div><div>We would like to inform about (what we believe is) a journaling deadlock pattern in our MDS setup that is crippling our production. We have seen this on different hardware (Dell and Supermicro servers) with different storage controllers that have been thoroughly checked hardware/firmware wise. We have reproduced it consistently with the following software stack combinations:</div><div><br></div><div>Kernels: 4.18.0-553.5.1.el8_lustre.x86_64, 4.18.0-553.27.1.el8_lustre.x86_64 and 4.18.0-553.53.1.el8_lustre.x86_64 </div><div>Lustre versions: 2.15.5, 2.15.6 and 2.15.7 (the bug exists in all versions)</div><div>DRBD kmod: 9.1.23</div><div><br></div><div>What we are hoping to achieve with this message is:</div><div>-To see and inform if someone has seen this bug with similar software stacks.</div><div>-To examine if our plan to upgrade the MDS server to RHEL 9 and 2.16.1 (5.14.0-427.31.1_lustre.el9.x86_64 with the latest kmod-drbd: 9.2.14) is going to fix the issue</div><div><br></div><div>We have also reported the issue to RedHat (case number 04215952) and they/we believe that </div><div><p>A bit more about our setup and the problem with some vmcore analysis that we did with the RedHat storage and kernel engineers:</p><p>Our system is ldiskfs based and we are using DRBD to replicate the metadata targets. We have done that successfully for a number of years. Here is how our DRBD setup looks like:</p><p>[root@lustrea-mds-nx10077836-an ~]# lsblk<br>NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT<br>sda               8:0    0 446.6G  0 disk <br>├─sda1            8:1    0   600M  0 part /boot/efi<br>├─sda2            8:2    0     4G  0 part /boot<br>└─sda3            8:3    0   249G  0 part <br>  ├─rhel-root   253:0    0   150G  0 lvm  /<br>  ├─rhel-swap   253:1    0    20G  0 lvm  [SWAP]<br>  ├─rhel-backup 253:2    0    47G  0 lvm  /backup<br>  └─rhel-var    253:3    0    32G  0 lvm  /var<br>sdb               8:16   0  17.5T  0 disk <br>├─sdb1            8:17   0   1.8T  0 part <br>│ └─drbd0       147:0    0   1.8T  0 disk /lustre/mgs<br>├─sdb2            8:18   0   3.7T  0 part <br>│ └─drbd1       147:1    0   3.7T  0 disk /lustre/mdt0<br>└─sdb3            8:19   0   3.7T  0 part <br>  └─drbd2       147:2    0   3.7T  0 disk /lustre/mdt1<br></p><p>In terms of the problem, what happens is that either I/O is blocked on drbd1 (mdt0) or drbd2 (mdt1) with messages like this:</p></div><span class="gmail_signature_prefix">Jul 17 07:24:19 <a href="http://lustrea-mds-nx10077836-an.int.met.no">lustrea-mds-nx10077836-an.int.met.no</a> kernel: INFO: task ldlm_cn19_000:17761 blocked for more than 120 seconds.<br>Jul 17 07:22:16 <a href="http://lustrea-mds-nx10077836-an.int.met.no">lustrea-mds-nx10077836-an.int.met.no</a> kernel: INFO: task mdt21_011:19591 blocked for more than 120 seconds.<br>Jul 17 07:22:16 <a href="http://lustrea-mds-nx10077836-an.int.met.no">lustrea-mds-nx10077836-an.int.met.no</a> kernel: INFO: task mdt21_010:19584 blocked for more than 120 seconds.<br>Jul 17 07:22:16 <a href="http://lustrea-mds-nx10077836-an.int.met.no">lustrea-mds-nx10077836-an.int.met.no</a> kernel: INFO: task mdt21_009:19574 blocked for more than 120 seconds.<br>Jul 17 07:22:16 <a href="http://lustrea-mds-nx10077836-an.int.met.no">lustrea-mds-nx10077836-an.int.met.no</a> kernel: INFO: task mdt21_008:19564 blocked for more than 120 seconds.<br>Jul 17 07:22:16 <a href="http://lustrea-mds-nx10077836-an.int.met.no">lustrea-mds-nx10077836-an.int.met.no</a> kernel: INFO: 
Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task jbd2/drbd1-8:17843 blocked for more than 120 seconds.<br>Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task ldlm_cn19_002:17763 blocked for more than 120 seconds.<br>Jul 17 07:22:16 lustrea-mds-nx10077836-an.int.met.no kernel: INFO: task ldlm_cn19_000:17761 blocked for more than 120 seconds.<br><br>At that point the DRBD processes remain blocked until we power-cycle the server. It is not possible to cleanly unmount the metadata target on the blocked drbd device. We can umount the MGS (drbd0) and the other drbd device fine (only one blocks at a time). The power cycles cause production outages and fsck runs that create serious problems for our operational environment. How often the problem occurs varies: anything from 12 hours to several days on some filesystems, and over a month on others (with the same software stack).</div><div><br></div><div>We (together with Red Hat kernel engineers) do not believe DRBD is to blame here. In fact, we have gone as far as running the active MDS in standalone mode (no DRBD replication) and still triggered the deadlock. We suspect something happens in the following path (see the sketch right after this list):</div><div>qsd_reserve_or_free_quota<br> ↳ osd_reserve_or_free_quota<br> ↳ ldiskfs_sync_fs<br> ↳ jbd2_transaction_committed<br></div>
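<div><br></div><div>To make the pattern we suspect easier to follow, here is a toy userspace sketch of it. This is entirely our own simplification for illustration; it is not Lustre, jbd2 or DRBD code, and all names in it are made up. The point is that a "commit thread" runs its commit callbacks before it advances the commit sequence and wakes waiters, and one of those callbacks waits for the very commit that is in flight, so the thread ends up waiting for itself:</div><div><pre>
#include <pthread.h>
#include <stdio.h>

/* Toy stand-in for the journal state: a commit with id "tid" is finished
 * once done_seq >= tid (roughly what j_commit_sequence tracks). */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
static unsigned int done_seq;

/* Rough analogue of jbd2_log_wait_commit(): sleep until commit tid is done. */
static void toy_log_wait_commit(unsigned int tid)
{
        pthread_mutex_lock(&lock);
        while (done_seq < tid)
                pthread_cond_wait(&done, &lock);
        pthread_mutex_unlock(&lock);
}

/* Rough analogue of the commit callback chain we suspect
 * (ldiskfs_journal_commit_callback -> ... -> qsd_acquire -> osd_sync ->
 * ldiskfs_sync_fs): it ends up waiting for the in-flight commit itself. */
static void toy_commit_callback(unsigned int committing_tid)
{
        toy_log_wait_commit(committing_tid);    /* never returns */
}

/* Rough analogue of kjournald2()/jbd2_journal_commit_transaction(). */
static void *toy_commit_thread(void *arg)
{
        unsigned int tid = 1;                   /* the commit in flight */

        (void)arg;
        /* ... journal blocks for tid get written out here ... */
        toy_commit_callback(tid);               /* callbacks run first ...   */

        pthread_mutex_lock(&lock);              /* ... and only afterwards   */
        done_seq = tid;                         /* is the sequence advanced  */
        pthread_cond_broadcast(&done);          /* and are waiters woken     */
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, toy_commit_thread, NULL);
        pthread_join(&t, NULL);                 /* hangs forever */
        puts("never reached");
        return 0;
}
</pre></div><div>If this is indeed what happens on the MDT, every other thread that syncs the same journal (the mdt and ldlm_cn service threads in the hung task messages) queues up behind the stuck commit thread, which matches what we see.</div>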
<div><br></div><div>Thanks for any hints you can provide!</div><div><div><br></div><div>A little bit of kdump/vmcore analysis reveals a deadlock between jbd2 and the way Lustre uses it. What we see in the ldiskfs layer is a class of threads (MDT and LDLM?) that are stuck. Every one of them ends up blocked here when ldiskfs_sync_fs is called. See below:</div><div><br></div><div><div dir="ltr"><div dir="ltr"><div>crash> ps -m | grep UN</div>[0 00:19:19.605] [UN]  PID: 464849   TASK: ff4103d58655c000  CPU: 38   COMMAND: "ldlm_cn19_006"<br>[0 00:19:56.741] [UN]  PID: 55781    TASK: ff4103b341b68000  CPU: 10   COMMAND: "mdt05_001"<br>[0 00:21:32.753] [UN]  PID: 125286   TASK: ff4103c86207c000  CPU: 74   COMMAND: "mdt13_009"<br>[0 00:22:56.640] [UN]  PID: 55720    TASK: ff4103b2fd31c000  CPU: 38   COMMAND: "ldlm_cn19_000"<br>[0 00:26:43.200] [UN]  PID: 57585    TASK: ff4103b373fcc000  CPU: 40   COMMAND: "ldlm_cn20_004"<br>[0 00:27:22.124] [UN]  PID: 56529    TASK: ff4103b427844000  CPU: 4    COMMAND: "jbd2/drbd2-8"<br>[0 00:27:23.085] [UN]  PID: 57687    TASK: ff4103b29eae0000  CPU: 37   COMMAND: "mdt18_012"<br>[0 00:27:25.832] [UN]  PID: 57639    TASK: ff4103b3d2318000  CPU: 77   COMMAND: "mdt14_013"<br>[0 00:27:26.014] [UN]  PID: 57703    TASK: ff4103b2ddb58000  CPU: 84   COMMAND: "mdt18_013"<br><br><br>crash> set ff4103b2ddb58000<br>    PID: 57703<br>COMMAND: "mdt18_013"<br>   TASK: ff4103b2ddb58000  [THREAD_INFO: ff4103b2ddb58000]<br>    CPU: 84<br>  STATE: TASK_UNINTERRUPTIBLE <br>crash> bt<br>PID: 57703    TASK: ff4103b2ddb58000  CPU: 84   COMMAND: "mdt18_013"<br> #0 [ff75fb8e88fe7bc0] __schedule at ffffffff8881dd01<br> #1 [ff75fb8e88fe7c18] schedule at ffffffff8881e2f5<br> #2 [ff75fb8e88fe7c30] __pfx_jbd2_transaction_committed at ffffffffc02f3c3c [jbd2]<br> #3 [ff75fb8e88fe7c98] ldiskfs_sync_file at ffffffffc16174f5 [ldiskfs]<br> #4 [ff75fb8e88fe7cd8] osd_object_sync at ffffffffc1698378 [osd_ldiskfs]<br> #5 [ff75fb8e88fe7d00] mdt_object_sync at ffffffffc18537aa [mdt]<br> #6 [ff75fb8e88fe7d20] mdt_sync at ffffffffc1863c63 [mdt]<br> #7 [ff75fb8e88fe7d50] tgt_request_handle at ffffffffc11f82dd [ptlrpc]<br> #8 [ff75fb8e88fe7dd0] ptlrpc_server_handle_request at ffffffffc11a3723 [ptlrpc]<br> #9 [ff75fb8e88fe7e38] ptlrpc_main at ffffffffc11a521c [ptlrpc]<br>#10 [ff75fb8e88fe7f10] kthread at ffffffff87f20e24<br>#11 [ff75fb8e88fe7f50] ret_from_fork at ffffffff88a0028f<br>crash> whatis __pfx_jbd2_transaction_committed<br><text variable, no debug info> __pfx_jbd2_transaction_committed;</div><div dir="ltr">crash> bt<br>PID: 57703    TASK: ff4103b2ddb58000  CPU: 84   COMMAND: "mdt18_013"<br> #0 [ff75fb8e88fe7bc0] __schedule at ffffffff8881dd01<br> #1 [ff75fb8e88fe7c18] schedule at ffffffff8881e2f5<br> #2 [ff75fb8e88fe7c30] jbd2_log_wait_commit at ffffffffc02f3c3c [jbd2]<br> #3 [ff75fb8e88fe7c98] ldiskfs_sync_file at ffffffffc16174f5 [ldiskfs]<br> #4 [ff75fb8e88fe7cd8] osd_object_sync at ffffffffc1698378 [osd_ldiskfs]<br> #5 [ff75fb8e88fe7d00] mdt_object_sync at ffffffffc18537aa [mdt]<br> #6 [ff75fb8e88fe7d20] mdt_sync at ffffffffc1863c63 [mdt]<br> #7 [ff75fb8e88fe7d50] tgt_request_handle at ffffffffc11f82dd [ptlrpc]<br> #8 [ff75fb8e88fe7dd0] ptlrpc_server_handle_request at ffffffffc11a3723 [ptlrpc]<br> #9 [ff75fb8e88fe7e38] ptlrpc_main at ffffffffc11a521c 
[ptlrpc]<br>#10 [ff75fb8e88fe7f10] kthread at ffffffff87f20e24<br>#11 [ff75fb8e88fe7f50] ret_from_fork at ffffffff88a0028f<br>crash> sym jbd2_log_wait_commit<br>ffffffffc02f3b90 (T) jbd2_log_wait_commit [jbd2] /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/fs/jbd2/journal.c: 681<br><br>Now the backtrace looks much better:<br><br>crash> dis -r ffffffffc02f3c3c | tail -10<br>0xffffffffc02f3c1e <jbd2_log_wait_commit+0x8e>:  mov    %r12d,%eax<br>0xffffffffc02f3c21 <jbd2_log_wait_commit+0x91>:      sub    0x3c0(%rbx),%eax<br>0xffffffffc02f3c27 <jbd2_log_wait_commit+0x97>:        test   %eax,%eax<br>0xffffffffc02f3c29 <jbd2_log_wait_commit+0x99>:        jle    0xffffffffc02f3be8 <jbd2_log_wait_commit+0x58><br>0xffffffffc02f3c2b <jbd2_log_wait_commit+0x9b>:    xor    %esi,%esi<br>0xffffffffc02f3c2d <jbd2_log_wait_commit+0x9d>:       mov    %rsp,%rdi<br>0xffffffffc02f3c30 <jbd2_log_wait_commit+0xa0>:       call   0xffffffff87f4e510 <init_wait_entry><br>0xffffffffc02f3c35 <jbd2_log_wait_commit+0xa5>:       jmp    0xffffffffc02f3c3c <jbd2_log_wait_commit+0xac><br>0xffffffffc02f3c37 <jbd2_log_wait_commit+0xa7>:    call   0xffffffff8881e2a0 <schedule><br>0xffffffffc02f3c3c <jbd2_log_wait_commit+0xac>:      mov    $0x2,%edx<br>crash><br>crash> dis -r ffffffffc16174f5 | tail <br>0xffffffffc16174d6 <ldiskfs_sync_file+0x346>:   jmp    0xffffffffc16173f3 <ldiskfs_sync_file+0x263><br>0xffffffffc16174db <ldiskfs_sync_file+0x34b>:        mov    %r13d,%esi<br>0xffffffffc16174de <ldiskfs_sync_file+0x34e>:        mov    %r15,%rdi<br>0xffffffffc16174e1 <ldiskfs_sync_file+0x351>: call   0xffffffffc02f41a0 <jbd2_trans_will_send_data_barrier><br>0xffffffffc16174e6 <ldiskfs_sync_file+0x356>:       mov    %r13d,%esi<br>0xffffffffc16174e9 <ldiskfs_sync_file+0x359>:        mov    %r15,%rdi<br>0xffffffffc16174ec <ldiskfs_sync_file+0x35c>: test   %eax,%eax<br>0xffffffffc16174ee <ldiskfs_sync_file+0x35e>:  je     0xffffffffc16174fd <ldiskfs_sync_file+0x36d><br>0xffffffffc16174f0 <ldiskfs_sync_file+0x360>:        call   0xffffffffc02f5a40 <jbd2_complete_transaction><br>0xffffffffc16174f5 <ldiskfs_sync_file+0x365>:       mov    %eax,%r12d<br>crash> dis -r ffffffffc02f3c3c | head <br>0xffffffffc02f3b90 <jbd2_log_wait_commit>:        nopl   0x0(%rax,%rax,1) [FTRACE NOP]<br>0xffffffffc02f3b95 <jbd2_log_wait_commit+0x5>:     push   %r15<br>0xffffffffc02f3b97 <jbd2_log_wait_commit+0x7>:      push   %r14<br>0xffffffffc02f3b99 <jbd2_log_wait_commit+0x9>:      mov    $0xfffffe00,%r14d<br>0xffffffffc02f3b9f <jbd2_log_wait_commit+0xf>:        push   %r13<br>0xffffffffc02f3ba1 <jbd2_log_wait_commit+0x11>:     push   %r12<br>0xffffffffc02f3ba3 <jbd2_log_wait_commit+0x13>:     mov    %esi,%r12d<br>0xffffffffc02f3ba6 <jbd2_log_wait_commit+0x16>:      push   %rbp<br>0xffffffffc02f3ba7 <jbd2_log_wait_commit+0x17>:     lea    0x44(%rdi),%rbp<br>0xffffffffc02f3bab <jbd2_log_wait_commit+0x1b>: push   %rbx<br><br><br>crash> journal_t.j_wait_commit ff4103b2f19c7800<br>  j_wait_commit = {<br>    lock = {<br>      {<br>        rlock = {<br>          raw_lock = {<br>            {<br>              val = {<br>                counter = 0x0<br>              },<br>              {<br>                locked = 0x0,<br>                pending = 0x0<br>              },<br>              {<br>                locked_pending = 0x0,<br>                tail = 0x0<br>              }<br>            }<br>          }<br>        }<br>      }<br>    },<br>    head = {<br>      next = 0xff4103b2f19c78c0,<br>      prev = 
0xff4103b2f19c78c0<br>    }<br>  },<br><br><br>So ldiskfs is using jbd2 for journaling.<br><br>TID: 000000001c0d1446<br>crash> journal_t.j_commit_sequence ff4103b2f19c7800<br>  j_commit_sequence = 0x1c0d1445,<br><br>crash> waitq journal_t.j_wait_done_commit ff4103b2f19c7800<br>PID: 464849   TASK: ff4103d58655c000  CPU: 38   COMMAND: "ldlm_cn19_006"<br>PID: 55781    TASK: ff4103b341b68000  CPU: 10   COMMAND: "mdt05_001"<br>PID: 125286   TASK: ff4103c86207c000  CPU: 74   COMMAND: "mdt13_009"<br>PID: 55720    TASK: ff4103b2fd31c000  CPU: 38   COMMAND: "ldlm_cn19_000"<br>PID: 57585    TASK: ff4103b373fcc000  CPU: 40   COMMAND: "ldlm_cn20_004"<br>PID: 57687    TASK: ff4103b29eae0000  CPU: 37   COMMAND: "mdt18_012"<br>PID: 57639    TASK: ff4103b3d2318000  CPU: 77   COMMAND: "mdt14_013"<br>PID: 57703    TASK: ff4103b2ddb58000  CPU: 84   COMMAND: "mdt18_013"<br>PID: 56529    TASK: ff4103b427844000  CPU: 4    COMMAND: "jbd2/drbd2-8"<br><br>---<br>crash> bt ff4103b427844000<br>PID: 56529    TASK: ff4103b427844000  CPU: 4    COMMAND: "jbd2/drbd2-8"<br> #0 [ff75fb8e7c333968] __schedule at ffffffff8881dd01<br> #1 [ff75fb8e7c3339c0] schedule at ffffffff8881e2f5<br> #2 [ff75fb8e7c3339d8] jbd2_log_wait_commit at ffffffffc02f3c3c [jbd2]<br> #3 [ff75fb8e7c333a40] ldiskfs_sync_fs at ffffffffc1606840 [ldiskfs]<br> #4 [ff75fb8e7c333a78] osd_sync at ffffffffc16972df [osd_ldiskfs]<br> #5 [ff75fb8e7c333a98] qsd_acquire at ffffffffc156d56c [lquota]<br> #6 [ff75fb8e7c333b18] qsd_op_begin0 at ffffffffc156e045 [lquota]<br> #7 [ff75fb8e7c333b98] qsd_reserve_or_free_quota at ffffffffc156ffd5 [lquota]<br> #8 [ff75fb8e7c333bc8] osd_reserve_or_free_quota at ffffffffc1696c86 [osd_ldiskfs]<br> #9 [ff75fb8e7c333c10] tgt_cb_last_committed at ffffffffc11e3512 [ptlrpc]<br>#10 [ff75fb8e7c333c78] osd_trans_commit_cb at ffffffffc1699376 [osd_ldiskfs]<br>#11 [ff75fb8e7c333cc0] ldiskfs_journal_commit_callback at ffffffffc1607b85 [ldiskfs]<br>#12 [ff75fb8e7c333cf8] jbd2_journal_commit_transaction at ffffffffc02ef960 [jbd2]<br>#13 [ff75fb8e7c333ea0] kjournald2 at ffffffffc02f3fed [jbd2]<br>#14 [ff75fb8e7c333f10] kthread at ffffffff87f20e24<br>#15 [ff75fb8e7c333f50] ret_from_fork at ffffffff88a0028f<br></div><div dir="ltr"><br></div><div dir="ltr">[293564.548588] Lustre: Skipped 1 previous similar message<br>[293564.548588] Pid: 57703, comm: mdt18_013 4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP Fri Jun 28 18:44:24 UTC 2024<br>[293564.548591] Call Trace TBD:<br>[293564.548607] [<0>] jbd2_log_wait_commit+0xac/0x120 [jbd2]<br>[293564.548615] [<0>] ldiskfs_sync_file+0x365/0x3b0 [ldiskfs]<br>[293564.548634] [<0>] osd_object_sync+0x188/0x1e0 [osd_ldiskfs]<br>[293564.548645] [<0>] mdt_object_sync.isra.69+0x7a/0x3b0 [mdt]<br>[293564.548664] [<0>] mdt_sync+0xf3/0x320 [mdt]<br>[293564.548681] [<0>] tgt_request_handle+0xccd/0x1a20 [ptlrpc]<br>[293564.548733] [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]<br>[293564.548780] [<0>] ptlrpc_main+0xbec/0x1530 [ptlrpc]<br>[293564.548823] [<0>] kthread+0x134/0x150<br>[293564.548826] [<0>] ret_from_fork+0x1f/0x40<br><br>[293564.548830] Pid: 57639, comm: mdt14_013 4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP Fri Jun 28 18:44:24 UTC 2024<br>[293564.548835] Call Trace TBD:<br>[293564.548856] [<0>] jbd2_log_wait_commit+0xac/0x120 [jbd2]<br>[293564.548864] [<0>] ldiskfs_sync_fs+0x170/0x1d0 [ldiskfs]<br>[293564.548881] [<0>] osd_sync+0xdf/0x180 [osd_ldiskfs]<br>[293564.548888] INFO: task jbd2/drbd2-8:56529 blocked for more than 120 seconds.<br>[293564.548894]       Tainted: G           OE     -------- - 
 - 4.18.0-553.5.1.el8_lustre.x86_64 #1<br>[293564.549271] [<0>] qsd_acquire+0x46c/0xde0 [lquota]<br>[293564.550413] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.<br>[293564.550412] [<0>] qsd_op_begin0+0x165/0x8a0 [lquota]<br><br>[293564.550642] task:jbd2/drbd2-8    state:D stack:0     pid:56529 ppid:2      flags:0x80004080<br>[293564.550641] [<0>] qsd_reserve_or_free_quota+0x2e5/0x7d0 [lquota]<br>[293564.550648] Call Trace:<br>[293564.550652]  __schedule+0x2d1/0x870<br>[293564.550652] [<0>] osd_reserve_or_free_quota+0xe6/0x1c0 [osd_ldiskfs]<br>[293564.550659]  ? __wake_up_common_lock+0x89/0xc0<br>[293564.550663]  schedule+0x55/0xf0<br>[293564.550663] [<0>] dt_reserve_or_free_quota.constprop.45+0x2c/0xd0 [mdd]<br>[293564.550667]  jbd2_log_wait_commit+0xac/0x120 [jbd2]<br>[293564.550675]  ? finish_wait+0x80/0x80<br>[293564.550675] [<0>] mdd_attr_set+0xa70/0x1020 [mdd]<br>[293564.550679]  ldiskfs_sync_fs+0x170/0x1d0 [ldiskfs]<br>[293564.550685] [<0>] mdt_reint_setattr+0xea8/0x1720 [mdt]<br>[293564.550695]  osd_sync+0xdf/0x180 [osd_ldiskfs]<br>[293564.550708]  qsd_acquire+0x46c/0xde0 [lquota]<br>[293564.550707] [<0>] mdt_reint_rec+0x11f/0x270 [mdt]<br>[293564.550723]  qsd_op_begin0+0x165/0x8a0 [lquota]<br>[293564.550725] [<0>] mdt_reint_internal+0x4d0/0x7e0 [mdt]<br>[293564.550733]  ? lfsck_key_init+0x1c/0x130 [lfsck]<br>[293564.550747]  ? srso_alias_return_thunk+0x5/0xfcdfd<br>[293564.550743] [<0>] mdt_reint+0x5d/0x110 [mdt]<br>[293564.550749]  ? kmem_cache_alloc_trace+0x142/0x280<br>[293564.550754]  qsd_reserve_or_free_quota+0x2e5/0x7d0 [lquota]<br>[293564.550764]  osd_reserve_or_free_quota+0xe6/0x1c0 [osd_ldiskfs]<br>[293564.550775]  ? lu_context_init+0xa5/0x1b0 [obdclass]<br>[293564.550760] [<0>] tgt_request_handle+0xccd/0x1a20 [ptlrpc]<br>[293564.550807]  tgt_cb_last_committed+0xc2/0x5c0 [ptlrpc]<br>[293564.550817] [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]<br>[293564.550863]  osd_trans_commit_cb+0xe6/0x300 [osd_ldiskfs]<br>[293564.550875]  ldiskfs_journal_commit_callback+0xa5/0xe0 [ldiskfs]<br>[293564.550865] [<0>] ptlrpc_main+0xbec/0x1530 [ptlrpc]<br>[293564.550888]  jbd2_journal_commit_transaction+0x1690/0x19f0 [jbd2]<br>[293564.550897]  ? __try_to_del_timer_sync+0x5a/0xa0<br>[293564.550900]  kjournald2+0xbd/0x270 [jbd2]<br>[293564.550904]  ? finish_wait+0x80/0x80<br>[293564.550907]  ? commit_timeout+0x10/0x10 [jbd2]<br>[293564.550908] [<0>] kthread+0x134/0x150<br>[293564.550911]  kthread+0x134/0x150<br>[293564.550912] [<0>] ret_from_fork+0x1f/0x40<br>[293564.550915]  ? 
set_kthread_struct+0x50/0x50<br>[293564.550918]  ret_from_fork+0x1f/0x40</div><div dir="ltr"><br></div><div dir="ltr">__pfx_jbd2_transaction_committed should map to jbd2_transaction_committed<br><br>PID: 56529    TASK: ff4103b427844000  CPU: 4    COMMAND: "jbd2/drbd2-8"<br> #0 [ff75fb8e7c333968] __schedule at ffffffff8881dd01<br> #1 [ff75fb8e7c3339c0] schedule at ffffffff8881e2f5<br> #2 [ff75fb8e7c3339d8] __pfx_jbd2_transaction_committed at ffffffffc02f3c3c [jbd2]  <br> #3 [ff75fb8e7c333a40] ldiskfs_sync_fs at ffffffffc1606840 [ldiskfs]<br> #4 [ff75fb8e7c333a78] osd_sync at ffffffffc16972df [osd_ldiskfs]<br> #5 [ff75fb8e7c333a98] qsd_acquire at ffffffffc156d56c [lquota]<br> #6 [ff75fb8e7c333b18] qsd_op_begin0 at ffffffffc156e045 [lquota]<br> #7 [ff75fb8e7c333b98] qsd_reserve_or_free_quota at ffffffffc156ffd5 [lquota]<br> #8 [ff75fb8e7c333bc8] osd_reserve_or_free_quota at ffffffffc1696c86 [osd_ldiskfs]<br> #9 [ff75fb8e7c333c10] tgt_cb_last_committed at ffffffffc11e3512 [ptlrpc]<br>#10 [ff75fb8e7c333c78] osd_trans_commit_cb at ffffffffc1699376 [osd_ldiskfs]<br>#11 [ff75fb8e7c333cc0] ldiskfs_journal_commit_callback at ffffffffc1607b85 [ldiskfs]<br>#12 [ff75fb8e7c333cf8] __pfx_jbd2_journal_commit_transaction at ffffffffc02ef960 [jbd2]<br>#13 [ff75fb8e7c333ea0] kjournald2 at ffffffffc02f3fed [jbd2]<br>#14 [ff75fb8e7c333f10] kthread at ffffffff87f20e24<br>#15 [ff75fb8e7c333f50] ret_from_fork at ffffffff88a0028f<br><br><br>Focusing on osd_sync -> ldiskfs_sync_fs -> __pfx_jbd2_transaction_committed to see why we stalled:<br><br>/* Return 1 when transaction with given tid has already committed. */<br>int jbd2_transaction_committed(journal_t *journal, tid_t tid)<br>{<br>        int ret = 1;<br><br>        read_lock(&journal->j_state_lock);   /* <== seems to be a rwlock_t j_state_lock */<br>        if (journal->j_running_transaction &&<br>            journal->j_running_transaction->t_tid == tid)<br>                ret = 0;<br>        if (journal->j_committing_transaction &&<br>            journal->j_committing_transaction->t_tid == tid)<br>                ret = 0;<br>        read_unlock(&journal->j_state_lock);<br>        return ret;<br>}<br>EXPORT_SYMBOL(jbd2_transaction_committed);<br><br>
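As a side note, the function that crash> sym resolved this address to earlier, jbd2_log_wait_commit(), decides whether to keep sleeping with the wrap-safe tid comparison tid_gt() from include/linux/jbd2.h. Plugging in the values captured above (j_commit_sequence = 0x1c0d1445 and the TID we noted, 0x1c0d1446, assuming that is indeed the tid being waited for), the condition to keep waiting is still true, which is consistent with all of these threads staying asleep. A throwaway userspace check (our own toy program; tid_gt copied from the upstream header):<br><br><pre>
#include <stdio.h>

typedef unsigned int tid_t;

/* Wrap-safe tid comparison, as defined in include/linux/jbd2.h. */
static inline int tid_gt(tid_t x, tid_t y)
{
        int difference = (x - y);
        return (difference > 0);
}

int main(void)
{
        tid_t waited_tid        = 0x1c0d1446;  /* TID noted in the vmcore  */
        tid_t j_commit_sequence = 0x1c0d1445;  /* last fully committed tid */

        /* jbd2_log_wait_commit() keeps sleeping while tid_gt() returns 1. */
        printf("still waiting: %d\n", tid_gt(waited_tid, j_commit_sequence));
        return 0;
}
</pre>This prints "still waiting: 1", i.e. the commit these threads are waiting for has not completed as far as the journal is concerned.<br><br>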
What does not make sense is how we get to schedule() in each case; if it is jbd2_transaction_committed, then maybe here:<br><br>0xffffffffc16174a5 <ldiskfs_sync_file+789>: mov    %r12,%rdi<br>0xffffffffc16174a8 <ldiskfs_sync_file+792>:   mov    %r14,%r15<br>0xffffffffc16174ab <ldiskfs_sync_file+795>:   call   0xffffffff881a4170 <sync_inode_metadata><br>0xffffffffc16174b0 <ldiskfs_sync_file+800>:       mov    %eax,%r13d<br>0xffffffffc16174b3 <ldiskfs_sync_file+803>:  test   %eax,%eax<br>0xffffffffc16174b5 <ldiskfs_sync_file+805>:    jne    0xffffffffc161752e <ldiskfs_sync_file+926><br>0xffffffffc16174b7 <ldiskfs_sync_file+807>:    mov    -0x98(%r12),%rax<br>0xffffffffc16174bf <ldiskfs_sync_file+815>:    movabs $0x10000000000,%rsi<br>0xffffffffc16174c9 <ldiskfs_sync_file+825>:   test   %rsi,%rax<br>0xffffffffc16174cc <ldiskfs_sync_file+828>:    jne    0xffffffffc161746a <ldiskfs_sync_file+730><br>0xffffffffc16174ce <ldiskfs_sync_file+830>:    mov    %r15,%rdi<br>0xffffffffc16174d1 <ldiskfs_sync_file+833>:   call   0xffffffff8818a140 <dput><br>0xffffffffc16174d6 <ldiskfs_sync_file+838>:      jmp    0xffffffffc16173f3 <ldiskfs_sync_file+611><br>0xffffffffc16174db <ldiskfs_sync_file+843>:    mov    %r13d,%esi<br>0xffffffffc16174de <ldiskfs_sync_file+846>:  mov    %r15,%rdi<br>0xffffffffc16174e1 <ldiskfs_sync_file+849>:   call   0xffffffffc02f41a0 <jbd2_trans_will_send_data_barrier><br>0xffffffffc16174e6 <ldiskfs_sync_file+854>: mov    %r13d,%esi<br>0xffffffffc16174e9 <ldiskfs_sync_file+857>:  mov    %r15,%rdi<br>0xffffffffc16174ec <ldiskfs_sync_file+860>:   test   %eax,%eax<br>0xffffffffc16174ee <ldiskfs_sync_file+862>:    je     0xffffffffc16174fd <ldiskfs_sync_file+877><br>0xffffffffc16174f0 <ldiskfs_sync_file+864>:    call   0xffffffffc02f5a40 <__pfx_jbd2_journal_start_commit><br>0xffffffffc16174f5 <ldiskfs_sync_file+869>:   mov    %eax,%r12d<br><br><br>called here:<br>0xffffffffc16174f0 <ldiskfs_sync_file+864>: call   0xffffffffc02f5a40 <__pfx_jbd2_journal_start_commit><br>0xffffffffc16174f5 <ldiskfs_sync_file+869>:   mov    %eax,%r12d<br>0xffffffffc16174f8 <ldiskfs_sync_file+872>:  jmp    0xffffffffc16172e7 <ldiskfs_sync_file+343><br>0xffffffffc16174fd <ldiskfs_sync_file+877>:    call   0xffffffffc02f5a40 <__pfx_jbd2_journal_start_commit><br>0xffffffffc1617502 <ldiskfs_sync_file+882>:   mov    %eax,%r13d<br><br><br><br>So stuck in this loop:<br><br>0xffffffffc02f3c20 <__pfx_jbd2_transaction_committed>: loopne 0xffffffffc02f3c4d <__pfx_jbd2_transaction_committed+45><br>0xffffffffc02f3c22 <__pfx_jbd2_transaction_committed+2>:   add    $0x3,%eax<br>0xffffffffc02f3c25 <__pfx_jbd2_transaction_committed+5>:      add    %al,(%rax)<br>0xffffffffc02f3c27 <__pfx_jbd2_transaction_committed+7>:     test   %eax,%eax<br><br>rest of the data<br>----------------<br>crash> bt -l<br>PID: 56529    TASK: ff4103b427844000  CPU: 4    COMMAND: "jbd2/drbd2-8"<br> #0 [ff75fb8e7c333968] __schedule at ffffffff8881dd01<br>    /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/kernel/sched/core.c: 3804<br> #1 [ff75fb8e7c3339c0] schedule at ffffffff8881e2f5<br>    /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/./arch/x86/include/asm/current.h: 15<br> #2 [ff75fb8e7c3339d8] __pfx_jbd2_transaction_committed at ffffffffc02f3c3c [jbd2]<br>    /usr/src/debug/kernel-5.14.0-427.31.1.el9_4/linux-5.14.0-427.31.1_lustre.el9.x86_64/./include/trace/events/jbd2.h: 462<br> #3 [ff75fb8e7c333a40] ldiskfs_sync_fs at ffffffffc1606840 [ldiskfs]<br>    /usr/src/debug/lustre-2.15.5-1.el8.x86_64/ldiskfs/super.c: 5310<br> #4 [ff75fb8e7c333a78] osd_sync at ffffffffc16972df [osd_ldiskfs]<br>    /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/osd-ldiskfs/osd_handler.c: 2497<br> #5 [ff75fb8e7c333a98] qsd_acquire at ffffffffc156d56c [lquota]<br>    /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/quota/qsd_internal.h: 217<br> #6 [ff75fb8e7c333b18] qsd_op_begin0 at ffffffffc156e045 [lquota]<br>    /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/quota/qsd_handler.c: 751<br> #7 [ff75fb8e7c333b98] qsd_reserve_or_free_quota at ffffffffc156ffd5 [lquota]<br>    /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/quota/qsd_handler.c: 1358<br> #8 [ff75fb8e7c333bc8] osd_reserve_or_free_quota at ffffffffc1696c86 [osd_ldiskfs]<br>    /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/include/lustre_quota.h: 276<br> #9 [ff75fb8e7c333c10] tgt_cb_last_committed at ffffffffc11e3512 [ptlrpc]<br>#10 [ff75fb8e7c333c78] osd_trans_commit_cb at ffffffffc1699376 [osd_ldiskfs]<br>    /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/osd-ldiskfs/osd_handler.c: 1795<br>#11 [ff75fb8e7c333cc0] ldiskfs_journal_commit_callback at ffffffffc1607b85 [ldiskfs]<br>    /var/lib/jenkins/workspace/lustre-b2_15/arch/x86_64/build_type/server/distro/el8.10/ib_stack/inkernel/BUILD/reused/usr/src/kernels/4.18.0-553.5.1.el8_lustre.x86_64/include/linux/spinlock.h: 350<br>#12 [ff75fb8e7c333cf8] __pfx_jbd2_journal_commit_transaction at ffffffffc02ef960 [jbd2]<br>    /usr/src/debug/kernel-5.14.0-427.31.1.el9_4/linux-5.14.0-427.31.1_lustre.el9.x86_64/fs/jbd2/commit.c: 926<br>#13 [ff75fb8e7c333ea0] kjournald2 at ffffffffc02f3fed [jbd2]<br>    /usr/src/debug/kernel-5.14.0-427.31.1.el9_4/linux-5.14.0-427.31.1_lustre.el9.x86_64/./include/trace/events/jbd2.h: 14<br>#14 [ff75fb8e7c333f10] kthread at ffffffff87f20e24<br>    /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/kernel/kthread.c: 364<br>#15 [ff75fb8e7c333f50] ret_from_fork at ffffffff88a0028f<br>    /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/arch/x86/entry/entry_64.S: 322<br><br>
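For reference, here is the wait loop itself, jbd2_log_wait_commit(), as we read it in the upstream jbd2 sources (trimmed and paraphrased, so details may differ slightly in the RHEL kernel); the schedule() in all of these stacks comes from the wait_event() on j_wait_done_commit:<br><br><pre>
/* fs/jbd2/journal.c (upstream, trimmed) */
int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
{
        int err = 0;

        read_lock(&journal->j_state_lock);
        while (tid_gt(tid, journal->j_commit_sequence)) {
                read_unlock(&journal->j_state_lock);
                wake_up(&journal->j_wait_commit);
                /* sleeps (schedule()) until kjournald2 advances
                 * j_commit_sequence and wakes j_wait_done_commit */
                wait_event(journal->j_wait_done_commit,
                           !tid_gt(tid, journal->j_commit_sequence));
                read_lock(&journal->j_state_lock);
        }
        read_unlock(&journal->j_state_lock);

        if (unlikely(is_journal_aborted(journal)))
                err = -EIO;
        return err;
}
</pre>If our reading of jbd2_journal_commit_transaction() is right, j_commit_sequence is only advanced (and j_wait_done_commit only woken) after the commit callback has returned. So a jbd2_log_wait_commit() issued from inside the commit callback of the same journal, which is what the jbd2/drbd2-8 backtrace above appears to show via the quota code, could never be satisfied, and everything queued on j_wait_done_commit would stay stuck behind it. We would be happy to be corrected on this.<br><br>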
Trying to sync to drbd2<br><br>crash> super_block ff4103b252bc8000<br>struct super_block {<br>  s_list = {<br>    next = 0xff4103b218029800,<br>    prev = 0xff4103b252bcc000<br>  },<br>  s_dev = 154140674,<br>  s_blocksize_bits = 12 '\f',<br>  s_blocksize = 4096,<br>  s_maxbytes = 17592186040320,<br>  s_type = 0xffffffffc16615e0 <ldiskfs_fs_type>,<br>  s_op = 0xffffffffc1632d60 <ldiskfs_sops>,<br>  dq_op = 0xffffffffc1632ee0 <ldiskfs_quota_operations>,<br>  s_qcop = 0xffffffff88e47160 <dquot_quotactl_sysfile_ops>,<br>  s_export_op = 0xffffffffc1632d00 <ldiskfs_export_ops>,<br>  s_flags = 1879113728,<br>  s_iflags = 9,<br>  s_magic = 61267,<br>  s_root = 0xff4103b48587d380,<br>  s_umount = {<br>    count = {<br>      counter = 1536<br>    },<br>..<br>..<br>s_id = "drbd2\0"<br> No pending requests, so it is not waiting in the block layer.<br><br>147 ff4103b215d2e000   drbd2      ff4103b27ddddf58       0     0     0<br><br>The first PID to hang was this one:<br><br>PID: 57703    TASK: ff4103b2ddb58000  CPU: 84   COMMAND: "mdt18_013"<br> #0 [ff75fb8e88fe7bc0] __schedule at ffffffff8881dd01<br>    /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/kernel/sched/core.c: 3804<br> #1 [ff75fb8e88fe7c18] schedule at ffffffff8881e2f5<br>    /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/./arch/x86/include/asm/current.h: 15<br> #2 [ff75fb8e88fe7c30] __pfx_jbd2_transaction_committed at ffffffffc02f3c3c [jbd2]<br>    /usr/src/debug/kernel-5.14.0-427.31.1.el9_4/linux-5.14.0-427.31.1_lustre.el9.x86_64/./include/trace/events/jbd2.h: 462<br> #3 [ff75fb8e88fe7c98] ldiskfs_sync_file at ffffffffc16174f5 [ldiskfs]<br>    /usr/src/debug/lustre-2.15.5-1.el8.x86_64/ldiskfs/fsync.c: 152<br> #4 [ff75fb8e88fe7cd8] osd_object_sync at ffffffffc1698378 [osd_ldiskfs]<br>    /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/osd-ldiskfs/osd_handler.c: 5108<br> #5 [ff75fb8e88fe7d00] mdt_object_sync at ffffffffc18537aa [mdt]<br>    /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/include/md_object.h: 564<br> #6 [ff75fb8e88fe7d20] mdt_sync at ffffffffc1863c63 [mdt]<br>    /usr/src/debug/lustre-2.15.5-1.el8.x86_64/lustre/mdt/mdt_handler.c: 3121<br> #7 [ff75fb8e88fe7d50] tgt_request_handle at ffffffffc11f82dd [ptlrpc]<br> #8 [ff75fb8e88fe7dd0] ptlrpc_server_handle_request at ffffffffc11a3723 [ptlrpc]<br> #9 [ff75fb8e88fe7e38] ptlrpc_main at ffffffffc11a521c [ptlrpc]<br>#10 [ff75fb8e88fe7f10] kthread at ffffffff87f20e24<br>    /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/kernel/kthread.c: 364<br>#11 
[ff75fb8e88fe7f50] ret_from_fork at ffffffff88a0028f<br>    /usr/src/debug/kernel-4.18.0-553.5.1.el8_10/linux-4.18.0-553.5.1.el8_lustre.x86_64/arch/x86/entry/entry_64.S: 322<br><br><br>crash> file ff4103cd3f0e9900 <br>struct file {<br>  f_u = {<br>    fu_llist = {<br>      next = 0x0<br>    },<br>    fu_rcuhead = {<br>      next = 0x0,<br>      func = 0x0<br>    }<br>  },<br>  f_path = {<br>    mnt = 0xff4103b427901ca0,<br>    dentry = 0xff4103d887835140<br>  },<br>  f_inode = 0xff4103cddbaf3e58,<br>  f_op = 0xffffffffc1631540 <ldiskfs_file_operations>,<br>  f_lock = {<br><br><br>crash> super_block 0xff4103b252bc8000 | grep s_id<br>  s_id = "drbd2\000s\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",<br></div><div dir="ltr"><br></div><div dir="ltr"><br></div><div>Best regards,</div><div>GM</div><div dir="ltr"><br></div><div dir="ltr"><div>--</div><div><br></div><div><b>Georgios Magklaras PhD</b></div><div>Chief Engineer</div><div>IT Infrastructure/HPC</div><div>The Norwegian Meteorological Institute</div><div><br></div><div><a href="https://www.met.no/" target="_blank">https://www.met.no/</a></div><div><a href="https://www.steelcyber.com/georgioshome/" target="_blank">https://www.steelcyber.com/georgioshome/</a></div></div></div></div></div></div>