[lustre-discuss] [Lustre 2.15.2] OST thread hangs multiple times a day

Limstash Wong limstash.w at gmail.com
Mon Nov 4 19:19:36 PST 2024


Hello everyone,

I have just started using Lustre. The MGS, MDS, and OSS all run on a single storage server, and three client servers are connected to it over InfiniBand.

The storage server is running Lustre 2.15.2, and the clients are running Lustre 2.15.3.

However, the storage server hangs multiple times a day and needs a forced reboot to recover. I checked the disk read/write activity with iostat, and everything looks normal:

Linux 4.18.0-425.3.1.el8_lustre.x86_64 (localhost.localdomain)         11/05/2024         _x86_64_        (72 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    0.05    0.12    0.00    0.00   99.83

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme0n1           3.71        89.92        33.74    4147409    1556088
sdb              17.61      1622.88       860.78   74849392   39700340
sda              17.65      1623.40       862.03   74873003   39757960
dm-0              2.87        87.86        33.69    4052089    1554003
dm-1              0.00         0.05         0.00       2220          0
dm-2              0.01         0.27         0.00      12537         96
dm-3              1.16         0.95         4.36      44033     201252
dm-4             41.85      3244.86      1718.45  149656969   79256952
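The table above is just the cumulative summary since boot from plain iostat. If it would be more useful, I can also capture extended per-device statistics (await, queue size, %util) while the hang is happening; my rough plan is something like this (the 5-second interval and the log path are just my choice):

    # Extended device statistics with timestamps, every 5 seconds,
    # appended to a file so the data survives the forced reboot
    iostat -dxt 5 >> /var/log/iostat-ost.log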

How should I debug or fix this? The relevant dmesg output is as follows:

[43709.744463] Lustre: ll_ost00_000: service thread pid 5419 was inactive for 201.493 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[43709.744465] Pid: 5429, comm: ll_ost_io00_000 4.18.0-425.3.1.el8_lustre.x86_64 #1 SMP Wed Jan 11 23:55:00 UTC 2023
[43709.744466] Lustre: ll_ost_io00_001: service thread pid 5430 was inactive for 201.544 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[43709.744469] Lustre: Skipped 4 previous similar messages
[43709.744470] Call Trace TBD:
[43709.744469] Lustre: Skipped 2 previous similar messages
[43709.744491] [<0>] osd_read_lock+0x7d/0x100 [osd_ldiskfs]
[43709.744544] [<0>] ofd_preprw_write.isra.28+0x142/0x1240 [ofd]
[43709.744566] [<0>] ofd_preprw+0x7b2/0x900 [ofd]
[43709.744577] [<0>] obd_preprw+0x1a1/0x360 [ptlrpc]
[43709.744746] [<0>] tgt_brw_write+0x11cf/0x1ce0 [ptlrpc]
[43709.744849] [<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[43709.744965] [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[43709.745062] [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
[43709.745160] [<0>] kthread+0x10a/0x120
[43709.745167] [<0>] ret_from_fork+0x35/0x40
[43709.745173] Pid: 5431, comm: ll_ost_io00_002 4.18.0-425.3.1.el8_lustre.x86_64 #1 SMP Wed Jan 11 23:55:00 UTC 2023
[43709.745176] Call Trace TBD:
[43709.745187] [<0>] osd_read_lock+0x7d/0x100 [osd_ldiskfs]
[43709.745217] [<0>] ofd_preprw_write.isra.28+0x142/0x1240 [ofd]
[43709.745234] [<0>] ofd_preprw+0x7b2/0x900 [ofd]
[43709.745245] [<0>] obd_preprw+0x1a1/0x360 [ptlrpc]
[43709.745378] [<0>] tgt_brw_write+0x11cf/0x1ce0 [ptlrpc]
[43709.745502] [<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[43709.745606] [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[43709.745706] [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
[43709.745819] [<0>] kthread+0x10a/0x120
[43709.745823] [<0>] ret_from_fork+0x35/0x40
[43709.745827] Pid: 5419, comm: ll_ost00_000 4.18.0-425.3.1.el8_lustre.x86_64 #1 SMP Wed Jan 11 23:55:00 UTC 2023
[43709.745830] Call Trace TBD:
[43709.745845] [<0>] wait_transaction_locked+0x89/0xd0 [jbd2]
[43709.745859] [<0>] add_transaction_credits+0xd4/0x290 [jbd2]
[43709.745864] [<0>] start_this_handle+0x10a/0x520 [jbd2]
[43709.745868] [<0>] jbd2__journal_restart+0xb4/0x160 [jbd2]
[43709.745873] [<0>] osd_fallocate_preallocate.isra.38+0x5a6/0x760 [osd_ldiskfs]
[43709.745904] [<0>] osd_fallocate+0xfd/0x370 [osd_ldiskfs]
[43709.745921] [<0>] ofd_object_fallocate+0x5dd/0xa30 [ofd]
[43709.745939] [<0>] ofd_fallocate_hdl+0x467/0x730 [ofd]
[43709.745948] [<0>] tgt_request_handle+0xc97/0x1a40 [ptlrpc]
[43709.746085] [<0>] ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
[43709.746204] [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
[43709.746307] [<0>] kthread+0x10a/0x120
[43709.746311] [<0>] ret_from_fork+0x35/0x40
[43713.840494] Lustre: ll_ost_io01_017: service thread pid 5618 was inactive for 200.952 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[43713.840502] Lustre: Skipped 5 previous similar messages
[43717.936517] Lustre: ll_ost_io01_067: service thread pid 5670 was inactive for 200.956 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[43717.936517] Lustre: ll_ost_io01_008: service thread pid 5608 was inactive for 204.028 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[43717.936525] Lustre: Skipped 4 previous similar messages
[43717.936528] Lustre: Skipped 4 previous similar messages
[43722.032543] Lustre: ll_ost_io01_035: service thread pid 5638 was inactive for 202.868 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[43722.032548] Lustre: Skipped 1 previous similar message
[43730.224582] Lustre: ll_ost_io01_069: service thread pid 5672 was inactive for 201.332 seconds. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[43746.608918] INFO: task jbd2/dm-4-8:5413 blocked for more than 120 seconds.
[43746.608956]       Tainted: G           OE    --------- -  - 4.18.0-425.3.1.el8_lustre.x86_64 #1
[43746.609020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[43746.609043] task:jbd2/dm-4-8     state:D stack:    0 pid: 5413 ppid:     2 flags:0x80004080
[43746.609048] Call Trace:
[43746.609051]  __schedule+0x2d1/0x860
[43746.609056]  ? finish_wait+0x80/0x80
[43746.609063]  schedule+0x35/0xa0
[43746.609066]  jbd2_journal_commit_transaction+0x259/0x1a00 [jbd2]
[43746.609078]  ? update_load_avg+0x7e/0x710
[43746.609085]  ? newidle_balance+0x279/0x3c0
[43746.609090]  ? finish_wait+0x80/0x80
[43746.609093]  ? __switch_to+0x10c/0x450
[43746.609101]  ? finish_task_switch+0xaf/0x2e0
[43746.609105]  ? lock_timer_base+0x67/0x90
[43746.609110]  kjournald2+0xbd/0x270 [jbd2]
[43746.609117]  ? finish_wait+0x80/0x80
[43746.609119]  ? commit_timeout+0x10/0x10 [jbd2]
[43746.609123]  kthread+0x10a/0x120
[43746.609127]  ? set_kthread_struct+0x50/0x50
[43746.609130]  ret_from_fork+0x35/0x40
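The next time the threads hang, I intend to collect some more information before rebooting; this is what I plan to run (please tell me if there is something better to gather):

    # Dump the Lustre kernel debug log to a file
    lctl dk /tmp/lustre-debug.log

    # Dump stack traces of all tasks to the kernel log (sysrq must be enabled)
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger
    dmesg -T > /tmp/dmesg-hang.log

    # Journal transaction statistics for the OST device; the device name
    # is taken from the hung jbd2 thread (jbd2/dm-4-8) in the trace above
    cat /proc/fs/jbd2/dm-4-8/info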

Regards,
Limstash Wong