[lustre-discuss] Unresponsiveness of OSS and Directory Listing Hang-up

Jane Liu zhliu at rcf.rhic.bnl.gov
Mon May 22 13:32:44 PDT 2023


Hi,

We found a solution to the directory listing hang-ups and the
unresponsive OSS. There appeared to be a metadata inconsistency on the
MDT. Since the online "lctl lfsck_start ..." did not work, we tried an
offline fix at the ext4 level: we unmounted the MDT and ran e2fsck to
repair the metadata inconsistencies. The command we executed was:

e2fsck -fp /dev/mapper/mds01-mds01

After the MDT was mounted, everything just worked.
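
For anyone who runs into the same symptoms, the overall sequence was
roughly what is sketched below. The device path is from our setup and
the mount point is just a placeholder, so adjust both for your own
configuration; the read-only pass with -n is an extra precaution we
would suggest rather than something we strictly needed.

# stop client activity if possible, then unmount the MDT
umount /mnt/mdt

# optional dry run: report problems without changing anything on disk
e2fsck -fn /dev/mapper/mds01-mds01

# force a full check and automatically fix what is safe to fix
e2fsck -fp /dev/mapper/mds01-mds01

# remount the MDT and let clients reconnect
mount -t lustre /dev/mapper/mds01-mds01 /mnt/mdt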

Jane


On 2023-05-18 16:44, Jane Liu via lustre-discuss wrote:
> Hi,
> 
> We have recently upgraded our Lustre servers to run on RHEL 8.7, along
> with Lustre 2.15.2. Despite running smoothly for several weeks, we
> have encountered an issue that is the same as the one reported at
> https://jira.whamcloud.com/browse/LU-10697. Although the Lustre
> version described there differs from ours, the
> symptoms are identical.
> 
> “uname -a” on our system returns the following output:
> 4.18.0-425.3.1.el8_lustre.x86_64 #1 SMP Wed Jan 11 23:55:00 UTC 2023
> x86_64 x86_64 x86_64 GNU/Linux. And the content of /etc/redhat-release
> is: Red Hat Enterprise Linux release 8.7 (Ootpa)
> 
> Here are the details about the issue.
> 
> On 5/16, around 4:47 am, one OSS named oss29 began experiencing
> problems. There was a rapid increase in the number of active requests,
> from 1 to 123, occurring from roughly 4:47 am to 10:20 am. At around
> 5/16 10:20 am, I/O on oss29 stopped entirely, and the number of active
> requests remained at 123. Concurrently, the system load rose sharply,
> shooting up from a very low value to as high as 400, again within the
> 4:47 am to 10:20 am window. Interestingly, despite the extreme system
> load, the CPUs remained almost idle.
> 
> Furthermore, when executing the lfs df command on the MDS, no OSTs on
> oss29 were visible.
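> 
> In case it helps anyone triage something similar, these are the kinds
> of checks we used to confirm that the MDS had lost the OSTs on oss29.
> The NID and target name come from our logs, so substitute your own.
> 
> # from a client (or any node with a client mount): which OSTs show up
> lfs df -h | grep OST
> 
> # on the MDS: device list entry and state for the affected OST
> lctl dl | grep OST0192
> 
> # basic LNet reachability check of the OSS from the MDS
> lctl ping 10.42.73.42@tcp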
> 
> We noticed a lot of errors about “This server is not able to keep up
> with request traffic (cpu-bound)” in the syslog on oss29:
> 
> May 16 05:44:52 oss29 kernel: Lustre: ost_io: This server is not able
> to keep up with request traffic (cpu-bound).
> May 16 06:13:49 oss29 kernel: Lustre: ost_io: This server is not able
> to keep up with request traffic (cpu-bound).
> May 16 06:23:39 oss29 kernel: Lustre: ost_io: This server is not able
> to keep up with request traffic (cpu-bound).
> May 16 06:32:56 oss29 kernel: Lustre: ost_io: This server is not able
> to keep up with request traffic (cpu-bound).
> May 16 06:42:46 oss29 kernel: Lustre: ost_io: This server is not able
> to keep up with request traffic (cpu-bound).
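> 
> As far as I understand it, this message just means requests were
> queuing faster than the ost_io service threads could drain them; the
> stack traces further down suggest the threads were actually blocked
> in the md/jbd2 layer rather than being CPU-bound. A quick way to look
> at the thread situation on the OSS is sketched below (parameter names
> as in 2.15, to the best of my knowledge):
> 
> # configured vs. actually running ost_io service threads
> lctl get_param ost.OSS.ost_io.threads_min ost.OSS.ost_io.threads_max
> lctl get_param ost.OSS.ost_io.threads_started
> 
> # request queue depth and active request counters for ost_io
> lctl get_param ost.OSS.ost_io.stats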
> 
> At the same time, users also reported an issue with a specific
> directory, which became inaccessible. Running ls -l on this directory
> resulted in a hang, while the ls command worked. Users found they
> could read certain files within the directory, but not all of them.
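> 
> In hindsight this pattern fits an OST-side problem: a plain ls only
> needs directory entries from the MDT, while ls -l has to stat every
> file and therefore talk to the OSTs holding each file's objects, so
> anything striped over the stuck OSTs hangs. Something along these
> lines (paths are placeholders; the _UUID name can be confirmed with
> lctl dl) can map a hanging file back to its OSTs:
> 
> # show which OSTs back a file that hangs on stat
> lfs getstripe /path/to/stuck/file
> 
> # list files in the directory that have objects on the suspect OST
> lfs find /path/to/dir --obd sphnx01-OST0192_UUID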
> 
> In an attempt to fix the situation, I tried to unmount the OSTs on
> oss29, but this was unsuccessful. We then decided to reboot oss29 on
> 5/17 around 3:30 pm. However, once it came back up, oss29 immediately
> returned to its previous unresponsive state with high load, and
> listing the directory still hung.
> 
> I ran lfsck on the MDT, but it just hung there.
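> 
> For reference, the online check would normally be started on the MDS
> along these lines; the target name matches our filesystem, and the
> options shown are the usual ones rather than an exact transcript of
> what I typed:
> 
> # start an online consistency check of the MDT and poll its progress
> lctl lfsck_start -M sphnx01-MDT0000 -t all
> lctl lfsck_query -M sphnx01-MDT0000
> 
> # stop it again if it makes no progress
> lctl lfsck_stop -M sphnx01-MDT0000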
> 
> Here are the related MDS syslog entries from that period:
> 
> May 16 05:09:13 sphnxmds01 kernel: Lustre:
> sphnx01-OST0192-osc-MDT0000: Connection to sphnx01-OST0192 (at
> 10.42.73.42@tcp) was lost; in progress operations using this service
> will wait for recovery to complete
> May 16 05:09:13 sphnxmds01 kernel: LustreError:
> 384795:0:(osp_precreate.c:677:osp_precreate_send())
> sphnx01-OST0192-osc-MDT0000: can't precreate: rc = -11
> May 16 05:09:13 sphnxmds01 kernel: Lustre:
> sphnx01-OST0192-osc-MDT0000: Connection restored to 10.42.73.42@tcp
> (at 10.42.73.42@tcp)
> May 16 05:09:13 sphnxmds01 kernel: LustreError:
> 384795:0:(osp_precreate.c:1340:osp_precreate_thread())
> sphnx01-OST0192-osc-MDT0000: cannot precreate objects: rc = -11
> 
> May 16 05:22:17 sphnxmds01 kernel: Lustre:
> sphnx01-OST0192-osc-MDT0000: Connection to sphnx01-OST0192 (at
> 10.42.73.42@tcp) was lost; in progress operations using this service
> will wait for recovery to complete
> May 16 05:22:17 sphnxmds01 kernel: LustreError:
> 384795:0:(osp_precreate.c:967:osp_precreate_cleanup_orphans())
> sphnx01-OST0192-osc-MDT0000: cannot cleanup orphans: rc = -11
> May 16 05:22:17 sphnxmds01 kernel: Lustre:
> sphnx01-OST0192-osc-MDT0000: Connection restored to 10.42.73.42@tcp
> (at 10.42.73.42@tcp)
> 
> And this is OSS syslog:
> 
> May 16 04:47:21 oss29 kernel: Lustre: sphnx01-OST0192: Export
> 000000009a00357a already connecting from 130.199.206.80@tcp
> May 16 04:47:26 oss29 kernel: Lustre: sphnx01-OST0192: Export
> 000000009a00357a already connecting from 130.199.206.80@tcp
> May 16 04:47:31 oss29 kernel: Lustre: sphnx01-OST0192: Export
> 0000000012e8b1d0 already connecting from 130.199.48.37@tcp
> May 16 04:47:36 oss29 kernel: Lustre: sphnx01-OST0192: Export
> 000000009a00357a already connecting from 130.199.206.80@tcp
> May 16 04:47:36 oss29 kernel: Lustre: Skipped 1 previous similar 
> message
> May 16 04:47:41 oss29 kernel: Lustre: sphnx01-OST0192: Export
> 000000009a00357a already connecting from 130.199.206.80@tcp
> May 16 04:47:51 oss29 kernel: Lustre: sphnx01-OST0192: Export
> 000000009a00357a already connecting from 130.199.206.80@tcp
> May 16 04:47:51 oss29 kernel: Lustre: Skipped 1 previous similar 
> message
> May 16 04:48:11 oss29 kernel: Lustre: sphnx01-OST0192: Export
> 000000009a00357a already connecting from 130.199.206.80@tcp
> May 16 04:48:11 oss29 kernel: Lustre: Skipped 4 previous similar 
> messages
> May 16 04:48:43 oss29 kernel: Lustre: sphnx01-OST0192: Export
> 000000000f4ce150 already connecting from 130.199.206.154@tcp
> May 16 04:48:43 oss29 kernel: Lustre: Skipped 13 previous similar 
> messages
> May 16 04:49:43 oss29 kernel: Lustre: ll_ost_io01_102: service thread
> pid 822490 was inactive for 200.303 seconds. The thread might be
> hung, or it might only be slow and will resume later. Dumping the
> stack trace for debugging purposes:
> May 16 04:49:43 oss29 kernel: Pid: 604064, comm: ll_ost_io00_045
> 4.18.0-425.3.1.el8_lustre.x86_64 #1 SMP Wed Jan 11 23:55:00 UTC 2023
> May 16 04:49:43 oss29 kernel: Lustre: Skipped 1 previous similar 
> message
> May 16 04:49:43 oss29 kernel: Call Trace TBD:
> May 16 04:49:43 oss29 kernel: Lustre: ll_ost_io00_013: service thread
> pid 365744 was inactive for 200.187 seconds. Watchdog stack traces
> are limited to 3 per 300 seconds, skipping this one.
> May 16 04:49:43 oss29 kernel: Lustre: Skipped 1 previous similar 
> message
> May 16 04:49:43 oss29 kernel: [<0>] md_bitmap_startwrite+0x173/0x1e0
> May 16 04:49:43 oss29 kernel: [<0>] add_stripe_bio+0x4b8/0x7e0 
> [raid456]
> May 16 04:49:43 oss29 kernel: [<0>] raid5_make_request+0x218/0xc50 
> [raid456]
> May 16 04:49:43 oss29 kernel: [<0>] md_handle_request+0x128/0x1b0
> May 16 04:49:43 oss29 kernel: [<0>] md_make_request+0x5b/0xb0
> May 16 04:49:43 oss29 kernel: [<0>] 
> generic_make_request_no_check+0x202/0x330
> May 16 04:49:43 oss29 kernel: [<0>] submit_bio+0x3c/0x160
> May 16 04:49:43 oss29 kernel: [<0>]
> osd_do_bio.constprop.51+0x510/0xc40 [osd_ldiskfs]
> May 16 04:49:43 oss29 kernel: [<0>]
> osd_ldiskfs_map_inode_pages+0x873/0x8f0 [osd_ldiskfs]
> May 16 04:49:43 oss29 kernel: [<0>] osd_write_commit+0x5e2/0x990 
> [osd_ldiskfs]
> May 16 04:49:43 oss29 kernel: [<0>] ofd_commitrw_write+0x77e/0x1ad0 
> [ofd]
> May 16 04:49:43 oss29 kernel: [<0>] ofd_commitrw+0x5b4/0xd20 [ofd]
> May 16 04:49:43 oss29 kernel: [<0>] obd_commitrw+0x1b0/0x380 [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>] tgt_brw_write+0x139f/0x1ce0 
> [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>] tgt_request_handle+0xc97/0x1a40 
> [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>]
> ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>] kthread+0x10a/0x120
> May 16 04:49:43 oss29 kernel: [<0>] ret_from_fork+0x1f/0x40
> May 16 04:49:43 oss29 kernel: Pid: 333512, comm: ll_ost_io01_002
> 4.18.0-425.3.1.el8_lustre.x86_64 #1 SMP Wed Jan 11 23:55:00 UTC 2023
> May 16 04:49:43 oss29 kernel: Call Trace TBD:
> May 16 04:49:43 oss29 kernel: [<0>] md_bitmap_startwrite+0x173/0x1e0
> May 16 04:49:43 oss29 kernel: [<0>] add_stripe_bio+0x4b8/0x7e0 
> [raid456]
> May 16 04:49:43 oss29 kernel: [<0>] raid5_make_request+0x218/0xc50 
> [raid456]
> May 16 04:49:43 oss29 kernel: [<0>] md_handle_request+0x128/0x1b0
> May 16 04:49:43 oss29 kernel: [<0>] md_make_request+0x5b/0xb0
> May 16 04:49:43 oss29 kernel: [<0>] 
> generic_make_request_no_check+0x202/0x330
> May 16 04:49:43 oss29 kernel: [<0>] submit_bio+0x3c/0x160
> May 16 04:49:43 oss29 kernel: [<0>]
> osd_do_bio.constprop.51+0xb63/0xc40 [osd_ldiskfs]
> May 16 04:49:43 oss29 kernel: [<0>]
> osd_ldiskfs_map_inode_pages+0x873/0x8f0 [osd_ldiskfs]
> May 16 04:49:43 oss29 kernel: [<0>] osd_write_commit+0x5e2/0x990 
> [osd_ldiskfs]
> May 16 04:49:43 oss29 kernel: [<0>] ofd_commitrw_write+0x77e/0x1ad0 
> [ofd]
> May 16 04:49:43 oss29 kernel: [<0>] ofd_commitrw+0x5b4/0xd20 [ofd]
> May 16 04:49:43 oss29 kernel: [<0>] obd_commitrw+0x1b0/0x380 [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>] tgt_brw_write+0x139f/0x1ce0 
> [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>] tgt_request_handle+0xc97/0x1a40 
> [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>]
> ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>] kthread+0x10a/0x120
> May 16 04:49:43 oss29 kernel: [<0>] ret_from_fork+0x1f/0x40
> May 16 04:49:43 oss29 kernel: Pid: 822490, comm: ll_ost_io01_102
> 4.18.0-425.3.1.el8_lustre.x86_64 #1 SMP Wed Jan 11 23:55:00 UTC 2023
> May 16 04:49:43 oss29 kernel: Call Trace TBD:
> May 16 04:49:43 oss29 kernel: [<0>] md_bitmap_startwrite+0x173/0x1e0
> May 16 04:49:43 oss29 kernel: [<0>] add_stripe_bio+0x4b8/0x7e0 
> [raid456]
> May 16 04:49:43 oss29 kernel: [<0>] raid5_make_request+0x218/0xc50 
> [raid456]
> May 16 04:49:43 oss29 kernel: [<0>] md_handle_request+0x128/0x1b0
> May 16 04:49:43 oss29 kernel: [<0>] md_make_request+0x5b/0xb0
> May 16 04:49:43 oss29 kernel: [<0>] 
> generic_make_request_no_check+0x202/0x330
> May 16 04:49:43 oss29 kernel: [<0>] submit_bio+0x3c/0x160
> May 16 04:49:43 oss29 kernel: [<0>]
> osd_do_bio.constprop.51+0x510/0xc40 [osd_ldiskfs]
> May 16 04:49:43 oss29 kernel: [<0>]
> osd_ldiskfs_map_inode_pages+0x873/0x8f0 [osd_ldiskfs]
> May 16 04:49:43 oss29 kernel: [<0>] osd_write_commit+0x5e2/0x990 
> [osd_ldiskfs]
> May 16 04:49:43 oss29 kernel: [<0>] ofd_commitrw_write+0x77e/0x1ad0 
> [ofd]
> May 16 04:49:43 oss29 kernel: [<0>] ofd_commitrw+0x5b4/0xd20 [ofd]
> May 16 04:49:43 oss29 kernel: [<0>] obd_commitrw+0x1b0/0x380 [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>] tgt_brw_write+0x139f/0x1ce0 
> [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>] tgt_request_handle+0xc97/0x1a40 
> [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>]
> ptlrpc_server_handle_request+0x323/0xbe0 [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>] ptlrpc_main+0xc0f/0x1570 [ptlrpc]
> May 16 04:49:43 oss29 kernel: [<0>] kthread+0x10a/0x120
> May 16 04:49:43 oss29 kernel: [<0>] ret_from_fork+0x1f/0x40
> May 16 04:49:47 oss29 kernel: Lustre: ll_ost_io00_118: service thread
> pid 809074 was inactive for 204.003 seconds. Watchdog stack traces
> are limited to 3 per 300 seconds, skipping this one.
> May 16 04:49:47 oss29 kernel: Lustre: Skipped 9 previous similar 
> messages
> May 16 04:49:48 oss29 kernel: Lustre: sphnx01-OST0192: Export
> 000000000f4ce150 already connecting from 130.199.206.154@tcp
> May 16 04:49:48 oss29 kernel: Lustre: Skipped 29 previous similar 
> messages
> May 16 04:49:51 oss29 kernel: Lustre: ll_ost_io01_087: service thread
> pid 822474 was inactive for 200.852 seconds. Watchdog stack traces
> are limited to 3 per 300 seconds, skipping this one.
> May 16 04:49:51 oss29 kernel: Lustre: Skipped 17 previous similar 
> messages
> May 16 04:50:36 oss29 kernel: Lustre: ll_ost01_000: service thread pid
> 333500 was inactive for 201.018 seconds. Watchdog stack
> traces are limited to 3 per 300 seconds, skipping this one.
> May 16 04:50:36 oss29 kernel: Lustre: Skipped 47 previous similar 
> messages
> May 16 04:50:36 oss29 kernel: INFO: task ll_ost01_000:333500 blocked
> for more than 120 seconds.
> May 16 04:50:36 oss29 kernel:      Tainted: G        W IOE
> --------- -  - 4.18.0-425.3.1.el8_lustre.x86_64 #1
> May 16 04:50:36 oss29 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 16 04:50:36 oss29 kernel: task:ll_ost01_000    state:D stack:    0
> pid:333500 ppid:     2 flags:0x80004080
> May 16 04:50:36 oss29 kernel: Call Trace:
> May 16 04:50:36 oss29 kernel: __schedule+0x2d1/0x860
> May 16 04:50:36 oss29 kernel: schedule+0x35/0xa0
> May 16 04:50:36 oss29 kernel: wait_transaction_locked+0x89/0xd0 [jbd2]
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: add_transaction_credits+0xd4/0x290 [jbd2]
> May 16 04:50:36 oss29 kernel: start_this_handle+0x10a/0x520 [jbd2]
> May 16 04:50:36 oss29 kernel: ? jbd2__journal_start+0x8f/0x1f0 [jbd2]
> May 16 04:50:36 oss29 kernel: ? kmem_cache_alloc+0x13f/0x280
> May 16 04:50:36 oss29 kernel: jbd2__journal_start+0xee/0x1f0 [jbd2]
> May 16 04:50:36 oss29 kernel: ? osd_trans_start+0x13b/0x500 
> [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: __ldiskfs_journal_start_sb+0x6e/0x140 
> [ldiskfs]
> May 16 04:50:36 oss29 kernel: osd_trans_start+0x13b/0x500 [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: tgt_client_data_update+0x468/0x6c0 
> [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? cfs_percpt_lock+0x16/0x100 [libcfs]
> May 16 04:50:36 oss29 kernel: tgt_client_new+0x5c2/0x880 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ofd_obd_connect+0x385/0x4f0 [ofd]
> May 16 04:50:36 oss29 kernel: target_handle_connect+0x611/0x29a0 
> [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? lock_timer_base+0x60/0x90
> May 16 04:50:36 oss29 kernel: ? internal_add_timer+0x42/0x70
> May 16 04:50:36 oss29 kernel: tgt_request_handle+0x569/0x1a40 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 
> [ptlrpc]
> May 16 04:50:36 oss29 kernel: ptlrpc_server_handle_request+0x323/0xbe0 
> [ptlrpc]
> May 16 04:50:36 oss29 kernel: ptlrpc_main+0xc0f/0x1570 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? __schedule+0x2d9/0x860
> May 16 04:50:36 oss29 kernel: ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
> May 16 04:50:36 oss29 kernel: kthread+0x10a/0x120
> May 16 04:50:36 oss29 kernel: ? set_kthread_struct+0x50/0x50
> May 16 04:50:36 oss29 kernel: ret_from_fork+0x1f/0x40
> May 16 04:50:36 oss29 kernel: INFO: task ll_ost_io00_000:333507
> blocked for more than 120 seconds.
> May 16 04:50:36 oss29 kernel:      Tainted: G        W IOE
> --------- -  - 4.18.0-425.3.1.el8_lustre.x86_64 #1
> May 16 04:50:36 oss29 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 16 04:50:36 oss29 kernel: task:ll_ost_io00_000 state:D stack:    0
> pid:333507 ppid:     2 flags:0x80004080
> May 16 04:50:36 oss29 kernel: Call Trace:
> May 16 04:50:36 oss29 kernel: __schedule+0x2d1/0x860
> May 16 04:50:36 oss29 kernel: ? blk_flush_plug_list+0xc2/0x100
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: schedule+0x35/0xa0
> May 16 04:50:36 oss29 kernel: md_bitmap_startwrite+0x173/0x1e0
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: add_stripe_bio+0x4b8/0x7e0 [raid456]
> May 16 04:50:36 oss29 kernel: raid5_make_request+0x218/0xc50 [raid456]
> May 16 04:50:36 oss29 kernel: ? generic_make_request_checks+0x2a5/0x4f0
> May 16 04:50:36 oss29 kernel: ? __bio_clone+0x60/0xe0
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: ? blk_queue_split+0xd4/0x670
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: md_handle_request+0x128/0x1b0
> May 16 04:50:36 oss29 kernel: md_make_request+0x5b/0xb0
> May 16 04:50:36 oss29 kernel: generic_make_request_no_check+0x202/0x330
> May 16 04:50:36 oss29 kernel: submit_bio+0x3c/0x160
> May 16 04:50:36 oss29 kernel: osd_do_bio.constprop.51+0x510/0xc40 
> [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: ? ldiskfs_map_blocks+0x607/0x610 
> [ldiskfs]
> May 16 04:50:36 oss29 kernel: ?
> osd_ldiskfs_map_inode_pages+0x873/0x8f0 [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: osd_ldiskfs_map_inode_pages+0x873/0x8f0
> [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: osd_write_commit+0x5e2/0x990 
> [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: ofd_commitrw_write+0x77e/0x1ad0 [ofd]
> May 16 04:50:36 oss29 kernel: ? cfs_crypto_hash_final+0x63/0x90 
> [libcfs]
> May 16 04:50:36 oss29 kernel: ? kfree+0xd0/0x250
> May 16 04:50:36 oss29 kernel: ofd_commitrw+0x5b4/0xd20 [ofd]
> May 16 04:50:36 oss29 kernel: ? obd_commitrw+0x1b0/0x380 [ptlrpc]
> May 16 04:50:36 oss29 kernel: obd_commitrw+0x1b0/0x380 [ptlrpc]
> May 16 04:50:36 oss29 kernel: tgt_brw_write+0x139f/0x1ce0 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? flush_work+0x42/0x1d0
> May 16 04:50:36 oss29 kernel: ? internal_add_timer+0x42/0x70
> May 16 04:50:36 oss29 kernel: ? _cond_resched+0x15/0x30
> May 16 04:50:36 oss29 kernel: ? mutex_lock+0xe/0x30
> May 16 04:50:36 oss29 kernel: tgt_request_handle+0xc97/0x1a40 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 
> [ptlrpc]
> May 16 04:50:36 oss29 kernel: ptlrpc_server_handle_request+0x323/0xbe0 
> [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: ptlrpc_main+0xc0f/0x1570 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
> May 16 04:50:36 oss29 kernel: kthread+0x10a/0x120
> May 16 04:50:36 oss29 kernel: ? set_kthread_struct+0x50/0x50
> May 16 04:50:36 oss29 kernel: ret_from_fork+0x1f/0x40
> May 16 04:50:36 oss29 kernel: INFO: task ll_ost_io00_001:333508
> blocked for more than 120 seconds.
> May 16 04:50:36 oss29 kernel:      Tainted: G        W IOE
> --------- -  - 4.18.0-425.3.1.el8_lustre.x86_64 #1
> May 16 04:50:36 oss29 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 16 04:50:36 oss29 kernel: task:ll_ost_io00_001 state:D stack:    0
> pid:333508 ppid:     2 flags:0x80004080
> May 16 04:50:36 oss29 kernel: Call Trace:
> May 16 04:50:36 oss29 kernel: __schedule+0x2d1/0x860
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: schedule+0x35/0xa0
> May 16 04:50:36 oss29 kernel: md_bitmap_startwrite+0x173/0x1e0
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: add_stripe_bio+0x4b8/0x7e0 [raid456]
> May 16 04:50:36 oss29 kernel: raid5_make_request+0x218/0xc50 [raid456]
> May 16 04:50:36 oss29 kernel: ? generic_make_request_checks+0x2a5/0x4f0
> May 16 04:50:36 oss29 kernel: ? __bio_clone+0x60/0xe0
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: ? blk_queue_split+0xd4/0x670
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: md_handle_request+0x128/0x1b0
> May 16 04:50:36 oss29 kernel: md_make_request+0x5b/0xb0
> May 16 04:50:36 oss29 kernel: generic_make_request_no_check+0x202/0x330
> May 16 04:50:36 oss29 kernel: submit_bio+0x3c/0x160
> May 16 04:50:36 oss29 kernel: osd_do_bio.constprop.51+0x510/0xc40 
> [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: ? __ldiskfs_journal_stop+0x36/0xb0 
> [ldiskfs]
> May 16 04:50:36 oss29 kernel: ?
> osd_ldiskfs_map_inode_pages+0x873/0x8f0 [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: osd_ldiskfs_map_inode_pages+0x873/0x8f0
> [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: osd_write_commit+0x5e2/0x990 
> [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: ofd_commitrw_write+0x77e/0x1ad0 [ofd]
> May 16 04:50:36 oss29 kernel: ? cfs_crypto_hash_final+0x63/0x90 
> [libcfs]
> May 16 04:50:36 oss29 kernel: ? kfree+0xd0/0x250
> May 16 04:50:36 oss29 kernel: ofd_commitrw+0x5b4/0xd20 [ofd]
> May 16 04:50:36 oss29 kernel: ? obd_commitrw+0x1b0/0x380 [ptlrpc]
> May 16 04:50:36 oss29 kernel: obd_commitrw+0x1b0/0x380 [ptlrpc]
> May 16 04:50:36 oss29 kernel: tgt_brw_write+0x139f/0x1ce0 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? flush_work+0x42/0x1d0
> May 16 04:50:36 oss29 kernel: ? internal_add_timer+0x42/0x70
> May 16 04:50:36 oss29 kernel: ? _cond_resched+0x15/0x30
> May 16 04:50:36 oss29 kernel: ? mutex_lock+0xe/0x30
> May 16 04:50:36 oss29 kernel: tgt_request_handle+0xc97/0x1a40 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 
> [ptlrpc]
> May 16 04:50:36 oss29 kernel: ptlrpc_server_handle_request+0x323/0xbe0 
> [ptlrpc]
> May 16 04:50:36 oss29 kernel: ptlrpc_main+0xc0f/0x1570 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
> May 16 04:50:36 oss29 kernel: kthread+0x10a/0x120
> May 16 04:50:36 oss29 kernel: ? set_kthread_struct+0x50/0x50
> May 16 04:50:36 oss29 kernel: ret_from_fork+0x1f/0x40
> May 16 04:50:36 oss29 kernel: INFO: task ll_ost_io00_002:333509
> blocked for more than 120 seconds.
> May 16 04:50:36 oss29 kernel:      Tainted: G        W IOE
> --------- -  - 4.18.0-425.3.1.el8_lustre.x86_64 #1
> May 16 04:50:36 oss29 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 16 04:50:36 oss29 kernel: task:ll_ost_io00_002 state:D stack:    0
> pid:333509 ppid:     2 flags:0x80004080
> May 16 04:50:36 oss29 kernel: Call Trace:
> May 16 04:50:36 oss29 kernel: __schedule+0x2d1/0x860
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: schedule+0x35/0xa0
> May 16 04:50:36 oss29 kernel: md_bitmap_startwrite+0x173/0x1e0
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: add_stripe_bio+0x4b8/0x7e0 [raid456]
> May 16 04:50:36 oss29 kernel: raid5_make_request+0x218/0xc50 [raid456]
> May 16 04:50:36 oss29 kernel: ? generic_make_request_checks+0x2a5/0x4f0
> May 16 04:50:36 oss29 kernel: ? __bio_clone+0x60/0xe0
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: ? blk_queue_split+0xd4/0x670
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: md_handle_request+0x128/0x1b0
> May 16 04:50:36 oss29 kernel: md_make_request+0x5b/0xb0
> May 16 04:50:36 oss29 kernel: generic_make_request_no_check+0x202/0x330
> May 16 04:50:36 oss29 kernel: submit_bio+0x3c/0x160
> May 16 04:50:36 oss29 kernel: osd_do_bio.constprop.51+0xb63/0xc40 
> [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: ? ldiskfs_map_blocks+0x607/0x610 
> [ldiskfs]
> May 16 04:50:36 oss29 kernel: ?
> osd_ldiskfs_map_inode_pages+0x873/0x8f0 [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: osd_ldiskfs_map_inode_pages+0x873/0x8f0
> [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: osd_write_commit+0x5e2/0x990 
> [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: ofd_commitrw_write+0x77e/0x1ad0 [ofd]
> May 16 04:50:36 oss29 kernel: ? cfs_crypto_hash_final+0x63/0x90 
> [libcfs]
> May 16 04:50:36 oss29 kernel: ? kfree+0xd0/0x250
> May 16 04:50:36 oss29 kernel: ofd_commitrw+0x5b4/0xd20 [ofd]
> May 16 04:50:36 oss29 kernel: ? obd_commitrw+0x1b0/0x380 [ptlrpc]
> May 16 04:50:36 oss29 kernel: obd_commitrw+0x1b0/0x380 [ptlrpc]
> May 16 04:50:36 oss29 kernel: tgt_brw_write+0x139f/0x1ce0 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? flush_work+0x42/0x1d0
> May 16 04:50:36 oss29 kernel: ? internal_add_timer+0x42/0x70
> May 16 04:50:36 oss29 kernel: ? _cond_resched+0x15/0x30
> May 16 04:50:36 oss29 kernel: ? mutex_lock+0xe/0x30
> May 16 04:50:36 oss29 kernel: tgt_request_handle+0xc97/0x1a40 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 
> [ptlrpc]
> May 16 04:50:36 oss29 kernel: ptlrpc_server_handle_request+0x323/0xbe0 
> [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: ptlrpc_main+0xc0f/0x1570 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
> May 16 04:50:36 oss29 kernel: kthread+0x10a/0x120
> May 16 04:50:36 oss29 kernel: ? set_kthread_struct+0x50/0x50
> May 16 04:50:36 oss29 kernel: ret_from_fork+0x1f/0x40
> May 16 04:50:36 oss29 kernel: INFO: task ll_ost_io01_001:333511
> blocked for more than 120 seconds.
> May 16 04:50:36 oss29 kernel:      Tainted: G        W IOE
> --------- -  - 4.18.0-425.3.1.el8_lustre.x86_64 #1
> May 16 04:50:36 oss29 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 16 04:50:36 oss29 kernel: task:ll_ost_io01_001 state:D stack:    0
> pid:333511 ppid:     2 flags:0x80004080
> May 16 04:50:36 oss29 kernel: Call Trace:
> May 16 04:50:36 oss29 kernel: __schedule+0x2d1/0x860
> May 16 04:50:36 oss29 kernel: schedule+0x35/0xa0
> May 16 04:50:36 oss29 kernel: wait_transaction_locked+0x89/0xd0 [jbd2]
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: add_transaction_credits+0xd4/0x290 [jbd2]
> May 16 04:50:36 oss29 kernel: ? ktime_get+0x3e/0xa0
> May 16 04:50:36 oss29 kernel: start_this_handle+0x10a/0x520 [jbd2]
> May 16 04:50:36 oss29 kernel: ? jbd2__journal_start+0x8f/0x1f0 [jbd2]
> May 16 04:50:36 oss29 kernel: ? kmem_cache_alloc+0x13f/0x280
> May 16 04:50:36 oss29 kernel: jbd2__journal_start+0xee/0x1f0 [jbd2]
> May 16 04:50:36 oss29 kernel: ? osd_trans_start+0x13b/0x500 
> [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: __ldiskfs_journal_start_sb+0x6e/0x140 
> [ldiskfs]
> May 16 04:50:36 oss29 kernel: osd_trans_start+0x13b/0x500 [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: ofd_write_attr_set+0x11d/0x1070 [ofd]
> May 16 04:50:36 oss29 kernel: ofd_commitrw_write+0x226/0x1ad0 [ofd]
> May 16 04:50:36 oss29 kernel: ? cfs_crypto_hash_final+0x63/0x90 
> [libcfs]
> May 16 04:50:36 oss29 kernel: ? kfree+0xd3/0x250
> May 16 04:50:36 oss29 kernel: ? free_unref_page_commit+0x91/0x110
> May 16 04:50:36 oss29 kernel: ofd_commitrw+0x5b4/0xd20 [ofd]
> May 16 04:50:36 oss29 kernel: ? obd_commitrw+0x1b0/0x380 [ptlrpc]
> May 16 04:50:36 oss29 kernel: obd_commitrw+0x1b0/0x380 [ptlrpc]
> May 16 04:50:36 oss29 kernel: tgt_brw_write+0x139f/0x1ce0 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? enqueue_task_fair+0x96/0x5c0
> May 16 04:50:36 oss29 kernel: ? flush_work+0x42/0x1d0
> May 16 04:50:36 oss29 kernel: ? internal_add_timer+0x42/0x70
> May 16 04:50:36 oss29 kernel: ? _cond_resched+0x15/0x30
> May 16 04:50:36 oss29 kernel: ? mutex_lock+0xe/0x30
> May 16 04:50:36 oss29 kernel: tgt_request_handle+0xc97/0x1a40 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 
> [ptlrpc]
> May 16 04:50:36 oss29 kernel: ptlrpc_server_handle_request+0x323/0xbe0 
> [ptlrpc]
> May 16 04:50:36 oss29 kernel: ptlrpc_main+0xc0f/0x1570 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
> May 16 04:50:36 oss29 kernel: kthread+0x10a/0x120
> May 16 04:50:36 oss29 kernel: ? set_kthread_struct+0x50/0x50
> May 16 04:50:36 oss29 kernel: ret_from_fork+0x1f/0x40
> May 16 04:50:36 oss29 kernel: INFO: task ll_ost_io01_002:333512
> blocked for more than 120 seconds.
> May 16 04:50:36 oss29 kernel:      Tainted: G        W IOE
> --------- -  - 4.18.0-425.3.1.el8_lustre.x86_64 #1
> May 16 04:50:36 oss29 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> May 16 04:50:36 oss29 kernel: task:ll_ost_io01_002 state:D stack:    0
> pid:333512 ppid:     2 flags:0x80004080
> May 16 04:50:36 oss29 kernel: Call Trace:
> May 16 04:50:36 oss29 kernel: __schedule+0x2d1/0x860
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: schedule+0x35/0xa0
> May 16 04:50:36 oss29 kernel: md_bitmap_startwrite+0x173/0x1e0
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: add_stripe_bio+0x4b8/0x7e0 [raid456]
> May 16 04:50:36 oss29 kernel: raid5_make_request+0x218/0xc50 [raid456]
> May 16 04:50:36 oss29 kernel: ? generic_make_request_checks+0x2a5/0x4f0
> May 16 04:50:36 oss29 kernel: ? __bio_clone+0x60/0xe0
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: ? blk_queue_split+0xd4/0x670
> May 16 04:50:36 oss29 kernel: ? finish_wait+0x80/0x80
> May 16 04:50:36 oss29 kernel: md_handle_request+0x128/0x1b0
> May 16 04:50:36 oss29 kernel: md_make_request+0x5b/0xb0
> May 16 04:50:36 oss29 kernel: generic_make_request_no_check+0x202/0x330
> May 16 04:50:36 oss29 kernel: submit_bio+0x3c/0x160
> May 16 04:50:36 oss29 kernel: osd_do_bio.constprop.51+0xb63/0xc40 
> [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: ? __ldiskfs_journal_stop+0x36/0xb0 
> [ldiskfs]
> May 16 04:50:36 oss29 kernel: ?
> osd_ldiskfs_map_inode_pages+0x873/0x8f0 [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: osd_ldiskfs_map_inode_pages+0x873/0x8f0
> [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: osd_write_commit+0x5e2/0x990 
> [osd_ldiskfs]
> May 16 04:50:36 oss29 kernel: ofd_commitrw_write+0x77e/0x1ad0 [ofd]
> May 16 04:50:36 oss29 kernel: ? cfs_crypto_hash_final+0x63/0x90 
> [libcfs]
> May 16 04:50:36 oss29 kernel: ? kfree+0xd0/0x250
> May 16 04:50:36 oss29 kernel: ofd_commitrw+0x5b4/0xd20 [ofd]
> May 16 04:50:36 oss29 kernel: ? obd_commitrw+0x1b0/0x380 [ptlrpc]
> May 16 04:50:36 oss29 kernel: obd_commitrw+0x1b0/0x380 [ptlrpc]
> May 16 04:50:36 oss29 kernel: tgt_brw_write+0x139f/0x1ce0 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? enqueue_task_fair+0x96/0x5c0
> May 16 04:50:36 oss29 kernel: ? flush_work+0x42/0x1d0
> May 16 04:50:36 oss29 kernel: ? internal_add_timer+0x42/0x70
> May 16 04:50:36 oss29 kernel: ? _cond_resched+0x15/0x30
> May 16 04:50:36 oss29 kernel: ? mutex_lock+0xe/0x30
> May 16 04:50:36 oss29 kernel: tgt_request_handle+0xc97/0x1a40 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? ptlrpc_nrs_req_get_nolock0+0xff/0x1f0 
> [ptlrpc]
> May 16 04:50:36 oss29 kernel: ptlrpc_server_handle_request+0x323/0xbe0 
> [ptlrpc]
> May 16 04:50:36 oss29 kernel: ptlrpc_main+0xc0f/0x1570 [ptlrpc]
> May 16 04:50:36 oss29 kernel: ? ptlrpc_wait_event+0x590/0x590 [ptlrpc]
> May 16 04:50:36 oss29 kernel: kthread+0x10a/0x120
> May 16 04:50:36 oss29 kernel: ? set_kthread_struct+0x50/0x50
> May 16 04:50:36 oss29 kernel: ret_from_fork+0x1f/0x40
> 
> Any idea what could have caused the issue?
> 
> Thanks,
> Jane
> 
> 
> 
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

