[lustre-discuss] hanging threads

Bertschinger, Thomas Andrew Hjorth bertschinger at lanl.gov
Mon Dec 18 16:21:04 PST 2023


Hello Ger,

Can you share the full stack trace from the log output for the hung thread? That will be helpful for diagnosing the issue. Some other clues: do you get any stack traces or error output on clients where you observe the hang? Does every client hang, or only some? Does it hang on any access to the FS at all, or only on certain files? 

When looking for such error output, it's good to check the logs during times when errors are not occurring as well, since Lustre writes a lot of messages that are "normal". If you recognize these then you can filter them out as noise when the actual problems are happening.
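
For example, a quick way to skim the server logs for Lustre messages (a minimal sketch; adjust to however your syslog/journald is set up):

    # Recent Lustre messages from the kernel ring buffer (run on the MDS/OSS)
    dmesg -T | grep -E 'Lustre(Error)?:' | tail -n 100

    # Or via journald, limited to the last hour
    journalctl -k --since "1 hour ago" | grep -E 'Lustre(Error)?:'

Running the same commands during a quiet period gives you a baseline of the "normal" messages to compare against.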

To diagnose if bad I/O from some particular application is causing the problem, using jobstats is very helpful. Here are some pages with information on Lustre jobstats:

https://wiki.lustre.org/Lustre_Monitoring_and_Statistics_Guide
https://doc.lustre.org/lustre_manual.xhtml#jobstats
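
As a rough sketch of what that looks like in practice (assuming a Slurm cluster; the filesystem name "scratch" is taken from your log output, so adjust both to your environment):

    # On the MGS: tag client RPCs with the Slurm job ID (persistent setting)
    lctl set_param -P jobid_var=SLURM_JOB_ID

    # On each OSS: per-job I/O statistics for every OST
    lctl get_param obdfilter.scratch-OST*.job_stats

    # On each MDS: per-job metadata statistics
    lctl get_param mdt.scratch-MDT*.job_stats

    # Optionally reset the counters after sampling
    lctl set_param obdfilter.scratch-OST*.job_stats=clear

If you don't use Slurm, jobid_var can be set to another scheduler's environment variable, or to procname_uid to group stats by process name and UID.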

Using jobstats, you can often correlate the errors with the job(s) doing the most I/O on the filesystem at the time. It's useful to have a script periodically send jobstats output to a monitoring/logging service, so that you can compare historical data with previous errors as well. We've been able to identify many "problem apps" with bad I/O patterns this way. Of course if the problem doesn't come from a client application, but is from something else like a hardware failure, jobstats won't help identify that.
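
A minimal sketch of such a collector (the output path and interval are placeholders; you would point a log shipper such as rsyslog or filebeat at the file):

    #!/bin/sh
    # collect-jobstats.sh -- run from cron on each OSS, e.g. every 5 minutes.
    # Appends a timestamped job_stats snapshot for later correlation.
    OUT=/var/log/lustre-jobstats.log
    {
        echo "=== $(date -Is) $(hostname) ==="
        lctl get_param obdfilter.*.job_stats
    } >> "$OUT"

With that history you can look up which jobs were hammering the OSTs at the timestamps in your error logs.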

- Thomas Bertschinger

________________________________________
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of vaibhav pol via lustre-discuss <lustre-discuss at lists.lustre.org>
Sent: Monday, December 18, 2023 3:36 AM
To: Strikwerda, Ger
Cc: Lustre discussion
Subject: [EXTERNAL] Re: [lustre-discuss] hanging threads

iotop can be used to debug I/O performance. lfs check servers (or lctl get_param health_check) can be used to get the Lustre health status.
The error "scratch-OST0084_UUID: not available for connect from 172.23.15.246 at tcp30 (no target)" indicates a possible network issue, so check the network as well.
Also verify the health of the storage devices behind the OST served by the ll_ost00_036 thread; smartctl can be used for that.
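
For example (the disk device name below is only a placeholder):

    # Overall Lustre health on an OSS/MDS
    lctl get_param health_check

    # List the local Lustre devices/targets and their state
    lctl dl

    # SMART health of a suspect disk behind the affected OST
    smartctl -H /dev/sdX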



On Mon, 18 Dec 2023 at 15:28, Strikwerda, Ger via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:

Dear all,

Since last week we have been facing 'hanging kernel threads' that cause our Lustre environment (Rocky 8.7/Lustre 2.15.2) to hang.

errors:

Dec 18 10:36:04 hb-oss01 kernel: LustreError: 137-5: scratch-OST0084_UUID: not available for connect from 172.23.15.246 at tcp30 (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 18 10:36:04 hb-oss01 kernel: LustreError: Skipped 330 previous similar messages
Dec 18 10:36:04 hb-oss01 kernel: ptlrpc_watchdog_fire: 1 callbacks suppressed
Dec 18 10:36:04 hb-oss01 kernel: Lustre: ll_ost00_036: service thread pid 85609 was inactive for 1062.652 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:

At that moment there were 231 jobs running, with not really high I/O. Normally we run way more jobs and way more I/O.

environment is

2 MDS
4 OSS
160 OSTs
250 clients

network is tcp

According to the internet, this could be caused by 'bad I/O'. Are there any useful things to check to isolate where this bad I/O is coming from? How do others pinpoint these issues?

Any feedback is very welcome,

--

Vriendelijke groet,

Ger Strikwerda
senior expert multidisciplinary enabler
simple solution architect
Rijksuniversiteit Groningen
CIT/RDMS/HPC

Smitsborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9276

"God is hard, God is fair
 some men he gave brains, others he gave hair"

