[lustre-discuss] hanging threads

Mon Dec 18 01:57:00 PST 2023

Dear all,

Since last week we have been seeing hung kernel threads that cause our Lustre
environment (Rocky 8.7 / Lustre 2.15.2) to hang.

Errors:

Dec 18 10:36:04 hb-oss01 kernel: LustreError: 137-5: scratch-OST0084_UUID:
not available for connect from 172.23.15.246@tcp30 (no target). If you are
running an HA pair check that the target is mounted on the other server.
Dec 18 10:36:04 hb-oss01 kernel: LustreError: Skipped 330 previous similar
messages
Dec 18 10:36:04 hb-oss01 kernel: ptlrpc_watchdog_fire: 1 callbacks
suppressed
Dec 18 10:36:04 hb-oss01 kernel: Lustre: ll_ost00_036: service thread pid
85609 was inactive for 1062.652 seconds. The thread might be hung, or it
might only be slow and will resume later. Dumping the stack trace for
debugging purposes:

At that moment 231 jobs were running and I/O was not particularly high.
Normally we run far more jobs with far more I/O.

The environment is:

2 MDS
4 OSS
160 OSTs
250 clients

The network is TCP.

From what I have read online, this could be caused by 'bad I/O'. Is there
anything useful we can check to isolate where this bad I/O is coming from?
How do others pinpoint these issues?
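
As a starting point, here is a minimal sketch of how the two kinds of kernel
messages quoted above could be tallied per target, per client and per service
thread. It is an illustration only: the regular expressions are modelled on
the two messages in this mail and may need adjusting for other Lustre
versions, the function name summarize() is made up for this example, and it
assumes each syslog entry is on a single line (not wrapped as in this mail).

```python
import re
from collections import Counter

# Patterns modelled on the two messages quoted in this mail (an assumption;
# wording may differ between Lustre versions). The NID separator is matched
# as either "@" or " at ", since list archives rewrite "@" as " at ".
NO_TARGET = re.compile(
    r"LustreError: 137-5: (?P<target>\S+?)_UUID:\s+not available for connect "
    r"from (?P<client>[^\s@]+)(?:@| at )(?P<net>\w+)"
)
HUNG = re.compile(
    r"Lustre: (?P<thread>\S+): service thread pid (?P<pid>\d+) was inactive "
    r"for (?P<secs>[\d.]+) seconds"
)


def summarize(lines):
    """Count refused connects per (target, client NID) and record the
    longest reported inactivity per service thread."""
    refused = Counter()  # (target, client NID) -> number of messages
    hung = {}            # thread name -> longest inactivity in seconds
    for line in lines:
        m = NO_TARGET.search(line)
        if m:
            nid = m["client"] + "@" + m["net"]
            refused[(m["target"], nid)] += 1
            continue
        m = HUNG.search(line)
        if m:
            secs = float(m["secs"])
            hung[m["thread"]] = max(secs, hung.get(m["thread"], 0.0))
    return refused, hung


# Usage with the two entries from this mail, unwrapped onto single lines.
sample = [
    "Dec 18 10:36:04 hb-oss01 kernel: LustreError: 137-5: "
    "scratch-OST0084_UUID: not available for connect from "
    "172.23.15.246@tcp30 (no target).",
    "Dec 18 10:36:04 hb-oss01 kernel: Lustre: ll_ost00_036: service thread "
    "pid 85609 was inactive for 1062.652 seconds.",
]
refused, hung = summarize(sample)
for (target, nid), count in refused.most_common():
    print(target, nid, count)
for thread, secs in sorted(hung.items(), key=lambda kv: -kv[1]):
    print(thread, secs)
```

Run over the full log of each OSS, a summary like this would at least show
whether the messages cluster on a few OSTs, clients or threads. Note that the
"Skipped N previous similar messages" lines mean the raw counts understate
the real number of events.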

Any feedback is very welcome,

-- 

Kind regards,

Ger Strikwerda
senior expert multidisciplinary enabler
simple solution architect
Rijksuniversiteit Groningen
CIT/RDMS/HPC

Smitsborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9276
"God is hard, God is fair
 some men he gave brains, others he gave hair"