[lustre-discuss] hanging threads
Strikwerda, Ger
g.j.c.strikwerda at rug.nl
Mon Dec 18 01:57:00 PST 2023
Dear all,
Since last week we have been seeing 'hanging kernel threads' that cause our
Lustre environment (Rocky 8.7/Lustre 2.15.2) to hang.
Errors from hb-oss01:
Dec 18 10:36:04 hb-oss01 kernel: LustreError: 137-5: scratch-OST0084_UUID: not available for connect from 172.23.15.246@tcp30 (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 18 10:36:04 hb-oss01 kernel: LustreError: Skipped 330 previous similar messages
Dec 18 10:36:04 hb-oss01 kernel: ptlrpc_watchdog_fire: 1 callbacks suppressed
Dec 18 10:36:04 hb-oss01 kernel: Lustre: ll_ost00_036: service thread pid 85609 was inactive for 1062.652 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
At that moment 231 jobs were running and the I/O load was not particularly
high. Normally we run far more jobs with far more I/O.
The environment is:
2 MDS
4 OSS
160 OSTs
250 clients
Network is TCP
According to the internet, this could be caused by 'bad I/O'. Is there
anything useful to check to isolate where this bad I/O is coming from? How do
others pinpoint these issues?
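As a possible starting point, the two message types quoted above can be tallied from the OSS syslog to see which service threads stall longest and which targets clients fail to reach. This is only a minimal sketch: the regular expressions are modelled on the two sample lines in this post, the function name summarize is hypothetical, and it assumes raw, unwrapped syslog lines with NIDs written as address@network.

```python
import re
import sys
from collections import Counter

# Patterns modelled on the two kernel messages quoted in this post (assumption:
# other Lustre versions may word these messages differently).
HUNG_RE = re.compile(
    r"kernel: Lustre: (?P<thread>\S+): service thread pid (?P<pid>\d+) "
    r"was inactive for (?P<secs>[\d.]+) seconds"
)
NO_TARGET_RE = re.compile(
    r"LustreError: 137-5: (?P<target>\S+?)_UUID: not available for connect "
    r"from (?P<nid>\S+) \(no target\)"
)


def summarize(lines):
    """Return (hung, no_target).

    hung maps (thread name, pid) to the longest reported inactivity in seconds.
    no_target counts "no target" connect failures per (target, client NID).
    """
    hung = {}
    no_target = Counter()
    for line in lines:
        m = HUNG_RE.search(line)
        if m:
            key = (m["thread"], int(m["pid"]))
            hung[key] = max(hung.get(key, 0.0), float(m["secs"]))
            continue
        m = NO_TARGET_RE.search(line)
        if m:
            no_target[(m["target"], m["nid"])] += 1
    return hung, no_target


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python3 this_script.py /path/to/syslog-file
    with open(sys.argv[1], errors="replace") as fh:
        hung, no_target = summarize(fh)
    for (thread, pid), secs in sorted(hung.items(), key=lambda kv: -kv[1]):
        print(f"{thread} pid {pid}: inactive up to {secs:.0f}s")
    for (target, nid), count in no_target.most_common():
        print(f"{target} unreachable from {nid}: {count} message(s)")
```

Note that the kernel rate-limits these messages ("Skipped 330 previous similar messages"), so the counts are lower bounds rather than exact totals.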
Any feedback is very welcome,
--
Vriendelijke groet,
Ger Strikwerda
senior expert multidisciplinary enabler
simple solution architect
Rijksuniversiteit Groningen
CIT/RDMS/HPC
Smitsborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9276
"God is hard, God is fair
some men he gave brains, others he gave hair"