[lustre-discuss] Identifying High I/O Jobs Causing OST to Hang
Andreas Dilger
adilger at ddn.com
Tue May 27 17:04:09 PDT 2025
On May 27, 2025, at 14:12, Ihsan Ur Rahman via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
Hello Lustre community,
I've searched through the documentation and various forums but haven’t found a clear solution for this issue.
We have a Lustre setup with 10 OSS nodes, each hosting 3 OSTs. Occasionally, one of the OSTs becomes unresponsive, and we’re forced to reboot the corresponding OSS to restore functionality. The logs show an error like:
bulk IO read error with 5c06fsdf-xxxxxxx, client will retry, rc=-110
This Lustre filesystem is primarily used for SLURM jobs running AI/ML workloads.
I’m trying to identify which SLURM job or user is initiating high I/O operations that could be causing these hangs, so that we can investigate or temporarily stop that user/job. I’ve tried setting the job ID tracking with:
lctl set_param -P jobid_var=SLURM_JOB_ID
But it doesn't seem to be working as expected.
Can you provide some details of what isn't working? Does the "jobid_name" variable contain "%j" to include the jobid from SLURM_JOB_ID?
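As a quick sanity check, something along these lines should work (a minimal sketch; the "%e.%u.%j" format string is just one common choice, and the persistent "-P" settings are assumed to be run on the MGS node):

    # verify what the clients are currently configured to send
    lctl get_param jobid_var jobid_name

    # a common setup: include executable name, UID, and the SLURM job ID
    lctl set_param -P jobid_var=SLURM_JOB_ID
    lctl set_param -P jobid_name="%e.%u.%j"

    # confirm that per-job entries actually show up on the servers
    lctl get_param obdfilter.*.job_stats    # on an OSS
    lctl get_param mdt.*.job_stats          # on the MDS

If the job_stats output stays empty for new jobs, the clients are likely not picking up the jobid settings.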
Does anyone have a reliable method for identifying the SLURM users or jobs responsible for high I/O operations on Lustre, and how can I mitigate the hung OSTs?
Any insights or suggestions would be greatly appreciated. If further details are required, I am at your disposal.
If you have JobStats working, you could try using the "lljobstat" tool to monitor the jobs issuing the most RPCs to a single server node.
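Roughly along these lines, run on the OSS hosting the problem OST (the flag values below are only an example and option spellings may vary between releases, so check lljobstat --help first):

    # show the top 5 jobs by RPC count, refreshing every 10 seconds
    lljobstat -c 5 -i 10

If lljobstat is not installed, the raw counters it reads are the same obdfilter.*.job_stats entries shown above, so you can also inspect those directly with lctl get_param and sort by the read/write sample counts.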
Cheers, Andreas
—
Andreas Dilger
Lustre Principal Architect
Whamcloud/DDN