<div dir="ltr"><p>Hello Lustre community,</p>
<p>I've searched through the documentation and various forums but haven’t found a clear solution for this issue.</p>
<p>We have a Lustre setup with 10 OSS nodes, each hosting 3 OSTs. Occasionally, one of the OSTs becomes unresponsive, and we’re forced to reboot the corresponding OSS to restore functionality. The logs show an error like:</p><pre class="gmail-overflow-visible!"><div class="gmail-contain-inline-size gmail-rounded-md gmail-border-[0.5px] gmail-border-token-border-medium gmail-relative gmail-bg-token-sidebar-surface-primary"><div class="gmail-overflow-y-auto gmail-p-4" dir="ltr"><code class="gmail-whitespace-pre!">bulk IO <span class="gmail-hljs-built_in">read</span> <span class="gmail-hljs-built_in">error</span> with <span class="gmail-hljs-number">5</span>c06fsdf-xxxxxxx, client will retry, rc=<span class="gmail-hljs-number">-110</span>
</code></div></div></pre>
<p>This Lustre filesystem is primarily used for SLURM jobs running AI/ML workloads.</p>
<p>I’m trying to identify which SLURM job or user is generating the heavy I/O that could be causing these hangs, so that we can investigate or temporarily stop that user/job. I’ve tried enabling job ID tracking with:</p><pre class="gmail-overflow-visible!"><div class="gmail-contain-inline-size gmail-rounded-md gmail-border-[0.5px] gmail-border-token-border-medium gmail-relative gmail-bg-token-sidebar-surface-primary"><div class="gmail-overflow-y-auto gmail-p-4" dir="ltr"><code class="gmail-whitespace-pre!"><span class="gmail-hljs-attribute">lctl</span> set_param -P jobid_var=SLURM_JOB_ID
</code></div></div></pre>
<p>But it doesn't seem to be working as expected.</p>
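<p>For reference, once job IDs do show up in the OSS-side stats, ranking jobs by bytes moved could look roughly like the sketch below. It is only a minimal illustration, assuming the text layout that <code>lctl get_param obdfilter.*.job_stats</code> prints on an OSS (one <code>- job_id:</code> entry per job, with <code>read_bytes</code> and <code>write_bytes</code> lines carrying a <code>sum:</code> field). The function name <code>rank_jobs</code>, the filesystem name <code>testfs</code>, and the sample numbers are all made up for the example.</p>

```python
import re
from collections import defaultdict

# Matches "- job_id: 1234" entries (job IDs are kept as strings).
JOB_RE = re.compile(r"^\s*-\s*job_id:\s*(\S+)")
# Matches read_bytes/write_bytes lines and captures the "sum" field.
BYTES_RE = re.compile(r"^\s*(read_bytes|write_bytes):.*\bsum:\s*(\d+)")


def rank_jobs(text):
    """Sum read+write bytes per job across all OSTs, largest first."""
    totals = defaultdict(int)
    current = None
    for line in text.splitlines():
        m = JOB_RE.match(line)
        if m:
            current = m.group(1)
            totals[current] += 0  # register the job even if it moved no data
            continue
        m = BYTES_RE.match(line)
        if m and current is not None:
            totals[current] += int(m.group(2))
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample in the assumed job_stats layout (not real output).
SAMPLE = """\
obdfilter.testfs-OST0000.job_stats=
job_stats:
- job_id:          1001
  snapshot_time:   1700000000
  read_bytes:      { samples: 2, unit: bytes, min: 4096, max: 4096, sum: 8192 }
  write_bytes:     { samples: 1, unit: bytes, min: 1024, max: 1024, sum: 1024 }
- job_id:          1002
  snapshot_time:   1700000001
  read_bytes:      { samples: 4, unit: bytes, min: 1048576, max: 1048576, sum: 4194304 }
  write_bytes:     { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 }
obdfilter.testfs-OST0001.job_stats=
job_stats:
- job_id:          1001
  snapshot_time:   1700000002
  read_bytes:      { samples: 1, unit: bytes, min: 4096, max: 4096, sum: 4096 }
  write_bytes:     { samples: 0, unit: bytes, min: 0, max: 0, sum: 0 }
"""

if __name__ == "__main__":
    for job_id, total in rank_jobs(SAMPLE):
        print(job_id, total)
```

<p>The output of the real command on each OSS could be piped into such a script, and the resulting job IDs looked up in SLURM accounting to find the owning user.</p>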
<p>Does anyone have a reliable method for identifying the SLURM users or jobs responsible for heavy I/O on Lustre? And how can I mitigate the hanging OSTs?</p>
<p>Any insights or suggestions would be greatly appreciated. If further details are required, I am at your disposal. </p><p><br></p><p>Regards,</p><p><br></p><p>Ihsan Ur Rahman </p></div>