<div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Hey John,</div><div><br></div><div>I am sure that 

the great experts on the list will have a better answer, but in the meantime,

could it be that your MDS is unable to get you the pointers for 

your next read fast enough because it is busy writing lots of metadata?</div><div><br></div><div>HTH,</div><div>Eliyahu - אליהו</div></div><br></div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Tue, Jan 13, 2026 at 2:50 AM John Bauer via lustre-discuss &lt;<a href="mailto:lustre-discuss@lists.lustre.org">lustre-discuss@lists.lustre.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">


  <div>

    <p>All,</p>

    <p>My recent questions all relate to my trying to understand the

      following issue.  I have an application that writes, reads

      forwards, and reads backwards a single file multiple times (as

      seen in the bottom frame of Image 1).  The file is striped 4x16M on 4

      ssd OSTs on 2 OSSs.  Everything runs along just great with transfer

      rates in the 5 GB/s range.  At some point, another application

      triggers approximately 135 GB of writes to each of the 32 hdd

      OSTs on the 16 OSSs of the file system.  When this happens, my

      application's performance drops to 4.8 MB/s, a 99.9% loss of

      performance for the 33+ second duration of the other

      application's writes.  My application is doing 16 MB preads and

      pwrites in parallel using 4 pthreads, with O_DIRECT on the

      client.  The main question I have is: "Why do the writes from the

      other application affect my application so dramatically?" My

      demands on the 2 OSSs, about 2.5 GB/s each, are of the same

      order of magnitude as what the other application is getting from

      those same 2 OSSs, about 4 GB/s each.  There should be no competition

      for the OSTs, as I am using ssd and the other application is using

      hdd.  If both applications are triggering Direct I/O on the OSSs,

      I would think there would be minimal competition for compute

      resources on the OSSs.  But as seen below in Image 3, there is a

      huge spike in CPU load during the other application's writes. 

      This is not a one-off event.  I see this about 2 out of every 3

      times I run this job.  I suspect the other application is one that

      checkpoints on a regular interval, but I am a non-root user and

      have no way to confirm this.  I am using PCP/pmapi to get the OSS

      data during my run.  In case the images get removed from the email, I

      have put Dropbox links to them in the alternate text.</p>
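
    <p>As a rough sanity check of the figures quoted above, here is a minimal Python sketch. It assumes decimal units (1 GB = 1e9 bytes), hdd OSTs spread evenly across the OSSs, and a burst of exactly 33 seconds (the message says "33+"). None of these assumptions come from the message, and all variable and function names are hypothetical.</p>

```python
# Sanity check of the throughput figures quoted in the message above.
# Assumptions (not stated in the message): decimal units (1 GB = 1e9 bytes),
# hdd OSTs spread evenly over the OSSs, and "33+ seconds" taken as exactly 33 s.
# All names below are hypothetical and exist only for this illustration.

GB = 1e9
MB = 1e6


def fractional_loss(baseline, degraded):
    """Return the fraction of throughput lost relative to the baseline."""
    return 1.0 - degraded / baseline


# Figures for the reporting application (4 ssd OSTs on 2 OSSs).
baseline_rate = 5 * GB      # bytes/s, normal transfer rate
degraded_rate = 4.8 * MB    # bytes/s, rate during the other job's writes
loss = fractional_loss(baseline_rate, degraded_rate)
my_rate_per_oss = baseline_rate / 2

# Figures for the other application (32 hdd OSTs on 16 OSSs).
bytes_per_ost = 135 * GB
burst_seconds = 33
hdd_osts_per_oss = 32 // 16
other_rate_per_ost = bytes_per_ost / burst_seconds
other_rate_per_oss = other_rate_per_ost * hdd_osts_per_oss
total_written = 32 * bytes_per_ost

print(f"throughput loss:          {loss:.2%}")
print(f"my rate per OSS:          {my_rate_per_oss / GB:.2f} GB/s")
print(f"other job, rate per OST:  {other_rate_per_ost / GB:.2f} GB/s")
print(f"other job, rate per OSS:  {other_rate_per_oss / GB:.2f} GB/s")
print(f"other job, total written: {total_written / 1e12:.2f} TB")
```

    <p>Under those assumptions the loss works out to 99.90%, in line with the 99.9% quoted. The other job's write rate comes out at roughly 4.1 GB/s per hdd OST, so the "about 4 GB/s each" figure may be a per-OST rather than a per-OSS rate if each OSS serves 2 hdd OSTs. That reading is only a guess, since a burst longer than 33 seconds would lower both numbers.</p>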

    <p>Thanks,</p>

    <p>John</p>

    <p><font size="6">Image 1:</font></p>

    <p><img src="cid:ii_19bb7b33ca0d9f7da531" alt="https://www.dropbox.com/scl/fi/kih8qf6byl3bi5gc9r296/floorpan_oss_pause.png?rlkey=0o00o7x3oaw24h3cl3dyxyb2p&st=wahbm0gg&dl=0" width="1354" height="870"></p>


    <p><font size="6">Image 2:</font></p>

    <p><img src="cid:ii_19bb7b33ca1466f29f12" alt="https://www.dropbox.com/scl/fi/e36jjoomqa3xkadcyhdw9/disk_V_RTC.png?rlkey=ujzx02n3us42ga9prsxm5dbkh&st=ato9s3gj&dl=0" width="1378" height="872"></p>


    <p><font size="6">Image 3:</font></p>

    <p><img src="cid:ii_19bb7b33ca1fae86d653" alt="https://www.dropbox.com/scl/fi/bzudgnwnecvkp3ra4kjvp/kernelAllLoad.png?rlkey=fni6lv4zwbt53aprg6twjmnsv&st=sy9expz6&dl=0" width="1362" height="877"></p>

  </div>


</blockquote></div>