[lustre-discuss] Dramatic loss of performance when another application does writing.
E.S. Rosenberg
esr+lustre at mail.hebrew.edu
Tue Jan 13 06:12:43 PST 2026
Hey John,
I am sure that the experts on the list will have a better answer, but in
the meantime: could it be that your MDS is unable to get you the pointers
for your next read fast enough because it is busy writing lots of
metadata?
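One rough way to check this from an unprivileged client would be to time a
few metadata-only operations while the slowdown is happening. The sketch
below is only that, a sketch: it assumes that mkdir and rmdir on the file
system are served by the MDS without touching the OSTs, and the function
name probe_metadata_latency and the PROBE_DIR environment variable are
made up for illustration.

```python
import os
import tempfile
import time


def probe_metadata_latency(directory, samples=20, interval_s=0.5):
    """Time mkdir+rmdir pairs in `directory`.

    Returns a list of (wall_clock_timestamp, elapsed_seconds) tuples, one per
    sample, so the timings can be lined up against other monitoring data.
    """
    results = []
    for i in range(samples):
        path = os.path.join(directory, f".mdprobe-{os.getpid()}-{i}")
        start = time.perf_counter()
        os.mkdir(path)  # assumed to be a metadata-only operation
        os.rmdir(path)  # removes the probe directory again
        elapsed = time.perf_counter() - start
        results.append((time.time(), elapsed))
        time.sleep(interval_s)
    return results


if __name__ == "__main__":
    # PROBE_DIR is a hypothetical variable; point it at a scratch directory on
    # the same file system as the data file. The fallback to the system temp
    # directory only keeps the script runnable and is not a meaningful probe.
    target = os.environ.get("PROBE_DIR", tempfile.gettempdir())
    for stamp, elapsed in probe_metadata_latency(target, samples=5, interval_s=0.2):
        print(f"{stamp:.3f} mkdir+rmdir took {elapsed * 1e3:.2f} ms")
```

If the timings stay flat while the reads stall, the MDS is probably not
the bottleneck; if they spike at the same moment, it is worth a closer
look.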
HTH,
Eliyahu - אליהו
On Tue, Jan 13, 2026 at 2:50 AM John Bauer via lustre-discuss <
lustre-discuss at lists.lustre.org> wrote:
> All,
>
> My recent questions all relate to my trying to understand the following
> issue. I have an application that writes, reads forwards, and reads
> backwards a single file multiple times (as seen in the bottom frame of
> Image 1). The file is striped 4x16M across 4 SSD OSTs on 2 OSSs.
> Everything runs along just great, with transfer rates in the 5 GB/s range.
> At some point, another application triggers approximately 135 GB of writes
> to each of the 32 HDD OSTs on the 16 OSSs of the file system. When this
> happens, my application's performance drops to 4.8 MB/s, a 99.9% loss of
> performance for the 33+ second duration of the other application's
> writes. My application is doing 16 MB preads and pwrites in parallel using
> 4 pthreads, with O_DIRECT on the client. The main question I have is:
> "Why do the writes from the other application affect my application so
> dramatically?" My demands on the 2 OSSs, 2.5 GB/s from each, are of the
> same order of magnitude as what the other application is getting from
> the same 2 OSSs, about 4 GB/s each. There should be no competition for
> the OSTs, as I am using SSD and the other application is using HDD. If
> both applications are triggering Direct I/O on the OSSs, I would think
> there would be minimal competition for compute resources on the OSSs. But
> as seen below in Image 3, there is a huge spike in CPU load during the
> other application's writes. This is not a one-off event: I see it about
> 2 out of every 3 times I run this job. I suspect the other application
> checkpoints at a regular interval, but I am a non-root user and have no
> way to confirm that. I am using PCP/pmapi to get the OSS data during my
> run. In case the images get removed from the email, I have used alternate
> text with links to Dropbox for the images.
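
For reference, the rates implied by the figures in the paragraph above can
be worked out as follows. This is a back-of-the-envelope sketch that
assumes decimal units (1 GB = 10^9 bytes), a burst of exactly 33 seconds,
and HDD OSTs spread evenly at 2 per OSS; none of these is confirmed in the
message. Under those assumptions the quoted "about 4 GB/s" matches the
per-OST write rate, and the per-OSS rate would be roughly twice that.

```python
GB = 1000**3
MB = 1000**2

# Figures quoted in the message.
normal_rate = 5 * GB            # aggregate rate of the read/write job, bytes/s
degraded_rate = 4.8 * MB        # rate during the other application's writes
ssd_oss_count = 2               # OSSs serving the 4 SSD OSTs
hdd_ost_count = 32
oss_count = 16
written_per_hdd_ost = 135 * GB  # "approximately 135 GB" per HDD OST
burst_seconds = 33              # assumption: "33+ second" taken as exactly 33

# Fractional loss of throughput during the burst.
loss = 1 - degraded_rate / normal_rate

# Load the read/write job places on each of its 2 OSSs.
per_oss_mine = normal_rate / ssd_oss_count

# Write rate of the other application, per HDD OST.
per_ost_other = written_per_hdd_ost / burst_seconds

# Assumption: HDD OSTs are evenly distributed, i.e. 32 / 16 = 2 per OSS.
per_oss_other = per_ost_other * hdd_ost_count / oss_count

# Aggregate write rate of the other application across the file system.
total_other = per_ost_other * hdd_ost_count

print(f"loss during burst:      {loss * 100:.1f} %")
print(f"my load per OSS:        {per_oss_mine / GB:.2f} GB/s")
print(f"other app, per HDD OST: {per_ost_other / GB:.2f} GB/s")
print(f"other app, per OSS:     {per_oss_other / GB:.2f} GB/s")
print(f"other app, aggregate:   {total_other / GB:.1f} GB/s")
```
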
>
> Thanks,
>
> John
>
> Image 1:
>
> [image:
> https://www.dropbox.com/scl/fi/kih8qf6byl3bi5gc9r296/floorpan_oss_pause.png?rlkey=0o00o7x3oaw24h3cl3dyxyb2p&st=wahbm0gg&dl=0]
>
>
> Image 2:
>
> [image:
> https://www.dropbox.com/scl/fi/e36jjoomqa3xkadcyhdw9/disk_V_RTC.png?rlkey=ujzx02n3us42ga9prsxm5dbkh&st=ato9s3gj&dl=0]
>
>
> Image 3:
>
> [image:
> https://www.dropbox.com/scl/fi/bzudgnwnecvkp3ra4kjvp/kernelAllLoad.png?rlkey=fni6lv4zwbt53aprg6twjmnsv&st=sy9expz6&dl=0]