[lustre-discuss] Dramatic loss of performance when another application does writing.
Andreas Dilger
adilger at thelustrecollective.com
Tue Jan 13 16:09:48 PST 2026
[consolidated a few threads in my reply here]
On Jan 13, 2026, at 07:12, E.S. Rosenberg <esr+lustre at mail.hebrew.edu> wrote:
>
> Hey John,
>
> I am sure that the great experts on the list will have a better answer, but in the meantime could it be that your MDS is unable to get you the pointers for your next read fast enough because it is busy writing lots of metadata?
The MDS is mostly not involved in IO. Unlike filesystems such as Ceph or GPFS, where blocks/chunks are continually allocated in small increments (32MiB) at some globally-visible level during writes and spread across all of the OSDs/VSDs, in Lustre the MDS only does OST object allocation once when the file is opened, and maybe a couple more times at component boundaries for PFL files. For reads, the client has the full file layout from the start and never needs to contact the MDS again.
The low-level details of block allocation for Lustre are managed within the OSS/OST itself; the client only needs the OST object FID(s) and offset to access the data, and never sees the actual blocks.
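(As an aside, the layout the client caches at open time can be inspected with "lfs getstripe"; a minimal example, assuming a file at /mnt/lustre/myfile:)

    # show the stripe layout and the OST object IDs backing the file;
    # this is everything the client needs to read/write without the MDS
    lfs getstripe -v /mnt/lustre/myfile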
On Jan 13, 2026, at 13:01, Mohr, Rick <mohrrf at ornl.gov> wrote:
> I wonder if this could be a credit issue. Do you know the size of the other job that is doing the checkpointing? It sounds like your job is just a single client job so it is going to have a limited number of credits (the default used to be 8 but I don't know if that is still the case). If the other job is using 100 nodes (just as an example), it could have 100x more outstanding IO requests than your job can. The spike in the server load makes me think that IO requests are getting backed up.
It could be the network credits, but it might also just be the sheer number of RPCs involved keeping all of the OST IO threads busy. If the large job is sending 100x as many IO RPCs, and the HDD OSTs have limited IOPS to service them, then the OSS IO threads could all become busy, and long queues will form for the SSD job's RPCs (even though they target otherwise "idle" OSTs) while they wait for HDD RPCs to complete and free up a thread. Even then, the volume of HDD OST RPCs is likely enough to drown out the SSD OST RPCs, so each SSD RPC will typically need to wait for multiple HDD RPCs to finish.
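(One way to see whether RPCs are backing up, from the client side; a sketch, assuming the filesystem is named FSNAME and OST0000 is one of the flash OSTs:)

    # per-OSC limit on concurrent RPCs to each OST (the default is 8)
    lctl get_param osc.*.max_rpcs_in_flight
    # per-OST histograms of RPCs in flight and pages per RPC
    lctl get_param osc.FSNAME-OST0000-osc-*.rpc_stats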
> If your application is using a single client which has some local SSD storage, maybe the Persistent Client Cache (PCC) feature might be of some benefit to you (if it's available on your file system).
PCC-RW is not production ready at the current time, though PCC-RO is usable with EXAScaler if there is client-local NVMe storage and the workload is read-mostly. It isn't clear if that is useful here if there continue to be writes to the same files during the job (which would invalidate the PCC-RO copy on the client each time). If it's a read-only workload on the files after initial write, and the writes are done to separate files, then PCC-RO might be viable.
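(If PCC-RO does turn out to be an option, whether a given file currently has a PCC copy on the client can be checked with something like the following; the path is a placeholder:)

    # report whether the file is attached to client-local PCC, and its state
    lfs pcc state /mnt/lustre/myfile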
> On Tue, Jan 13, 2026 at 2:50 AM John Bauer <bauerj at iodoctors.com> wrote:
>> All,
>> My questions of recent are related to my trying to understand the following issue. I have an application that is writing, reading forwards, and reading backwards, a single file multiple times (as seen in the bottom frame of Image 1). The file is striped 4x16M on 4 ssd OSTs on 2 OSS. Everything runs along just great with transfer rates in the 5GB/s range. At some point, another application triggers approximately 135 GB of writes to each of the 32 hdd OSTs on the 16 OSSs of the file system. When this happens my application's performance drops to 4.8 MB/s, a 99.9% loss of performance for the 33+ second duration of the other application's writes. My application is doing 16MB preads and pwrites in parallel using 4 pthreads, with O_DIRECT on the client. The main question I have is: "Why do the writes from the other application affect my application so dramatically?"
OSS RPC processing is done with one RPC per thread, and there is not a dedicated RPC pool for each OST. This is less of an issue when all OSTs on an OSS use the same storage technology, since they will typically have similar load and performance characteristics, but mixed-technology OSTs (which are fairly common in production) can expose fairness issues like the one seen here.
I think this view is supported by the increase in *load* on the OSS nodes. Note that this is not necessarily *CPU* usage, but rather just +1 *load* for every RPC currently being processed, even if most of those OSS threads are blocked waiting on HDD IO completion. The OSS load itself is not very high (only 32), which suggests the max service thread count is set too low. The slope on the trailing edge of the load graph is because it shows the 5-minute (300s) decaying load average from "uptime", even though the IO peak is only ~30s.
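(To confirm that on the OSS side, the thread counts and service stats can be checked directly, roughly as below; this needs root on each OSS, and parameter names may vary slightly by release:)

    # number of ost_io service threads actually started vs. the configured cap
    lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max
    # ptlrpc service stats for ost_io, including request wait time and queue depth
    lctl get_param ost.OSS.ost_io.stats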
A discussion in https://jira.whamcloud.com/browse/LU-14564 (and possibly other tickets I can't find) covered how to make the OST thread count more dynamic, and/or how to bind OSS IO service threads to specific OSTs to handle this kind of situation better, but this has not been implemented.
I think there are some potential shorter-term solutions here; all of them need admin intervention, but could potentially be deployed with minimal disruption (rough command sketches for all three follow the list):
- Increase ost.OSS.ost_io.threads_max above 32 to start more threads on each OSS; typically this is in the hundreds. This should help to some extent, as long as the thread count is high enough to leave threads free to handle the flash RPCs on top of the HDD RPCs already in flight.
- Increase osc.FSNAME-OST[flash].max_rpcs_in_flight and/or decrease osc.FSNAME-OST[hdd].max_rpcs_in_flight (which can be set on the client side), in addition to or instead of a larger OSS thread count, so that flash-using applications can get more RPCs in flight and grab a bigger share of thread time.
- Use NRS/TBF (or NRS/ORR) to balance the HDD OST writer against the flash OST RPCs. However, this is more complex to implement, since there is no direct "per OST" tunable (it depends mostly on client RPC attributes like UID, GID, PROJID, NID, JobID, etc.), and tuning the NRS parameters also depends on the actual speeds of the underlying storage devices. Making NRS/TBF easier to use (and enabled out of the box) is one area that I definitely think could use some attention.
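As a rough sketch of what those three options might look like on the command line (all numbers, OST indices, and the JobID pattern are illustrative placeholders, the TBF rule syntax should be checked against the NRS section of the Lustre manual for your version, and plain set_param is non-persistent; see "lctl set_param -P" for permanent settings):

    # (1) on each OSS: allow more ost_io service threads
    lctl set_param ost.OSS.ost_io.threads_max=512

    # (2) on the client: more RPCs in flight to each flash OST,
    #     fewer to each HDD OST (OST indices are hypothetical)
    lctl set_param osc.FSNAME-OST0000-osc-*.max_rpcs_in_flight=32
    lctl set_param osc.FSNAME-OST0004-osc-*.max_rpcs_in_flight=4

    # (3) on each OSS: enable NRS/TBF for the ost_io service and rate-limit
    #     the checkpointing application by JobID (rule name, JobID, and rate
    #     are examples only)
    lctl set_param ost.OSS.ost_io.nrs_policies="tbf jobid"
    lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start ckpt_limit jobid={checkpoint_app.500} rate=500"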
I don't think "lfs ladvise willread FILE" would help here, since I don't think this is a cache or IO contention issue (the OSTs are separate devices). It would be useful if, for example, your application were also doing lots of small/random reads to the HDD OSTs and the HDDs themselves were overloaded by the other application.
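(For reference, the full form of that command is roughly the following; the path is a placeholder, and an optional byte range can be given with -s/-e:)

    # ask the OSS to prefetch this file's data into its read cache
    lfs ladvise -a willread /path/to/FILE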
It would be useful to file an LU ticket for this with the details from your email, and link it to LU-14564, since it seems like a somewhat separate issue (independent OSTs vs. DLM lock contention). It _might_ be that the solution is the same, some kind of resource scheduling for the RPCs/threads to avoid false contention between RPCs, but it isn't clear whether just increasing the thread count is the right way to fix this.
Cheers, Andreas
>> I am making demands of the 2 OSS of about the same order of magnitude, 2.5GB/s each from 2 OSS, as the other application is getting from the same 2 OSS, about 4 GB/s each. There should be no competition for the OSTs, as I am using ssd and the other application is using hdd. If both applications are triggering Direct I/O on the OSSs, I would think there would be minimal competition for compute resources on the OSSs. But as seen below in Image 3, there is a huge spike in cpu load during the other application's writes. This is not a one-off event; I see this about 2 out of every 3 times I run this job. I suspect the other application is one that checkpoints on a regular interval, but I am a non-root user and have no way to determine that. I am using PCP/pmapi to get the OSS data during my run. If the images get removed from the email, I have used alternate text with links to Dropbox for the images.
>> Thanks,
>> John
>> Image 1:
>> Image 2:
>> Image 3:
---
Andreas Dilger
Principal Lustre Architect
adilger at thelustrecollective.com
Attachments:
disk_V_RTC.png: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20260114/e5fa3702/attachment-0003.png>
kernelAllLoad.png: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20260114/e5fa3702/attachment-0004.png>
floorpan_oss_pause.png: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20260114/e5fa3702/attachment-0005.png>