[lustre-discuss] [EXTERNAL] Dramatic loss of performance when another application does writing.
Mohr, Rick
mohrrf at ornl.gov
Tue Jan 13 12:01:09 PST 2026
John,
I wonder if this could be a credit issue. Do you know the size of the other job that is doing the checkpointing? It sounds like your job is a single-client job, so it will have a limited number of credits (the default used to be 8, but I don't know if that is still the case). If the other job is using 100 nodes (just as an example), it could have 100x more outstanding IO requests than your job can have. The spike in server load makes me think that IO requests are getting backed up.
Lustre has a limit on peer_credits, the number of outstanding IO requests per client, which helps prevent any one client from monopolizing a Lustre server. The nodes themselves also have a limit on the total number of credits, which caps the number of outstanding IO requests on the server (I think the number is related to limitations of the network fabric, but it can also serve to limit how many requests get queued on the server, helping prevent the server from getting overloaded). If a large job is checkpointing, then maybe that job is chewing up the server's credits, so that your application only gets a small number of IO requests added to a very large queue of outstanding requests. My knowledge of credits may be flawed or out of date (perhaps someone else on the list can correct me if it is), but it's one way contention could exist on a server even if there isn't contention on the OSTs themselves.
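As a back-of-the-envelope illustration of that argument (using only the assumed numbers from above: a per-client default of 8 credits and a hypothetical 100-node checkpointing job, not measured values):

```python
# Rough credit arithmetic; all numbers are assumptions carried over from
# the discussion above, not measurements from any real system.
peer_credits = 8          # outstanding IO requests allowed per client (old default)
big_job_clients = 100     # example node count for the checkpointing job
small_job_clients = 1     # the single-client job

big_outstanding = big_job_clients * peer_credits      # up to 800 queued requests
small_outstanding = small_job_clients * peer_credits  # up to 8 queued requests
share = small_outstanding / (big_outstanding + small_outstanding)
print(f"small job's share of the queue: {share:.1%}")  # → 1.0%
```

So even with zero OST contention, the single-client job could own only about 1% of the requests queued on a shared server.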
If your application is using a single client that has some local SSD storage, the Persistent Client Cache (PCC) feature might be of some benefit to you (if it's available on your file system).
--Rick
On 1/12/26, 7:52 PM, "lustre-discuss on behalf of John Bauer via lustre-discuss" <lustre-discuss-bounces at lists.lustre.org> wrote:
All,
My recent questions are related to my trying to understand the following issue. I have an application that writes, reads forwards, and reads backwards a single file multiple times (as seen in the bottom frame of Image 1). The file is striped 4x16M on 4 SSD OSTs on 2 OSSs. Everything runs along just fine with transfer rates in the 5 GB/s range. At some point, another application triggers approximately 135 GB of writes to each of the 32 HDD OSTs on the 16 OSSs of the file system. When this happens, my application's performance drops to 4.8 MB/s, a 99.9% loss of performance for the 33+ second duration of the other application's writes. My application is doing 16 MB preads and pwrites in parallel using 4 pthreads, with O_DIRECT on the client.
The main question I have is: "Why do the writes from the other application affect my application so dramatically?" I am making demands of the 2 OSSs of about the same order of magnitude (2.5 GB/s each) as the other application is getting from the same 2 OSSs (about 4 GB/s each). There should be no competition for the OSTs, as I am using SSD and the other application is using HDD. If both applications are triggering Direct I/O on the OSSs, I would think there would be minimal competition for compute resources on the OSSs. But as seen below in Image 3, there is a huge spike in CPU load during the other application's writes.
This is not a one-off event; I see it about 2 out of every 3 times I run this job. I suspect the other application is one that checkpoints on a regular interval, but I am a non-root user and have no way to determine that. I am using PCP/pmapi to get the OSS data during my run. If the images get removed from the email, I have used alternate text with links to Dropbox for the images.
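For reference, the access pattern is roughly the following (a minimal sketch, not the actual application code; the real run uses O_DIRECT with page-aligned buffers, which is omitted here so the sketch runs on any filesystem, and the file size is shrunk for illustration):

```python
import os
import tempfile
import threading

CHUNK = 16 * 1024 * 1024   # 16 MB transfers, matching the 16M stripe size
NTHREADS = 4               # one pwrite stream per pthread in the real code
NCHUNKS = 4                # small file for the sketch (64 MB total)

def writer(fd, tid, buf):
    # Each thread pwrites its own strided subset of 16 MB chunks, so the
    # threads share one fd without any seek coordination.
    for i in range(tid, NCHUNKS, NTHREADS):
        os.pwrite(fd, buf, i * CHUNK)

fd, path = tempfile.mkstemp()
# The real application opens with O_DIRECT and page-aligned buffers;
# omitted here so the sketch works anywhere.
buf = b"\0" * CHUNK
threads = [threading.Thread(target=writer, args=(fd, t, buf))
           for t in range(NTHREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_size = os.fstat(fd).st_size
os.close(fd)
os.unlink(path)
print(final_size == NCHUNKS * CHUNK)  # → True
```

The reverse-read phase is the same loop with os.pread and the chunk order reversed.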
Thanks,
John