[lustre-discuss] [EXTERNAL] O_DIRECT writes to 2nd PFL component dumps 1st PFL component from cache

John Bauer bauerj at iodoctors.com
Tue Jan 20 11:03:37 PST 2026


Andreas,

Thanks for the reply.  It was my mistake to jump immediately to the
conclusion that it was the O_DIRECT write to the second component that
was triggering the cache dump.  This morning I verified that the same
behavior is exhibited even when all writes are made without O_DIRECT.

Thanks,

John


On 1/16/2026 6:17 AM, Andreas Dilger wrote:
> It is likely the revocation of the layout lock, caused by the MDS allocating the new objects, that is causing the cache to be flushed.
>
> Strictly speaking, the client _shouldn't_ have to flush the cache in this case, because the OST objects in the first component have not changed, but it doesn't know this in advance so it is pre-emptively flushing the cache.
>
> Instead of ftruncate(0) you could ftruncate(final_size), or at least large enough to trigger creation of the later component(s) (they need to be created to store the file size from the truncate(size) call).
>
> Cheers, Andreas
>
>> On Jan 15, 2026, at 17:41, John Bauer via lustre-discuss<lustre-discuss at lists.lustre.org> wrote:
>>
>> Rick,
>> You were spot on.  I changed the test program to rewrite the file, with an ftruncate(0) in between.  It can be seen that the ftruncate(0) causes "cached" for all OSCs to drop to zero at about 1.4 seconds.  The subsequent rewrite does not dump the first component when a direct write goes to the 2nd component.
>> Thanks much for the insight.
>> John
>> [attachment: split_direct_2.png]
>> On 1/15/2026 5:31 PM, Mohr, Rick wrote:
>>> John,
>>>
>>> Have you run the same test a second time against the same file (i.e., overwriting data from the first test so that a new file isn't allocated by Lustre)? If so, do you see the same behavior both times? The reason I ask is that I am wondering whether this could be related to Lustre's lazy allocation of the second PFL component. Lustre will only allocate OSTs for the first component when the file is created, but as soon as you attempt to write into the second component, Lustre will allocate a set of OSTs for it. Maybe there is some locking that happens which forces the client to flush its cache? It's just a guess, but it might be worth testing if you haven't already done so.
>>>
>>> --Rick
>>>
>>>
>>> On 1/15/26, 3:43 PM, "lustre-discuss on behalf of John Bauer via lustre-discuss"<lustre-discuss-bounces at lists.lustre.org> wrote:
>>>
>>> All,
>>> I am back to trying to emulate hybrid I/O from user space, doing direct and buffered I/O to the same file concurrently. I open a file twice, once with O_DIRECT and once without. Note that you will see two different file names involved, buffered.dat and direct.dat; direct.dat is a symlink to buffered.dat, which is done so my tool can more easily display the direct and non-direct I/O separately. The file has striping of 512M at 4{100,101,102,103}x32M<ssd-pool + EOF at 4{104,105,106,107}x32M<ssd-pool.
>>>
>>> The application first writes 512M (32M per write) to only the first PFL component, using the non-direct fd. Then the application writes 512M (32M per write), alternating between the direct fd and the non-direct fd. The very first write (using the direct fd) into the 2nd component triggers the dump of the entire first component from buffer cache. From that point on, the two OSCs that handle the non-direct writes accumulate cache, while the two OSCs that handle the direct writes accumulate no cache.
>>>
>>> My question: why does Lustre dump the 1st component from buffer cache? The 1st and 2nd components do not even share OSCs, and Lustre has no problem dealing with direct and non-direct I/O in the same component (the 2nd component in this case). It would seem to me that if Lustre can correctly buffer direct and non-direct I/O in the same component, it should be able to do so across multiple components. My ultimate goal is to have the first, smaller component remain cached and the remainder of the file use direct I/O, but as soon as I do a direct I/O, I lose all my buffer cache.
>>>
>>> The top frame of the plot is the amount of cache used by each OSC versus time. The bottom frame is the file-position activity versus time. Next to each pwrite64() depicted, I indicate which OSC is being written to. I have also colored the pwrite64()s by whether they used the direct fd (green) or the non-direct fd (red). As soon as the 2nd PFL component is touched by a direct write, that write waits until the OSCs of the first PFL component dump all their cache.
>>> John
>>>
>>>
>>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ---
> Andreas Dilger
> Principal Lustre Architect
> adilger at thelustrecollective.com
>
