[lustre-discuss] Read/Write on specific stripe of file via C api
Apostolis Stamatis
el18034 at mail.ntua.gr
Wed Dec 11 02:10:46 PST 2024
Hello,
Coming back to this, I have proceeded with the one-file approach.
I am using a toy cluster with 1 combined MGS/MDT, 4 OSTs, and 4 clients,
each client handling a different section of the file in parallel.
The clients run in containers on the same VMs as the OSTs.
The file is striped across all 4 OSTs and a stripe size of 1MB is used
(unless mentioned otherwise).
I am using different file sizes, ranging from ~50MB to ~2.5GB, and
measuring end-to-end times for reading/writing the file.
I have performed the following experiments:
A) Using buffers of varying size (1MB, 2MB, 4MB) for the read/write calls.
B) To see whether stripe alignment is beneficial, I aligned the
read/write calls so that each call touches exactly one stripe. If I
understand correctly, this means each call has the form `pwrite(fd,
buffer, size, offset)` (same for pread), where offset is a multiple of
stripe_size and size = stripe_size (buffer size = stripe_size). For this
experiment, stripe_size = buffer size = 1MB. (A sketch of this loop is
included right after this list.)
C) With a 1MB buffer and no attention to stripe alignment, trying to
determine whether stripe_size matters by testing stripe_size = 65536,
655360, 6553600, and 1MB.
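
For reference, the stripe-aligned loop in experiment B looks roughly
like this (a simplified sketch; the per-client stripe assignment and the
function/variable names are illustrative only):

#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define STRIPE_SIZE (1UL << 20)   /* 1 MiB, same as the file's stripe_size */

/* Write the stripes [first_stripe, first_stripe + nstripes) of the file,
 * one full stripe per call: the offset is always a multiple of
 * STRIPE_SIZE and the request length is exactly STRIPE_SIZE.
 * buf must point to at least STRIPE_SIZE bytes of data. */
static int write_stripe_aligned(int fd, const char *buf,
                                uint64_t first_stripe, uint64_t nstripes)
{
    for (uint64_t s = first_stripe; s < first_stripe + nstripes; s++) {
        off_t offset = (off_t)(s * STRIPE_SIZE);

        if (pwrite(fd, buf, STRIPE_SIZE, offset) != (ssize_t)STRIPE_SIZE) {
            perror("pwrite");
            return -1;
        }
    }
    return 0;
}

The read path is the same loop with pread() instead of pwrite().
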
For a given file size, the results are almost identical for both read
and write across all my experiments.
My questions are:
Q1) Is the way I am trying to align calls with stripes (and in effect
make sure each call only needs one OST) correct?
Q2) If it is indeed correct, is it expected that I don't see any
difference when aligning calls with stripes vs. when I am not? Based on
our discussion and the best practices I found online, I would expect
better performance when alignment is taken into account.
Q3) Is it expected that I don't see any difference in performance with
variable stripe sizes (and a fixed size of read/write operations,
namely 1MB)?
Q4) Is it expected that I don't see any difference in performance with
variable sizes of read/write operations (and a fixed stripe_size of 1MB)?
Q5) If the parameters mentioned should indeed affect performance, any
idea why no difference is observed in my setup? For example, I was
thinking that the MGS/MDT node could be slow and thus a bottleneck, or
that the files are too small to show any significant difference, etc.
Are there any additional things I might be missing that would help me
better understand what is going on?
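
One extra sanity check I could add is reading back the file's actual
layout before each measurement, roughly like this (a minimal sketch; the
helper name is mine, it assumes a reasonably recent liblustreapi, error
handling is mostly omitted, and it links with -llustreapi):

#include <stdint.h>
#include <stdio.h>
#include <lustre/lustreapi.h>

/* Print the stripe_size and stripe_count the open file really has,
 * to confirm the runs use the layout I think they do. */
static int print_layout(int fd)
{
    struct llapi_layout *layout = llapi_layout_get_by_fd(fd, 0);
    uint64_t stripe_size = 0, stripe_count = 0;

    if (layout == NULL)
        return -1;

    llapi_layout_stripe_size_get(layout, &stripe_size);
    llapi_layout_stripe_count_get(layout, &stripe_count);
    printf("stripe_size=%llu stripe_count=%llu\n",
           (unsigned long long)stripe_size,
           (unsigned long long)stripe_count);

    llapi_layout_free(layout);
    return 0;
}
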
Thanks again for the help,
Apostolis
On 12/10/24 23:30, Andreas Dilger wrote:
> On Sep 30, 2024, at 13:26, Apostolis Stamatis <el18034 at mail.ntua.gr>
> wrote:
>>
>> Thank you very much Andreas.
>>
>> Your explanation was very insightful.
>>
>> I do have the following questions/thoughts:
>>
>> Let's say I have 2 available OSTs, and 4MB of data. The stripe-size
>> is 1MB. (Sizes are small for discussion purposes, I am trying to
>> understand what solution -if any- would perform better in general)
>>
>> I would like to compare the following two strategies of
>> writing/reading the data:
>>
>> A) I can store all the data in 1 single big lustre file, striped
>> across the 2 OSTs.
>>
>> B) I can create (e.g.) 4 smaller lustre files, each consisting of
>> 1MB of data. Suppose I place them manually in the same way that they
>> would be striped on strategy A.
>>
>> So the only difference between the 2 strategies is whether data is in
>> a single lustre file or not (meaning I make sure each OST has a
>> similar load in both cases).
>>
>> Then:
>>
>> Q1. Suppose I have 4 simultaneous processes, each wanting to read 1MB
>> of data. On strategy A, each process opens the file (via
>> llapi_file_open) and then reads the corresponding data by calculating
>> the offset from the start. On strategy B each process simply opens
>> the corresponding file and reads its data. Would there be any
>> difference in performance between the two strategies ?
>>
> For reading it is unlikely that there would be a significant
> difference in performance. For writing, option A would be somewhat
> slower than B for large amounts of data, because there would be some
> lock contention between parallel writers to the same file.
>
> However, if this behavior is expanded to a large scale, then having
> millions or billions of 1MB files would have a different kind of
> overhead to open/close each file separately and having to manage so
> many of those files vs. having fewer/larger files. Given that a single
> client can read/write GB/s, it makes sense to aggregate enough data
> per file to amortize the overhead of the lookup/open/stat/close.
>
> Large-scale HPC applications try to pick a middle ground, for example
> having 1 file per checkpoint timestep written in parallel (instead of
> 1M separate per-CPU files), but each timestep (hourly) has a different
> file. Alternately, each timestep could write individual files into a
> separate directory, if they are reasonably large (e.g. GB).
>
>> Q2. Suppose I have 1 process, wanting to read the (e.g.) 3rd MB of
>> data. Would strategy B be better, since it avoids the overhead of
>> "skipping" to the offset that is required in strategy A ?
>>
> Seeking the offset pointer within a file has no cost. That is just
> changing a number in the open file descriptor on the client, so it
> doesn't involve the servers or any kind of locking.
>
>> Q3. For question 2, would the answer be different if the read is not
>> aligned to the stripe-size? Meaning that in both strategies I would
>> have to skip to an offset (compared to Q2 where I could just read the
>> whole file in strategy B from the start), but in strategy A the skip
>> is bigger.
>>
> Same answer as 2 - the seeking itself has no cost. The *read* of
> unaligned data in this case is likely to be somewhat slower than
> reading aligned data (it may send RPCs to two OSTs, needing two
> separate locks, etc). However, with any large-sized read (e.g. 8 MB+)
> it is unlikely to make a significant difference.
>
>> Q4. One concern I have regarding strategy A is that all the stripes
>> of the file that are in the same OST are seen -internally- as one
>> object (as per "Understanding Lustre Internals"). Does this affect
>> performance when different, but not overlapping, parts of the file
>> (that are on the same OST) are being accessed (for example due to
>> locking)? Does it matter if the parts being accessed are on different
>> "chunks", e.g. the 1st and 3rd MB in the above example?
>>
>
> No, Lustre can allow concurrent read access to a single object from
> multiple threads/clients. When writing the file, there can also be
> concurrent write access to a single object, but only with
> non-overlapping regions. That would also be true if writing to
> separate files in option B (contention if two processes tried to write
> the same small file).
>
>> Also if there are any additional docs I can read on those topics
>> (apart from "Understanding Lustre internals") to get a better
>> understanding, please do point them out.
>>
> Patrick Farrell has presented at LAD and LUG a few times about
> optimizations to the IO pipeline, which may be interesting:
> https://wiki.lustre.org/Lustre_User_Group_2022
> - https://wiki.lustre.org/images/a/a3/LUG2022-Future_IO_Path-Farrell.pdf
> https://www.eofs.eu/index.php/events/lad-23/
> -
> https://www.eofs.eu/wp-content/uploads/2024/02/04-LAD-2023-Unaligned-DIO.pdf
> https://wiki.lustre.org/Lustre_User_Group_2024
> -
> https://wiki.lustre.org/images/a/a0/LUG2024-Hybrid_IO_Path_Update-Farrell.pdf
>
>> Thanks again for your help,
>>
>> Apostolis
>>
>>
>> On 9/23/24 00:42, Andreas Dilger wrote:
>>> On Sep 18, 2024, at 10:47, Apostolis Stamatis <el18034 at mail.ntua.gr>
>>> wrote:
>>>> I am trying to read/write a specific stripe for files striped
>>>> across multiple OSTs. I've been looking around the C api but with
>>>> no success so far.
>>>>
>>>>
>>>> Let's say I have a big file which is striped across multiple OSTs.
>>>> I have a cluster of compute nodes which perform some computation on
>>>> the data of the file. Each node needs only a subset of that data.
>>>>
>>>> I want each node to be able to read/write only the needed
>>>> information, so that all reads/writes can happen in parallel. The
>>>> desired data may or may not be aligned with the stripes (this is
>>>> secondary).
>>>>
>>>> It is my understanding that stripes are just parts of the file.
>>>> Meaning that if I have an array of 100 rows and stripe A contains
>>>> the first half, then it would contain the first 50 rows, is this
>>>> correct?
>>>
>>> This is not totally correct. The location of the data depends on
>>> the size of the data and the stripe size.
>>>
>>> For a 1-stripe file (the default unless otherwise specified),
>>> all of the data would be in a single object, regardless of the size
>>> of the data.
>>>
>>> For a 2-stripe file with stripe_size=1MiB, then the first MB of data
>>> [0-1MB) is on object 0, the second MB of data [1-2MB) is on object
>>> 1, and the third MB of data [2-3MB) is back on object 0, etc.
>>>
>>> See
>>> https://wiki.lustre.org/Understanding_Lustre_Internals#Lustre_File_Layouts for
>>> example.
>>>
>>>> To sum up my questions are:
>>>>
>>>> 1) Can I read/write a specific stripe of a file via the C api to
>>>> achieve better performance/locality?
>>>
>>> There is no Lustre llapi_* interface that provides this
>>> functionality, but you can of course read the file with regular
>>> read() or preferably pread() or readv() calls with the right file
>>> offsets.
>>>
>>>> 2) Is it correct that stripes include parts of the file, meaning
>>>> the raw data? If not, can the raw data be extracted from any
>>>> additional information stored in the stripe?
>>>
>>> For example, if you have a 4-stripe file, then the application
>>> should read every 4th MB of the file to stay on the same OST object.
>>> Note that the *OST* index is not necessarily the same as the
>>> *stripe* number of the file. To read the file from the local OST
>>> then it should check the local OST index and select that OST index
>>> from the file to determine the offset from the start of the file =
>>> stripe_size * stripe_number.
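
If I read this correctly, the offsets that stay on a single stripe of a
4-stripe file with 1MB stripe_size would be computed like this (my own
sketch, assuming a plain RAID-0 layout; the names are illustrative):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define STRIPE_SIZE  (1UL << 20)   /* 1 MiB */
#define STRIPE_COUNT 4             /* 4-stripe file */

/* Stripe `stripe_nr` owns the file regions starting at
 * (chunk * STRIPE_COUNT + stripe_nr) * STRIPE_SIZE for chunk = 0, 1, ...
 * so reading only these offsets keeps every request on one OST object. */
static off_t stripe_chunk_offset(uint64_t stripe_nr, uint64_t chunk)
{
    return (off_t)((chunk * STRIPE_COUNT + stripe_nr) * STRIPE_SIZE);
}

/* Read everything that belongs to stripe `stripe_nr` of a file of
 * `file_size` bytes, one stripe-sized pread() at a time.
 * buf must hold at least STRIPE_SIZE bytes. */
static int read_one_stripe(int fd, char *buf, uint64_t stripe_nr,
                           uint64_t file_size)
{
    for (uint64_t chunk = 0; ; chunk++) {
        off_t offset = stripe_chunk_offset(stripe_nr, chunk);

        if ((uint64_t)offset >= file_size)
            break;
        if (pread(fd, buf, STRIPE_SIZE, offset) < 0)
            return -1;
        /* process buf ... */
    }
    return 0;
}

Mapping the stripe number to the OST that actually holds it would still
need the layout information, as described above.
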
>>>
>>> However, you could also do this more easily by having a bunch of
>>> 1-stripe files and doing the reads directly on the local OSTs. You
>>> would run "lfs find DIR -i LOCAL_OST_IDX" to get a list of the files
>>> on each OST, and then process them directly.
>>>
>>>> 3) If each compute node is run on top of a different OST where
>>>> stripes of the file are stored, would it be better in terms of
>>>> performance to have the node read the stripe of its OST? (because
>>>> e.g. it avoids data transfer over the network)
>>>
>>> This is not necessarily needed, if you have a good network, but it
>>> depends on the workload. Local PCI storage access is about the same
>>> speed as remote PCI network access because they are limited by the
>>> PCI bus bandwidth. You would notice a difference if you have a
>>> large number of clients that are completely IO-bound and overwhelm
>>> the storage.
>>>
>>> Cheers, Andreas
>>> --
>>> Andreas Dilger
>>> Lustre Principal Architect
>>> Whamcloud
>>>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud
>