[lustre-discuss] Read/Write on specific stripe of file via C api
Apostolis Stamatis
el18034 at mail.ntua.gr
Wed Dec 11 02:10:46 PST 2024
Hello,
Coming back to this, I have proceeded with the one-file approach.
I am using a toy cluster with 1 combined MGS/MDT, 4 OSTs, and 4 clients,
each client handling a different section of the file in parallel.
The clients run in containers on the same VMs as the OSTs.
The file is striped across all 4 OSTs and a stripe size of 1MB is used
(unless mentioned otherwise).
I am using different file sizes, ranging from ~50MB to ~2.5GB, and
measuring end-to-end times for reading/writing the file.
I have performed the following experiments:
A) Using buffers of varying size (1MB, 2MB, 4MB) for the read/write calls.
B) To see whether stripe alignment is beneficial, I aligned the
read/write calls so that each call touches exactly one stripe. If I
understand correctly, this means each call has the form `pwrite(fd,
buffer, size, offset)` (same for pread), where offset is a multiple of
stripe_size and size = stripe_size (buffer size = stripe_size). For this
experiment, stripe_size = buffer size = 1MB. (A sketch of this loop is
included right after this list.)
C) With a 1MB buffer and no attention to stripe alignment, trying to
determine whether stripe_size matters by testing stripe_size = 65536,
655360, 6553600, and 1MB.
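
For reference, the stripe-aligned loop in experiment B looks roughly
like this (a simplified sketch; the per-client stripe assignment and the
function/variable names are illustrative only):

#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define STRIPE_SIZE (1UL << 20)   /* 1 MiB, same as the file's stripe_size */

/* Write the stripes [first_stripe, first_stripe + nstripes) of the file,
 * one full stripe per call: the offset is always a multiple of
 * STRIPE_SIZE and the request length is exactly STRIPE_SIZE.
 * buf must point to at least STRIPE_SIZE bytes of data. */
static int write_stripe_aligned(int fd, const char *buf,
                                uint64_t first_stripe, uint64_t nstripes)
{
    for (uint64_t s = first_stripe; s < first_stripe + nstripes; s++) {
        off_t offset = (off_t)(s * STRIPE_SIZE);

        if (pwrite(fd, buf, STRIPE_SIZE, offset) != (ssize_t)STRIPE_SIZE) {
            perror("pwrite");
            return -1;
        }
    }
    return 0;
}

The read path is the same loop with pread() instead of pwrite().
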
For a given file size, the results are almost identical for both read
and write across all my experiments.
My questions are:
Q1) Is the way I am trying to align calls with stripes (and in effect
make sure each call only needs one OST) correct?
Q2) If it is indeed correct, is it expected that I don't see any
difference when aligning calls with stripes vs. when I am not? Based on
our discussion and the best practices I found online, I would expect
better performance when alignment is taken into account.
Q3) Is it expected that I don't see any difference in performance with
variable stripe sizes (and a fixed size of read/write operations,
namely 1MB)?
Q4) Is it expected that I don't see any difference in performance with
variable sizes of read/write operations (and a fixed stripe_size of 1MB)?
Q5) If the parameters mentioned should indeed affect performance, any
idea why no difference is observed in my setup? For example, I was
thinking that the MGS/MDT node could be slow and thus a bottleneck, or
that the files are too small to show any significant difference, etc.
Are there any additional things I might be missing that would help me
better understand what is going on?
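
One extra sanity check I could add is reading back the file's actual
layout before each measurement, roughly like this (a minimal sketch; the
helper name is mine, it assumes a reasonably recent liblustreapi, error
handling is mostly omitted, and it links with -llustreapi):

#include <stdint.h>
#include <stdio.h>
#include <lustre/lustreapi.h>

/* Print the stripe_size and stripe_count the open file really has,
 * to confirm the runs use the layout I think they do. */
static int print_layout(int fd)
{
    struct llapi_layout *layout = llapi_layout_get_by_fd(fd, 0);
    uint64_t stripe_size = 0, stripe_count = 0;

    if (layout == NULL)
        return -1;

    llapi_layout_stripe_size_get(layout, &stripe_size);
    llapi_layout_stripe_count_get(layout, &stripe_count);
    printf("stripe_size=%llu stripe_count=%llu\n",
           (unsigned long long)stripe_size,
           (unsigned long long)stripe_count);

    llapi_layout_free(layout);
    return 0;
}
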
Thanks again for the help,
Apostolis
On 12/10/24 23:30, Andreas Dilger wrote:
> On Sep 30, 2024, at 13:26, Apostolis Stamatis <el18034 at mail.ntua.gr>
> wrote:
>>
>> Thank you very much Andreas.
>>
>> Your explanation was very insightful.
>>
>> I do have the following questions/thoughts:
>>
>> Let's say I have 2 available OSTs, and 4MB of data. The stripe-size
>> is 1MB. (Sizes are small for discussion purposes, I am trying to
>> understand what solution -if any- would perform better in general)
>>
>> I would like to compare the following two strategies of
>> writing/reading the data:
>>
>> A) I can store all the data in 1 single big lustre file, striped
>> across the 2 OSTs.
>>
>> B) I can create (e.g.) 4 smaller lustre files, each consisting of
>> 1MB of data. Suppose I place them manually in the same way that they
>> would be striped on strategy A.
>>
>> So the only difference between the 2 strategies is whether data is in
>> a single lustre file or not (meaning I make sure each OST has a
>> similar load in both cases).
>>
>> Then:
>>
>> Q1. Suppose I have 4 simultaneous processes, each wanting to read 1MB
>> of data. On strategy A, each process opens the file (via
>> llapi_file_open) and then reads the corresponding data by calculating
>> the offset from the start. On strategy B each process simply opens
>> the corresponding file and reads its data. Would there be any
>> difference in performance between the two strategies ?
>>
> For reading it is unlikely that there would be a significant
> difference in performance. For writing, option A would be somewhat
> slower than B for large amounts of data, because there would be some
> lock contention between parallel writers to the same file.
>
> However, if this behavior is expanded to a large scale, then having
> millions or billions of 1MB files would have a different kind of
> overhead to open/close each file separately and having to manage so
> many of those files vs. having fewer/larger files. Given that a single
> client can read/write GB/s, it makes sense to aggregate enough data
> per file to amortize the overhead of the lookup/open/stat/close.
>
> Large-scale HPC applications try to pick a middle ground, for example
> having 1 file per checkpoint timestep written in parallel (instead of
> 1M separate per-CPU files), but each timestep (hourly) has a different
> file. Alternately, each timestep could write individual files into a
> separate directory, if they are reasonably large (e.g. GB).
>
>> Q2. Suppose I have 1 process, wanting to read the (e.g.) 3rd MB of
>> data. Would strategy B be better, since it avoids the overhead of
>> "skipping" to the offset that is required in strategy A ?
>>
> Seeking the offset pointer within a file has no cost. That is just
> changing a number in the open file descriptor on the client, so it
> doesn't involve the servers or any kind of locking.
>
>> Q3. For question 2, would the answer be different if the read is not
>> aligned to the stripe-size? Meaning that in both strategies I would
>> have to skip to an offset (compared to Q2 where I could just read the
>> whole file in strategy B from the start), but in strategy A the skip
>> is bigger.
>>
> Same answer as 2 - the seeking itself has no cost. The *read* of
> unaligned data in this case is likely to be somewhat slower than
> reading aligned data (it may send RPCs to two OSTs, needing two
> separate locks, etc). However, with any large-sized read (e.g. 8 MB+)
> it is unlikely to make a significant difference.
>
>> Q4. One concern I have regarding strategy A is that all the stripes
>> of the file that are in the same OST are seen -internally- as one
>> object (as per "Understanding Lustre Internals"). Does this affect
>> performance when different, but not overlapping, parts of the file
>> (that are on the same OST) are being accessed (for example due to
>> locking)? Does it matter if the parts being accessed are on different
>> "chunks", e.g. the 1st and 3rd MB in the above example?
>>
>
> No, Lustre can allow concurrent read access to a single object from
> multiple threads/clients. When writing the file, there can also be
> concurrent write access to a single object, but only with
> non-overlapping regions. That would also be true if writing to
> separate files in option B (contention if two processes tried to write
> the same small file).
>
>> Also if there are any additional docs I can read on those topics
>> (apart from "Understanding Lustre internals") to get a better
>> understanding, please do point them out.
>>
> Patrick Farrell has presented at LAD and LUG a few times about
> optimizations to the IO pipeline, which may be interesting:
> https://wiki.lustre.org/Lustre_User_Group_2022
> - https://wiki.lustre.org/images/a/a3/LUG2022-Future_IO_Path-Farrell.pdf
> https://www.eofs.eu/index.php/events/lad-23/
> -
> https://www.eofs.eu/wp-content/uploads/2024/02/04-LAD-2023-Unaligned-DIO.pdf
> https://wiki.lustre.org/Lustre_User_Group_2024
> -
> https://wiki.lustre.org/images/a/a0/LUG2024-Hybrid_IO_Path_Update-Farrell.pdf
>
>> Thanks again for your help,
>>
>> Apostolis
>>
>>
>> On 9/23/24 00:42, Andreas Dilger wrote:
>>> On Sep 18, 2024, at 10:47, Apostolis Stamatis <el18034 at mail.ntua.gr>
>>> wrote:
>>>> I am trying to read/write a specific stripe for files striped
>>>> across multiple OSTs. I've been looking around the C api but with
>>>> no success so far.
>>>>
>>>>
>>>> Let's say I have a big file which is striped across multiple OSTs.
>>>> I have a cluster of compute nodes which perform some computation on
>>>> the data of the file. Each node needs only a subset of that data.
>>>>
>>>> I want each node to be able to read/write only the needed
>>>> information, so that all reads/writes can happen in parallel. The
>>>> desired data may or may not be aligned with the stripes (this is
>>>> secondary).
>>>>
>>>> It is my understanding that stripes are just parts of the file.
>>>> Meaning that if I have an array of 100 rows and stripe A contains
>>>> the first half, then it would contain the first 50 rows, is this
>>>> correct?
>>>
>>> This is not totally correct. The location of the data depends on
>>> the size of the data and the stripe size.
>>>
>>> For a 1-stripe file (the default unless otherwise specified),
>>> all of the data would be in a single object, regardless of the size
>>> of the data.
>>>
>>> For a 2-stripe file with stripe_size=1MiB, then the first MB of data
>>> [0-1MB) is on object 0, the second MB of data [1-2MB) is on object
>>> 1, and the third MB of data [2-3MB) is back on object 0, etc.
>>>
>>> See
>>> https://wiki.lustre.org/Understanding_Lustre_Internals#Lustre_File_Layouts for
>>> example.
>>>
>>>> To sum up my questions are:
>>>>
>>>> 1) Can I read/write a specific stripe of a file via the C api to
>>>> achieve better performance/locality?
>>>
>>> There is no Lustre llapi_* interface that provides this
>>> functionality, but you can of course read the file with regular
>>> read() or preferably pread() or readv() calls with the right file
>>> offsets.
>>>
>>>> 2) Is it correct that stripes include parts of the file, meaning
>>>> the raw data? If not, can the raw data be extracted from any
>>>> additional information stored in the stripe?
>>>
>>> For example, if you have a 4-stripe file, then the application
>>> should read every 4th MB of the file to stay on the same OST object.
>>> Note that the *OST* index is not necessarily the same as the
>>> *stripe* number of the file. To read the file from the local OST
>>> then it should check the local OST index and select that OST index
>>> from the file to determine the offset from the start of the file =
>>> stripe_size * stripe_number.
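
If I read this correctly, the offsets that stay on a single stripe of a
4-stripe file with 1MB stripe_size would be computed like this (my own
sketch, assuming a plain RAID-0 layout; the names are illustrative):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define STRIPE_SIZE  (1UL << 20)   /* 1 MiB */
#define STRIPE_COUNT 4             /* 4-stripe file */

/* Stripe `stripe_nr` owns the file regions starting at
 * (chunk * STRIPE_COUNT + stripe_nr) * STRIPE_SIZE for chunk = 0, 1, ...
 * so reading only these offsets keeps every request on one OST object. */
static off_t stripe_chunk_offset(uint64_t stripe_nr, uint64_t chunk)
{
    return (off_t)((chunk * STRIPE_COUNT + stripe_nr) * STRIPE_SIZE);
}

/* Read everything that belongs to stripe `stripe_nr` of a file of
 * `file_size` bytes, one stripe-sized pread() at a time.
 * buf must hold at least STRIPE_SIZE bytes. */
static int read_one_stripe(int fd, char *buf, uint64_t stripe_nr,
                           uint64_t file_size)
{
    for (uint64_t chunk = 0; ; chunk++) {
        off_t offset = stripe_chunk_offset(stripe_nr, chunk);

        if ((uint64_t)offset >= file_size)
            break;
        if (pread(fd, buf, STRIPE_SIZE, offset) < 0)
            return -1;
        /* process buf ... */
    }
    return 0;
}

Mapping the stripe number to the OST that actually holds it would still
need the layout information, as described above.
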
>>>
>>> However, you could also do this more easily by having a bunch of
>>> 1-stripe files and doing the reads directly on the local OSTs. You
>>> would run "lfs find DIR -i LOCAL_OST_IDX" to get a list of the files
>>> on each OST, and then process them directly.
>>>
>>>> 3) If each compute node is run on top of a different OST where
>>>> stripes of the file are stored, would it be better in terms of
>>>> performance to have the node read the stripe of its OST? (because
>>>> e.g. it avoids data transfer over the network)
>>>
>>> This is not necessarily needed, if you have a good network, but it
>>> depends on the workload. Local PCI storage access is about the same
>>> speed as remote PCI network access because they are limited by the
>>> PCI bus bandwidth. You would notice a difference if you have a
>>> large number of clients that are completely IO-bound and overwhelm
>>> the storage.
>>>
>>> Cheers, Andreas
>>> --
>>> Andreas Dilger
>>> Lustre Principal Architect
>>> Whamcloud
>>>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud
>