[Lustre-discuss] Poor multithreaded I/O performance
kmehta at cs.uh.edu
Thu Jun 2 17:06:48 PDT 2011
Hello,
I was wondering if anyone could replicate the performance of the
multithreaded application using the C file that I posted in my previous
email.
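
In case the attachment is hard to get at, here is a stripped-down sketch of the access pattern it exercises (this is my paraphrase, not simple_io_test.c itself; the thread count and file size are scaled down, and writer/run_test are illustrative names):

```c
/*
 * Sketch of the shared-file pattern under discussion: NTHREADS writers
 * share one output file, each through its own descriptor, issuing 1 MiB
 * pwrite() calls round-robin at non-overlapping offsets. The real test
 * used 8 threads and a 16 GB file; sizes here are scaled down so the
 * sketch runs anywhere.
 */
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 4
#define BLOCK    (1L << 20)  /* 1 MiB, matching the stripe size */
#define NBLOCKS  16          /* blocks written per thread */

static const char *path = "testfile";

static void *writer(void *arg)
{
    long id = (long)arg;
    int fd = open(path, O_WRONLY);   /* private fd per thread */
    char *buf = malloc(BLOCK);
    memset(buf, 'a' + (int)id, BLOCK);

    /* Round-robin: thread t writes blocks t, t+N, t+2N, ... */
    for (long b = id; b < (long)NTHREADS * NBLOCKS; b += NTHREADS)
        if (pwrite(fd, buf, BLOCK, b * BLOCK) != BLOCK) {
            perror("pwrite");
            exit(1);
        }
    free(buf);
    close(fd);
    return NULL;
}

/* Creates the file, runs all writers, returns total bytes written. */
long long run_test(void)
{
    pthread_t tid[NTHREADS];
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    close(fd);

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, writer, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return (long long)NTHREADS * NBLOCKS * BLOCK;
}
```

Timing a call to run_test() and dividing bytes by elapsed seconds gives the bandwidth figure; compile with cc -O2 -pthread.
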
Thanks,
Kshitij
> Ok I ran the following tests:
>
> [1]
> The application spawns 8 threads and writes to Lustre with 8 OSTs.
> Each thread writes data in blocks of 1 MByte in a round-robin fashion,
> i.e.
>
> T0 writes to offsets 0, 8MB, 16MB, etc.
> T1 writes to offsets 1MB, 9MB, 17MB, etc.
> The stripe size being 1 MByte, every thread ends up writing to only one OST.
>
> I see a bandwidth of 280 Mbytes/sec, similar to the single thread
> performance.
>
> [2]
> I also ran the same test such that every thread writes data in blocks
> of 8 MBytes with the same 1 MByte stripe size (thus, every thread
> writes to every OST). I still get similar performance, ~280 MBytes/sec,
> so essentially I see no difference between each thread writing to a
> single OST vs. each thread writing to all OSTs.
>
> And as I said before, if all threads write to their own separate file, the
> resulting bandwidth is ~700Mbytes/sec.
>
> I have attached my C file (simple_io_test.c) herewith. Maybe you could run
> it and see where the bottleneck is. Comments and instructions for
> compilation have been included in the file. Do let me know if you need any
> clarification on that.
>
> Your help is appreciated,
> Kshitij
>
>> This is what my application does:
>>
>> Each thread has its own file descriptor to the file.
>> I use pwrite to ensure non-overlapping regions, as follows:
>>
>> Thread 0, data_size: 1MB, offset: 0
>> Thread 1, data_size: 1MB, offset: 1MB
>> Thread 2, data_size: 1MB, offset: 2MB
>> Thread 3, data_size: 1MB, offset: 3MB
>>
>> <repeat cycle>
>> Thread 0, data_size: 1MB, offset: 4MB
>> and so on. (This happens in parallel; I don't wait for one cycle to end
>> before the next one begins.)
>>
>> I am going to try the following:
>> a)
>> Instead of a round-robin distribution of offsets, test with sequential
>> offsets:
>> Thread 0, data_size: 1MB, offset:0
>> Thread 0, data_size: 1MB, offset:1MB
>> Thread 0, data_size: 1MB, offset:2MB
>> Thread 0, data_size: 1MB, offset:3MB
>>
>> Thread 1, data_size: 1MB, offset:4MB
>> and so on. (I am going to keep these as separate pwrite I/O requests
>> instead of merging them or using writev.)
>>
>> b)
>> Map the threads to the number of OSTs using some modulo, as suggested
>> in the email below.
>>
>> c)
>> Experiment with a smaller number of OSTs (I currently have 48).
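>>
>> Assuming pwrite and a fixed block size, the offset schemes in (a) and
>> (b) reduce to simple index arithmetic. A sketch (the function and
>> parameter names are mine, not from simple_io_test.c) of the offset each
>> thread's b-th block would get:

```c
#include <assert.h>

#define MB (1L << 20)

/* Scheme (a): sequential offsets -- thread `id` owns one contiguous
 * region of the file and writes it block by block. */
long seq_offset(long id, long region, long block, long b)
{
    return id * region + b * block;
}

/* Scheme (b): modulo mapping -- thread `id` owns the stripes that land
 * on "its" OSTs, repeating every nthreads*chunk bytes (e.g. 8 threads,
 * 48 OSTs, a 6 MB chunk per thread). */
long ost_offset(long id, long chunk, long nthreads, long b)
{
    return id * chunk + b * nthreads * chunk;
}
```

>> With scheme (a) and 4 MB regions, thread 1's first block lands at
>> offset 4 MB; with scheme (b), thread 1's second chunk lands 48 MB past
>> its first.
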
>>
>> I shall report back with my findings.
>>
>> Thanks,
>> Kshitij
>>
>>> [Moved to Lustre-discuss]
>>>
>>>
>>> "However, if I spawn 8 threads such that all of them write to the same
>>> file (non-overlapping locations), without explicitly synchronizing the
>>> writes (i.e. I dont lock the file handle)"
>>>
>>>
>>> How exactly does your multi-threaded application write the data? Are
>>> you using pwrite to ensure non-overlapping regions, or are they all
>>> just doing unlocked write() operations on the same fd (each write
>>> just transferring size/8)? If it divides the file into N pieces, and
>>> each thread does pwrite on its piece, then what each OST sees are
>>> multiple streams at wide offsets to the same object, which could impact
>>> performance.
>>>
>>> If on the other hand the file is written sequentially, where each
>>> thread
>>> grabs the next piece to be written (locking normally used for the
>>> current_offset value, so you know where each chunk is actually going),
>>> then you get a more sequential pattern at the OST.
>>>
>>> If the number of threads maps to the number of OSTs (or some modulo,
>>> like in your case 6 OSTs per thread), and each thread "owns" the piece
>>> of the file that belongs to an OST (i.e. for (offset = thread_num * 6MB;
>>> offset < size; offset += 48MB) pwrite(fd, buf, 6MB, offset); ), then
>>> you've eliminated the need for application locks (assuming the use of
>>> pwrite) and ensured each OST object is being written sequentially.
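>>>
>>> That loop, written out as a compilable sketch (the 6 MB / 48 MB
>>> constants assume the 8-thread, 48-OST case; the driver function and
>>> file name are illustrative, not from the original test):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define MB (1LL << 20)

/* Thread `thread_num` of 8 writes only the 6 MB chunks belonging to its
 * 6 OSTs, striding by 48 MB (48 OSTs total). Returns the number of
 * chunks written, or -1 on error. pwrite needs no application locking
 * because every thread's offsets are disjoint. */
long write_owned_chunks(int fd, const char *buf, int thread_num, long long size)
{
    long n = 0;
    for (long long offset = thread_num * 6 * MB; offset < size; offset += 48 * MB) {
        if (pwrite(fd, buf, 6 * MB, offset) != 6 * MB)
            return -1;
        n++;
    }
    return n;
}

/* Small driver: write a 96 MB file as thread 0 of 8 would. */
long demo(void)
{
    char *buf = calloc(1, 6 * MB);
    int fd = open("chunks.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    long n = write_owned_chunks(fd, buf, 0, 96 * MB);
    close(fd);
    free(buf);
    return n;
}
```
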
>>>
>>> It's quite possible there is some bottleneck on the shared fd. So
>>> perhaps the question is not why you aren't scaling with more threads,
>>> but why the single file is not able to saturate the client, or why the
>>> file BW is not scaling with more OSTs. It is somewhat common for
>>> multiple processes (on different nodes) to write non-overlapping
>>> regions
>>> of the same file; does performance improve if each thread opens its own
>>> file descriptor?
>>>
>>> Kevin
>>>
>>>
>>> Wojciech Turek wrote:
>>>> Ok so it looks like you have in total 64 OSTs and your output file is
>>>> striped across 48 of them. May I suggest that you limit number of
>>>> stripes, lets say a good number to start with would be 8 stripes and
>>>> also for best results use OST pools feature to arrange that each
>>>> stripe goes to OST owned by different OSS.
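>>>>
>>>> For reference, striping can be set with lfs before the file is
>>>> created (directory defaults are inherited by new files). The path and
>>>> pool name below are illustrative, and older Lustre releases spell the
>>>> stripe-size option -s rather than -S:

```
# Stripe new files in this directory across 8 OSTs, 1 MB stripe size:
lfs setstripe -c 8 -S 1M /mnt/lustre/outdir

# Verify the resulting layout:
lfs getstripe /mnt/lustre/outdir

# With an OST pool (created by the administrator via lctl pool_new /
# lctl pool_add), restrict those stripes to OSTs from that pool:
lfs setstripe -c 8 -S 1M -p mypool /mnt/lustre/outdir
```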
>>>>
>>>> regards,
>>>>
>>>> Wojciech
>>>>
>>>> On 23 May 2011 23:09, <kmehta at cs.uh.edu> wrote:
>>>>
>>>> Actually, 'lfs check servers' returns 64 entries as well, so I
>>>> presume the
>>>> system documentation is out of date.
>>>>
>>>> Again, I am sorry that the basic information was incorrect.
>>>>
>>>> - Kshitij
>>>>
>>>> > Run lfs getstripe <your_output_file> and paste the output of that
>>>> > command to the mailing list.
>>>> > A stripe count of 48 is not possible if you have at most 11 OSTs
>>>> > (the max stripe count would be 11).
>>>> > If your striping is correct, the bottleneck could be your client
>>>> > network.
>>>> >
>>>> > regards,
>>>> >
>>>> > Wojciech
>>>> >
>>>> >
>>>> >
>>>> > On 23 May 2011 22:35, <kmehta at cs.uh.edu> wrote:
>>>> >
>>>> >> The stripe count is 48.
>>>> >>
>>>> >> Just fyi, this is what my application does:
>>>> >> A simple I/O test where threads continually write blocks of size
>>>> >> 64 KBytes or 1 MByte (decided at compile time) until a large file
>>>> >> of, say, 16 GBytes is created.
>>>> >>
>>>> >> Thanks,
>>>> >> Kshitij
>>>> >>
>>>> >> > What is your stripe count on the file? If your default is 1,
>>>> >> > you are only writing to one of the OSTs. You can check with the
>>>> >> > lfs getstripe command and set the stripe count higher; hopefully
>>>> >> > your wide-striped file with threaded writes will be faster.
>>>> >> >
>>>> >> > Evan
>>>> >> >
>>>> >> > -----Original Message-----
>>>> >> > From: lustre-community-bounces at lists.lustre.org
>>>> >> > [mailto:lustre-community-bounces at lists.lustre.org] On Behalf Of
>>>> >> > kmehta at cs.uh.edu
>>>> >> > Sent: Monday, May 23, 2011 2:28 PM
>>>> >> > To: lustre-community at lists.lustre.org
>>>> >> > Subject: [Lustre-community] Poor multithreaded I/O performance
>>>> >> >
>>>> >> > Hello,
>>>> >> > I am running a multithreaded application that writes to a
>>>> common
>>>> >> shared
>>>> >> > file on lustre fs, and this is what I see:
>>>> >> >
>>>> >> > If I have a single thread in my application, I get a bandwidth
>>>> >> > of approx. 250 MBytes/sec (11 OSTs, 1 MByte stripe size).
>>>> >> > However, if I spawn 8 threads such that all of them write to
>>>> >> > the same file (non-overlapping locations), without explicitly
>>>> >> > synchronizing the writes (i.e. I don't lock the file handle),
>>>> >> > I still get the same bandwidth.
>>>> >> >
>>>> >> > Now, instead of writing to a shared file, if these threads
>>>> >> > write to separate files, the bandwidth obtained is approx.
>>>> >> > 700 MBytes/sec.
>>>> >> >
>>>> >> > I would ideally like my multithreaded application to see
>>>> >> > similar scaling.
>>>> >> > Any ideas why the performance is limited, and any workarounds?
>>>> >> >
>>>> >> > Thank you,
>>>> >> > Kshitij
>>>> >> >
>>>> >> >
>>>> >> > _______________________________________________
>>>> >> > Lustre-community mailing list
>>>> >> > Lustre-community at lists.lustre.org
>>>> >> > http://lists.lustre.org/mailman/listinfo/lustre-community
>>>> >> >
>>>> >>
>>>> >>
>>>>
>>>>
>>>>
>>>
>>
>>
>