[Lustre-discuss] Poor multithreaded I/O performance

kmehta at cs.uh.edu
Thu Jun 2 17:06:48 PDT 2011


Hello,
I was wondering if anyone could reproduce the multithreaded application's
performance using the C file (simple_io_test.c) that I posted in my previous
email.

Thanks,
Kshitij


> Ok I ran the following tests:
>
> [1]
> The application spawns 8 threads and writes to a file on Lustre striped
> across 8 OSTs.
> Each thread writes data in blocks of 1 MByte in a round-robin fashion,
> i.e.
>
> T0 writes to offsets 0, 8MB, 16MB, etc.
> T1 writes to offsets 1MB, 9MB, 17MB, etc.
> With the stripe size being 1 MByte, every thread ends up writing to only
> one OST.
>
> I see a bandwidth of 280 Mbytes/sec, similar to the single thread
> performance.
>
> [2]
> I also ran the same test with every thread writing data in blocks of
> 8 MBytes at the same stripe size (thus every thread writes to every
> OST). I still get similar performance, ~280 MBytes/sec, so essentially I
> see no difference between each thread writing to a single OST and each
> thread writing to all OSTs.
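>
> For reference, the write loop is roughly the following (an abridged,
> illustrative sketch of the pattern above, not the attached
> simple_io_test.c; error handling and setup are omitted):
>
>   #define _XOPEN_SOURCE 500
>   #include <sys/types.h>
>   #include <unistd.h>
>
>   #define NUM_THREADS 8
>
>   struct task {
>       int    fd;     /* per-thread descriptor on the shared file */
>       int    id;     /* 0 .. NUM_THREADS-1 */
>       size_t block;  /* 1 MB in test [1], 8 MB in test [2] */
>       off_t  total;  /* total file size, e.g. 16 GB */
>       char  *buf;    /* block-sized buffer of data to write */
>   };
>
>   static void *writer(void *arg)
>   {
>       struct task *t = arg;
>       off_t stride = (off_t)t->block * NUM_THREADS;
>
>       /* Thread t owns offsets t*block, t*block + stride, ... */
>       for (off_t off = (off_t)t->id * t->block; off < t->total;
>            off += stride)
>           pwrite(t->fd, t->buf, t->block, off);
>       return NULL;
>   }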
>
> And as I said before, if all threads write to their own separate file, the
> resulting bandwidth is ~700Mbytes/sec.
>
> I have attached my C file (simple_io_test.c) herewith. Maybe you could run
> it and see where the bottleneck is. Comments and instructions for
> compilation have been included in the file. Do let me know if you need any
> clarification on that.
>
> Your help is appreciated,
> Kshitij
>
>> This is what my application does:
>>
>> Each thread has its own file descriptor to the file.
>> I use pwrite to ensure non-overlapping regions, as follows:
>>
>> Thread 0, data_size: 1MB, offset: 0
>> Thread 1, data_size: 1MB, offset: 1MB
>> Thread 2, data_size: 1MB, offset: 2MB
>> Thread 3, data_size: 1MB, offset: 3MB
>>
>> <repeat cycle>
>> Thread 0, data_size: 1MB, offset: 4MB
>> and so on (this happens in parallel; I don't wait for one cycle to end
>> before the next one begins).
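>>
>> In case it helps, the setup is roughly this (an illustrative sketch, not
>> my actual code; the path and sizes are placeholders and error handling
>> is omitted):
>>
>>   #define _XOPEN_SOURCE 500
>>   #include <fcntl.h>
>>   #include <pthread.h>
>>   #include <unistd.h>
>>
>>   #define NTHREADS 4
>>
>>   static void *worker(void *arg)
>>   {
>>       long id = (long)arg;   /* thread number, 0..NTHREADS-1 */
>>       /* Each thread opens its own descriptor on the shared file. */
>>       int fd = open("/mnt/lustre/outfile", O_WRONLY | O_CREAT, 0644);
>>       /* ... pwrite() loop over this thread's (id-based) offsets ... */
>>       close(fd);
>>       return NULL;
>>   }
>>
>>   int main(void)
>>   {
>>       pthread_t th[NTHREADS];
>>       for (long i = 0; i < NTHREADS; i++)
>>           pthread_create(&th[i], NULL, worker, (void *)i);
>>       for (int i = 0; i < NTHREADS; i++)
>>           pthread_join(th[i], NULL);
>>       return 0;
>>   }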
>>
>> I am going to try the following:
>> a)
>> Instead of a round-robin distribution of offsets, test with sequential
>> offsets (a rough sketch of this follows after the list):
>> Thread 0, data_size: 1MB, offset: 0
>> Thread 0, data_size: 1MB, offset: 1MB
>> Thread 0, data_size: 1MB, offset: 2MB
>> Thread 0, data_size: 1MB, offset: 3MB
>>
>> Thread 1, data_size: 1MB, offset: 4MB
>> and so on. (I am going to keep these as separate pwrite I/O requests
>> instead of merging them or using writev.)
>>
>> b)
>> Map the threads to the number of OSTs using some modulo, as suggested
>> in the email below.
>>
>> c)
>> Experiment with fewer OSTs (I currently have 48).
>>
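>> For (a), the layout I have in mind is roughly the following (a rough
>> sketch; it assumes each thread owns one contiguous region of the file,
>> and the variable names are only illustrative):
>>
>>   /* thread_id, fd and buf as before; BLOCK_SIZE = 1MB */
>>   off_t region = total_size / NTHREADS;
>>   off_t start  = (off_t)thread_id * region;
>>
>>   for (off_t off = start; off < start + region; off += BLOCK_SIZE)
>>       pwrite(fd, buf, BLOCK_SIZE, off);
>>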
>> I shall report back with my findings.
>>
>> Thanks,
>> Kshitij
>>
>>> [Moved to Lustre-discuss]
>>>
>>>
>>> "However, if I spawn 8 threads such that all of them write to the same
>>> file (non-overlapping locations), without explicitly synchronizing the
>>> writes (i.e. I dont lock the file handle)"
>>>
>>>
>>> How exactly does your multi-threaded application write the data?  Are
>>> you using pwrite to ensure non-overlapping regions, or are they all just
>>> doing unlocked write() operations on the same fd (each just
>>> transferring size/8)?  If it divides the file into N pieces, and
>>> each thread does pwrite on its piece, then what each OST sees is
>>> multiple streams at wide offsets to the same object, which could impact
>>> performance.
>>>
>>> If on the other hand the file is written sequentially, where each
>>> thread
>>> grabs the next piece to be written (locking normally used for the
>>> current_offset value, so you know where each chunk is actually going),
>>> then you get a more sequential pattern at the OST.
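>>>
>>> The bookkeeping for that second pattern looks something like this
>>> (illustration only, with <pthread.h> and <unistd.h> included;
>>> CHUNK_SIZE, fd and buf are whatever the application uses):
>>>
>>>   static pthread_mutex_t off_lock = PTHREAD_MUTEX_INITIALIZER;
>>>   static off_t current_offset;        /* shared by all threads */
>>>
>>>   /* in each thread, repeated until the file is fully written: */
>>>   pthread_mutex_lock(&off_lock);
>>>   off_t my_off = current_offset;      /* claim the next chunk */
>>>   current_offset += CHUNK_SIZE;
>>>   pthread_mutex_unlock(&off_lock);
>>>   pwrite(fd, buf, CHUNK_SIZE, my_off);
>>>
>>> Only the offset bookkeeping is locked, not the write itself.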
>>>
>>> If the number of threads maps to the number of OSTs (or some modulo,
>>> like in your case 6 OSTs per thread), and each thread "owns" the piece
>>> of the file that belongs to an OST (ie: for (offset = thread_num * 6MB;
>>> offset < size; offset += 48MB) pwrite(fd, buf, 6MB, offset); ), then
>>> you've eliminated the need for application locks (assuming the use of
>>> pwrite) and ensured each OST object is being written sequentially.
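>>>
>>> A minimal sketch of that last pattern, assuming 48 OSTs, a 1MB stripe
>>> size and 8 threads (so each thread owns 6 consecutive stripes; the
>>> names and sizes are only illustrative):
>>>
>>>   #define _XOPEN_SOURCE 500
>>>   #include <sys/types.h>
>>>   #include <unistd.h>
>>>
>>>   #define MB          (1024ULL * 1024ULL)
>>>   #define NUM_OSTS    48
>>>   #define NUM_THREADS 8
>>>   #define CHUNK       ((NUM_OSTS / NUM_THREADS) * MB)   /* 6MB per call */
>>>   #define STRIDE      ((off_t)(NUM_OSTS * MB))          /* 48MB         */
>>>
>>>   struct owner { int fd; int id; off_t total; char *buf; };
>>>
>>>   /* Thread "id" always writes the same 6 stripes, so every OST
>>>    * object is written sequentially by exactly one thread. */
>>>   static void *ost_owner_writer(void *arg)
>>>   {
>>>       struct owner *o = arg;
>>>       for (off_t off = (off_t)o->id * CHUNK; off < o->total;
>>>            off += STRIDE)
>>>           pwrite(o->fd, o->buf, CHUNK, off);
>>>       return NULL;
>>>   }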
>>>
>>> It's quite possible there is some bottleneck on the shared fd.  So
>>> perhaps the question is not why you aren't scaling with more threads,
>>> but why the single file is not able to saturate the client, or why the
>>> file BW is not scaling with more OSTs.  It is somewhat common for
>>> multiple processes (on different nodes) to write non-overlapping
>>> regions
>>> of the same file; does performance improve if each thread opens its own
>>> file descriptor?
>>>
>>> Kevin
>>>
>>>
>>> Wojciech Turek wrote:
>>>> OK, so it looks like you have 64 OSTs in total and your output file
>>>> is striped across 48 of them. May I suggest that you limit the
>>>> number of stripes; a good number to start with would be 8. For best
>>>> results, also use the OST pools feature to arrange that each stripe
>>>> goes to an OST owned by a different OSS.
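>>>>
>>>> From the shell, 'lfs setstripe -c 8 <file>' on a new (empty) file
>>>> gives you an 8-stripe layout. If you would rather set it up from C,
>>>> something along these lines should work, assuming liblustreapi is
>>>> available (the header and library names vary a little between Lustre
>>>> versions, so treat this as a sketch):
>>>>
>>>>   #include <lustre/liblustreapi.h>
>>>>
>>>>   /* Pre-create the output file with a 1MB stripe size, default
>>>>    * starting OST (-1), 8 stripes and the default striping pattern. */
>>>>   int make_striped_file(const char *path)
>>>>   {
>>>>       return llapi_file_create(path, 1048576, -1, 8, 0);
>>>>   }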
>>>>
>>>> regards,
>>>>
>>>> Wojciech
>>>>
>>>> On 23 May 2011 23:09, <kmehta at cs.uh.edu> wrote:
>>>>
>>>>     Actually, 'lfs check servers' returns 64 entries as well, so I
>>>>     presume the system documentation is out of date.
>>>>
>>>>     Again, I am sorry the basic information was incorrect.
>>>>
>>>>     - Kshitij
>>>>
>>>>     > Run lfs getstripe <your_output_file> and paste the output of
>>>>     > that command to the mailing list.
>>>>     > A stripe count of 48 is not possible if you have at most 11 OSTs
>>>>     > (the maximum stripe count would be 11).
>>>>     > If your striping is correct, the bottleneck may be your client
>>>>     > network.
>>>>     >
>>>>     > regards,
>>>>     >
>>>>     > Wojciech
>>>>     >
>>>>     >
>>>>     >
>>>>     > On 23 May 2011 22:35, <kmehta at cs.uh.edu> wrote:
>>>>     >
>>>>     >> The stripe count is 48.
>>>>     >>
>>>>     >> Just fyi, this is what my application does:
>>>>     >> A simple I/O test where threads continually write blocks of
>>>>     >> size 64 Kbytes or 1 Mbyte (decided at compile time) until a
>>>>     >> large file of, say, 16 Gbytes is created.
>>>>     >>
>>>>     >> Thanks,
>>>>     >> Kshitij
>>>>     >>
>>>>     >> > What is your stripe count on the file? If your default is 1,
>>>>     >> > you are only writing to one of the OSTs. You can check with
>>>>     >> > the lfs getstripe command; you can set the stripe count higher,
>>>>     >> > and hopefully your wide-striped file with threaded writes will
>>>>     >> > be faster.
>>>>     >> >
>>>>     >> > Evan
>>>>     >> >
>>>>     >> > -----Original Message-----
>>>>     >> > From: lustre-community-bounces at lists.lustre.org
>>>>     >> > [mailto:lustre-community-bounces at lists.lustre.org] On Behalf Of
>>>>     >> > kmehta at cs.uh.edu
>>>>     >> > Sent: Monday, May 23, 2011 2:28 PM
>>>>     >> > To: lustre-community at lists.lustre.org
>>>>     >> > Subject: [Lustre-community] Poor multithreaded I/O performance
>>>>     >> >
>>>>     >> > Hello,
>>>>     >> > I am running a multithreaded application that writes to a
>>>>     >> > common shared file on a Lustre fs, and this is what I see:
>>>>     >> >
>>>>     >> > If I have a single thread in my application, I get a bandwidth
>>>>     >> > of approx. 250 MBytes/sec (11 OSTs, 1 MByte stripe size).
>>>>     >> > However, if I spawn 8 threads such that all of them write to
>>>>     >> > the same file (non-overlapping locations), without explicitly
>>>>     >> > synchronizing the writes (i.e. I don't lock the file handle),
>>>>     >> > I still get the same bandwidth.
>>>>     >> >
>>>>     >> > Now, instead of writing to a shared file, if these threads
>>>>     >> > write to separate files, the bandwidth obtained is approx.
>>>>     >> > 700 MBytes/sec.
>>>>     >> >
>>>>     >> > I would ideally like my multithreaded application to see
>>>>     >> > similar scaling. Any ideas why the performance is limited,
>>>>     >> > and any workarounds?
>>>>     >> >
>>>>     >> > Thank you,
>>>>     >> > Kshitij
>>>>     >> >
>>>>     >> >
>>>>
>>>
>>
>>
>




