[Lustre-discuss] Poor multithreaded I/O performance

kmehta at cs.uh.edu kmehta at cs.uh.edu
Mon Jun 6 11:20:43 PDT 2011


> are the separate files being striped 8 ways? Because that would allow
> them to hit possibly all 64 OSTs, while the shared-file case will only
> hit 8.

Yes, I found out that the files are striped 8 ways, so we end up hitting
64 OSTs. This is what I tried next:

1. Ran a test case where 6 threads write separate files, each of size 6
GB, to a directory with a stripe count of 8. The application thus writes
36 GB of data in total, and the six 8-way-striped files together span up
to 48 OSTs.

2. Ran a test case where 8 threads write a common file of size 36 GB to
a directory with a stripe count of 48.

Thus both tests ultimately write 36 GB of data across 48 OSTs. I still
see a bandwidth of 240 MB/s for test 2 (common file) and 740 MB/s for
test 1 (separate files).
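
For reference, a minimal sketch of the shared-file pattern used in test 2
(this is not the attached simple_io_test.c; the path is a placeholder,
error handling is trimmed, and the target directories are assumed to have
been prepared with lfs setstripe beforehand; compile with
gcc -D_FILE_OFFSET_BITS=64 -pthread):

    #include <sys/types.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NTHREADS 8
    #define BLOCK    (1UL << 20)           /* 1 MB, matches the stripe size */
    #define FILESIZE ((off_t)36 << 30)     /* 36 GB in total */

    static const char *path = "/mnt/lustre/testfile";  /* placeholder */

    static void *writer(void *arg)
    {
        long id = (long)arg;
        int fd = open(path, O_WRONLY);     /* each thread has its own fd */
        char *buf = malloc(BLOCK);
        memset(buf, 'x', BLOCK);
        /* round-robin: thread i owns blocks i, i+N, i+2N, ... */
        for (off_t off = (off_t)id * BLOCK; off < FILESIZE;
             off += (off_t)NTHREADS * BLOCK)
            if (pwrite(fd, buf, BLOCK, off) != (ssize_t)BLOCK)
                perror("pwrite");
        free(buf);
        close(fd);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        close(open(path, O_CREAT | O_WRONLY, 0644));   /* create the file */
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, writer, (void *)i);
        for (long i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }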

Thanks,
Kshitij

> I've been trying to test this, but I'm not finding an obvious error... so
> more questions:
>
> How much RAM do you have on your client, and how much on the OSTs? Some
> of my smaller tests go much faster, but I believe that is due to cache
> effects. My larger test at 32 GB gives pretty consistent results.
>
> The other thing to consider: are the separate files being striped 8 ways?
> Because that would allow them to hit possibly all 64 OSTs, while the
> shared-file case will only hit 8.
>
> Evan
>
> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org
> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Felix, Evan J
> Sent: Friday, June 03, 2011 9:09 AM
> To: kmehta at cs.uh.edu
> Cc: Lustre discuss
> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance
>
> What file sizes and segment sizes are you using for your tests?
>
> Evan
>
> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org
> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of
> kmehta at cs.uh.edu
> Sent: Thursday, June 02, 2011 5:07 PM
> To: kmehta at cs.uh.edu
> Cc: kmehta at cs.uh.edu; Lustre discuss
> Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance
>
> Hello,
> I was wondering if anyone could reproduce the performance issue in the
> multithreaded application using the C file that I posted in my previous
> email.
>
> Thanks,
> Kshitij
>
>
>> OK, I ran the following tests:
>>
>> [1]
>> The application spawns 8 threads. I write to Lustre with 8 OSTs.
>> Each thread writes data in blocks of 1 MB in a round-robin fashion,
>> i.e.
>>
>> T0 writes to offsets 0, 8MB, 16MB, etc.
>> T1 writes to offsets 1MB, 9MB, 17MB, etc.
>> With a stripe size of 1 MB, every thread ends up writing to only one
>> OST.
>>
>> I see a bandwidth of 280 MB/s, similar to the single-thread
>> performance.
>>
>> [2]
>> I also ran the same test such that every thread writes data in blocks
>> of 8 MB for the same stripe size (thus, every thread writes to every
>> OST). I still get similar performance, ~280 MB/s, so essentially I see
>> no difference between each thread writing to a single OST and each
>> thread writing to all OSTs.
>>
>> And as I said before, if all threads write to their own separate
>> files, the resulting bandwidth is ~700 MB/s.
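>>
>> (The separate-files variant differs only in the open and the offset
>> pattern; roughly, with id, buf, BLOCK, and per_thread_size as
>> illustrative names:
>>
>>     char name[64];
>>     snprintf(name, sizeof name, "testfile.%ld", id);
>>     int fd = open(name, O_CREAT | O_WRONLY, 0644);
>>     for (off_t off = 0; off < per_thread_size; off += BLOCK)
>>         pwrite(fd, buf, BLOCK, off);   /* sequential within its own file */
>>     close(fd);
>>
>> i.e. each thread streams sequentially into its own file.)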
>>
>> I have attached my C file (simple_io_test.c). Maybe you could run it
>> and see where the bottleneck is. Comments and compilation instructions
>> are included in the file. Do let me know if you need any clarification.
>>
>> Your help is appreciated,
>> Kshitij
>>
>>> This is what my application does:
>>>
>>> Each thread has its own file descriptor to the file.
>>> I use pwrite to ensure non-overlapping regions, as follows:
>>>
>>> Thread 0, data_size: 1MB, offset: 0
>>> Thread 1, data_size: 1MB, offset: 1MB
>>> Thread 2, data_size: 1MB, offset: 2MB
>>> Thread 3, data_size: 1MB, offset: 3MB
>>>
>>> <repeat cycle>
>>> Thread 0, data_size: 1MB, offset: 4MB
>>> and so on. (This happens in parallel; I don't wait for one cycle to
>>> end before the next one begins.)
>>>
>>> I am going to try the following:
>>> a)
>>> Instead of a round-robin distribution of offsets, test with
>>> sequential offsets:
>>> Thread 0, data_size: 1MB, offset: 0
>>> Thread 0, data_size: 1MB, offset: 1MB
>>> Thread 0, data_size: 1MB, offset: 2MB
>>> Thread 0, data_size: 1MB, offset: 3MB
>>>
>>> Thread 1, data_size: 1MB, offset: 4MB
>>> and so on. (I am going to keep these as separate pwrite I/O requests
>>> instead of merging them or using writev; see the sketch after this
>>> list.)
>>>
>>> b)
>>> Map the threads to the OSTs using some modulo arithmetic, as
>>> suggested in the email below.
>>>
>>> c)
>>> Experiment with a smaller number of OSTs (I currently have 48).
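>>>
>>> A rough sketch of pattern (a), with each thread owning one contiguous
>>> slice of the file (chunk, nthreads, id, and BLOCK are illustrative
>>> names):
>>>
>>>     off_t chunk = size / nthreads;      /* contiguous slice per thread */
>>>     for (off_t off = (off_t)id * chunk; off < (off_t)(id + 1) * chunk;
>>>          off += BLOCK)
>>>         pwrite(fd, buf, BLOCK, off);    /* still separate 1 MB pwrites */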
>>>
>>> I shall report back with my findings.
>>>
>>> Thanks,
>>> Kshitij
>>>
>>>> [Moved to Lustre-discuss]
>>>>
>>>>
>>>> "However, if I spawn 8 threads such that all of them write to the
>>>> same file (non-overlapping locations), without explicitly
>>>> synchronizing the writes (i.e. I dont lock the file handle)"
>>>>
>>>>
>>>> How exactly does your multi-threaded application write the data?
>>>> Are you using pwrite to ensure non-overlapping regions, or are they
>>>> all just doing unlocked write() operations on the same fd (each
>>>> transferring size/8)? If it divides the file into N pieces, and each
>>>> thread does pwrite on its own piece, then what each OST sees is
>>>> multiple streams at wide offsets to the same object, which could
>>>> impact performance.
>>>>
>>>> If on the other hand the file is written sequentially, where each
>>>> thread grabs the next piece to be written (locking normally used for
>>>> the current_offset value, so you know where each chunk is actually
>>>> going), then you get a more sequential pattern at the OST.
>>>>
>>>> If the number of threads maps to the number of OSTs (or some
>>>> modulo; in your case, 6 OSTs per thread), and each thread "owns" the
>>>> piece of the file that belongs to an OST (i.e. for (offset =
>>>> thread_num * 6MB; offset < size; offset += 48MB) pwrite(fd, buf,
>>>> 6MB, offset);), then you've eliminated the need for application
>>>> locks (assuming the use of pwrite) and ensured each OST object is
>>>> written sequentially.
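>>>>
>>>> In compilable form, that loop would look roughly like this (MB,
>>>> size, and buf are illustrative; 8 threads, 48 OSTs, 1 MB stripes, so
>>>> each thread owns 6 consecutive stripes out of every 48 MB):
>>>>
>>>>     #define MB (1UL << 20)
>>>>     for (off_t off = (off_t)thread_num * 6 * MB; off < size;
>>>>          off += 48 * MB)
>>>>         pwrite(fd, buf, 6 * MB, off);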
>>>>
>>>> It's quite possible there is some bottleneck on the shared fd.  So
>>>> perhaps the question is not why you aren't scaling with more
>>>> threads, but why the single file is not able to saturate the client,
>>>> or why the file BW is not scaling with more OSTs.  It is somewhat
>>>> common for multiple processes (on different nodes) to write
>>>> non-overlapping regions of the same file; does performance improve
>>>> if each thread opens its own file descriptor?
>>>>
>>>> Kevin
>>>>
>>>>
>>>> Wojciech Turek wrote:
>>>>> OK, so it looks like you have 64 OSTs in total and your output file
>>>>> is striped across 48 of them. May I suggest that you limit the
>>>>> number of stripes; a good number to start with would be 8. For best
>>>>> results, also use the OST pools feature to arrange that each stripe
>>>>> goes to an OST owned by a different OSS.
>>>>>
>>>>> regards,
>>>>>
>>>>> Wojciech
>>>>>
>>>>> On 23 May 2011 23:09, <kmehta at cs.uh.edu> wrote:
>>>>>
>>>>>     Actually, 'lfs check servers' returns 64 entries as well, so I
>>>>>     presume the system documentation is out of date.
>>>>>
>>>>>     Again, I am sorry that the basic information was incorrect.
>>>>>
>>>>>     - Kshitij
>>>>>
>>>>>     > Run lfs getstripe <your_output_file> and paste the output of
>>>>>     > that command to the mailing list.
>>>>>     > A stripe count of 48 is not possible if you have at most 11
>>>>>     > OSTs (the max stripe count would be 11).
>>>>>     > If your striping is correct, the bottleneck may be your
>>>>>     > client network.
>>>>>     >
>>>>>     > regards,
>>>>>     >
>>>>>     > Wojciech
>>>>>     >
>>>>>     >
>>>>>     >
>>>>>     > On 23 May 2011 22:35, <kmehta at cs.uh.edu> wrote:
>>>>>     >
>>>>>     >> The stripe count is 48.
>>>>>     >>
>>>>>     >> Just FYI, this is what my application does:
>>>>>     >> A simple I/O test where threads continually write blocks of
>>>>>     >> size 64 KB or 1 MB (decided at compile time) until a large
>>>>>     >> file of, say, 16 GB has been created.
>>>>>     >>
>>>>>     >> Thanks,
>>>>>     >> Kshitij
>>>>>     >>
>>>>>     >> > What is your stripe count on the file? If your default is
>>>>>     >> > 1, you are only writing to one of the OSTs. You can check
>>>>>     >> > with the lfs getstripe command and set the stripe count
>>>>>     >> > higher; hopefully your wide-striped file with threaded
>>>>>     >> > writes will be faster.
>>>>>     >> >
>>>>>     >> > Evan
>>>>>     >> >
>>>>>     >> > -----Original Message-----
>>>>>     >> > From: lustre-community-bounces at lists.lustre.org
>>>>>     >> > [mailto:lustre-community-bounces at lists.lustre.org] On Behalf Of
>>>>>     >> > kmehta at cs.uh.edu
>>>>>     >> > Sent: Monday, May 23, 2011 2:28 PM
>>>>>     >> > To: lustre-community at lists.lustre.org
>>>>>     >> > Subject: [Lustre-community] Poor multithreaded I/O performance
>>>>>     >> >
>>>>>     >> > Hello,
>>>>>     >> > I am running a multithreaded application that writes to a
>>>>>     >> > common shared file on a Lustre file system, and this is
>>>>>     >> > what I see:
>>>>>     >> >
>>>>>     >> > If I have a single thread in my application, I get a
>>>>>     >> > bandwidth of approx. 250 MB/s (11 OSTs, 1 MB stripe size).
>>>>>     >> > However, if I spawn 8 threads such that all of them write
>>>>>     >> > to the same file (non-overlapping locations), without
>>>>>     >> > explicitly synchronizing the writes (i.e. I don't lock the
>>>>>     >> > file handle), I still get the same bandwidth.
>>>>>     >> >
>>>>>     >> > Now, instead of writing to a shared file, if these threads
>>>>>     >> > write to separate files, the bandwidth obtained is approx.
>>>>>     >> > 700 MB/s.
>>>>>     >> >
>>>>>     >> > I would ideally like my multithreaded application to see
>>>>>     >> > similar scaling. Any ideas why the performance is limited,
>>>>>     >> > and any workarounds?
>>>>>     >> >
>>>>>     >> > Thank you,
>>>>>     >> > Kshitij
>>>>>     >> >
>>>>
>>>
>>>
>>
>
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss




