[Lustre-discuss] Poor multithreaded I/O performance
Felix, Evan J
Evan.Felix at pnnl.gov
Fri Jun 3 12:51:43 PDT 2011
I've been trying to test this, but not finding an obvious error... so more questions:
How much RAM do you have on your client, and how much on the OSTs? Some of my smaller tests go much faster, but I believe that is due to cache effects. My larger test at 32GB gives pretty consistent results.
The other thing to consider: are the separate files being striped 8 ways? That would allow them to hit possibly all 64 OSTs, while the shared-file case will only hit 8.
Evan
-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Felix, Evan J
Sent: Friday, June 03, 2011 9:09 AM
To: kmehta at cs.uh.edu
Cc: Lustre discuss
Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance
What file sizes and segment sizes are you using for your tests?
Evan
-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of kmehta at cs.uh.edu
Sent: Thursday, June 02, 2011 5:07 PM
To: kmehta at cs.uh.edu
Cc: kmehta at cs.uh.edu; Lustre discuss
Subject: Re: [Lustre-discuss] Poor multithreaded I/O performance
Hello,
I was wondering if anyone could replicate the performance of the multithreaded application using the C file that I posted in my previous email.
Thanks,
Kshitij
> Ok I ran the following tests:
>
> [1]
> Application spawns 8 threads. I write to a file on Lustre striped across 8 OSTs.
> Each thread writes data in blocks of 1 Mbyte in a round robin fashion,
> i.e.
>
> T0 writes to offsets 0, 8MB, 16MB, etc.
> T1 writes to offsets 1MB, 9MB, 17MB, etc.
> The stripe size being 1MByte, every thread ends up writing to only 1 OST.
>
> I see a bandwidth of 280 Mbytes/sec, similar to the single thread
> performance.
>
> [2]
> I also ran the same test such that every thread writes data in blocks
> of 8 Mbytes for the same stripe size. (Thus, every thread will write
> to every OST). I still get similar performance, ~280Mbytes/sec, so
> essentially I see no difference between each thread writing to a
> single OST vs each thread writing to all OSTs.
>
> And as I said before, if all threads write to their own separate file,
> the resulting bandwidth is ~700Mbytes/sec.
>
> I have attached my C file (simple_io_test.c) herewith. Maybe you could
> run it and see where the bottleneck is. Comments and instructions for
> compilation have been included in the file. Do let me know if you need
> any clarification on that.
>
> Your help is appreciated,
> Kshitij
>
>> This is what my application does:
>>
>> Each thread has its own file descriptor to the file.
>> I use pwrite to ensure non-overlapping regions, as follows:
>>
>> Thread 0, data_size: 1MB, offset: 0
>> Thread 1, data_size: 1MB, offset: 1MB
>> Thread 2, data_size: 1MB, offset: 2MB
>> Thread 3, data_size: 1MB, offset: 3MB
>>
>> <repeat cycle>
>> Thread 0, data_size: 1MB, offset: 4MB, and so on. (This happens in
>> parallel; I don't wait for one cycle to end before the next one
>> begins.)
>>
>> I am going to try the following:
>> a)
>> Instead of a round-robin distribution of offsets, test with
>> sequential
>> offsets:
>> Thread 0, data_size: 1MB, offset:0
>> Thread 0, data_size: 1MB, offset:1MB
>> Thread 0, data_size: 1MB, offset:2MB
>> Thread 0, data_size: 1MB, offset:3MB
>>
>> Thread 1, data_size: 1MB, offset:4MB
>> and so on. (I am going to keep these as separate pwrite I/O requests
>> instead of merging them or using writev.)
>>
>> b)
>> Map the threads to the no. of OSTs using some modulo, as suggested in
>> the email below.
>>
>> c)
>> Experiment with fewer OSTs (I currently have 48).
>>
>> I shall report back with my findings.
>>
>> Thanks,
>> Kshitij
>>
>>> [Moved to Lustre-discuss]
>>>
>>>
>>> "However, if I spawn 8 threads such that all of them write to the
>>> same file (non-overlapping locations), without explicitly
>>> synchronizing the writes (i.e. I don't lock the file handle)"
>>>
>>>
>>> How exactly does your multi-threaded application write the data?
>>> Are you using pwrite to ensure non-overlapping regions or are they
>>> all just doing unlocked write() operations on the same fd (each just
>>> transferring size/8)? If it divides the file into
>>> N pieces, and each thread does pwrite on its piece, then what each
>>> OST sees are multiple streams at wide offsets to the same object,
>>> which could impact performance.
>>>
>>> If on the other hand the file is written sequentially, where each
>>> thread grabs the next piece to be written (locking normally used for
>>> the current_offset value, so you know where each chunk is actually
>>> going), then you get a more sequential pattern at the OST.
>>>
>>> If the number of threads maps to the number of OSTs (or some modulo,
>>> like in your case 6 OSTs per thread), and each thread "owns" the
>>> piece of the file that belongs to an OST (i.e. for (offset =
>>> thread_num * 6MB; offset < size; offset += 48MB) pwrite(fd, buf,
>>> 6MB, offset);), then you've eliminated the need for application
>>> locks (assuming the use of pwrite) and ensured each OST object is
>>> being written sequentially.
>>>
>>> It's quite possible there is some bottleneck on the shared fd. So
>>> perhaps the question is not why you aren't scaling with more
>>> threads, but why the single file is not able to saturate the client,
>>> or why the file BW is not scaling with more OSTs. It is somewhat
>>> common for multiple processes (on different nodes) to write
>>> non-overlapping regions of the same file; does performance improve
>>> if each thread opens its own file descriptor?
>>>
>>> Kevin
>>>
>>>
>>> Wojciech Turek wrote:
>>>> Ok, so it looks like you have 64 OSTs in total and your output file
>>>> is striped across 48 of them. May I suggest that you limit the
>>>> number of stripes; a good number to start with would be 8 stripes.
>>>> Also, for best results use the OST pools feature to arrange that
>>>> each stripe goes to an OST owned by a different OSS.
>>>>
>>>> regards,
>>>>
>>>> Wojciech
>>>>
>>>> On 23 May 2011 23:09, <kmehta at cs.uh.edu> wrote:
>>>>
>>>> Actually, 'lfs check servers' returns 64 entries as well, so I
>>>> presume the system documentation is out of date.
>>>>
>>>> Again, I am sorry the basic information had been incorrect.
>>>>
>>>> - Kshitij
>>>>
>>>> > Run lfs getstripe <your_output_file> and paste the output of that
>>>> > command to the mailing list.
>>>> > Stripe count of 48 is not possible if you have max 11 OSTs (the
>>>> > max stripe count will be 11).
>>>> > If your striping is correct, the bottleneck can be your client
>>>> > network.
>>>> >
>>>> > regards,
>>>> >
>>>> > Wojciech
>>>> >
>>>> >
>>>> >
>>>> > On 23 May 2011 22:35, <kmehta at cs.uh.edu> wrote:
>>>> >
>>>> >> The stripe count is 48.
>>>> >>
>>>> >> Just fyi, this is what my application does:
>>>> >> A simple I/O test where threads continually write blocks of size
>>>> >> 64 Kbytes or 1 Mbyte (decided at compile time) till a large file
>>>> >> of, say, 16 Gbytes is created.
>>>> >>
>>>> >> Thanks,
>>>> >> Kshitij
>>>> >>
>>>> >> > What is your stripe count on the file? If your default is 1,
>>>> >> > you are only writing to one of the OSTs. You can check with the
>>>> >> > lfs getstripe command. You can set the stripe count higher, and
>>>> >> > hopefully your wide-striped file with threaded writes will be
>>>> >> > faster.
>>>> >> >
>>>> >> > Evan
>>>> >> >
>>>> >> > -----Original Message-----
>>>> >> > From: lustre-community-bounces at lists.lustre.org
>>>> >> > [mailto:lustre-community-bounces at lists.lustre.org] On Behalf Of
>>>> >> > kmehta at cs.uh.edu
>>>> >> > Sent: Monday, May 23, 2011 2:28 PM
>>>> >> > To: lustre-community at lists.lustre.org
>>>> >> > Subject: [Lustre-community] Poor multithreaded I/O performance
>>>> >> >
>>>> >> > Hello,
>>>> >> > I am running a multithreaded application that writes to a
>>>> >> > common shared file on lustre fs, and this is what I see:
>>>> >> >
>>>> >> > If I have a single thread in my application, I get a bandwidth
>>>> >> > of approx. 250 MBytes/sec (11 OSTs, 1 MByte stripe size).
>>>> >> > However, if I spawn 8 threads such that all of them write to
>>>> >> > the same file (non-overlapping locations), without explicitly
>>>> >> > synchronizing the writes (i.e. I don't lock the file handle), I
>>>> >> > still get the same bandwidth.
>>>> >> >
>>>> >> > Now, instead of writing to a shared file, if these threads
>>>> >> > write to separate files, the bandwidth obtained is approx. 700
>>>> >> > Mbytes/sec.
>>>> >> >
>>>> >> > I would ideally like my multithreaded application to see
>>>> >> > similar scaling. Any ideas why the performance is limited, and
>>>> >> > any workarounds?
>>>> >> >
>>>> >> > Thank you,
>>>> >> > Kshitij
>>>> >> >
>>>> >> >
>>>> >> > _______________________________________________
>>>> >> > Lustre-community mailing list
>>>> >> > Lustre-community at lists.lustre.org
>>>> >> > http://lists.lustre.org/mailman/listinfo/lustre-community
>>>> >> >
>>>> >>
>>>> >>
>>>>
>>>>
>>>>
>>>
>>
>>
>
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss