[Lustre-discuss] Poor multithreaded I/O performance

kmehta at cs.uh.edu
Thu May 26 12:02:19 PDT 2011


OK, I ran the following tests:

[1]
The application spawns 8 threads and writes to a file striped across 8 OSTs.
Each thread writes data in blocks of 1 Mbyte in round-robin fashion, i.e.

T0 writes to offsets 0, 8MB, 16MB, etc.
T1 writes to offsets 1MB, 9MB, 17MB, etc.
Since the stripe size is 1 Mbyte, every thread ends up writing to only 1 OST.

I see a bandwidth of 280 Mbytes/sec, similar to the single-thread
performance.

[2]
I also ran the same test with every thread writing data in blocks of 8
Mbytes at the same stripe size (thus, every thread writes to every
OST). I still get similar performance, ~280 Mbytes/sec, so essentially I
see no difference between each thread writing to a single OST and each
thread writing to all OSTs.
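
For reference, here is a minimal sketch of the shared-file pattern used in
both tests; the path, file size, and error handling are placeholders, and the
real code is in the attached simple_io_test.c. With BLOCK_SIZE = 1 Mbyte each
thread stays on a single OST as in [1]; with 8 Mbytes each write spans all 8
OSTs as in [2]:

/*
 * Minimal sketch of the shared-file round-robin pattern (not the attached
 * simple_io_test.c). The file is assumed to be striped beforehand, e.g.
 * with lfs setstripe. Compile with: gcc -O2 -pthread rr_sketch.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS   8
#define BLOCK_SIZE (1UL << 20)   /* 1 Mbyte per write; 8 MB reproduces test [2] */
#define FILE_SIZE  (1UL << 34)   /* 16 Gbytes total */

static const char *path = "/mnt/lustre/shared_file";   /* placeholder path */

static void *writer(void *arg)
{
    long tid = (long)arg;
    int fd = open(path, O_WRONLY, 0644);   /* each thread has its own fd */
    if (fd < 0) { perror("open"); return NULL; }

    char *buf = malloc(BLOCK_SIZE);
    memset(buf, 'a' + (tid % 26), BLOCK_SIZE);

    /* Thread t owns blocks t, t+NTHREADS, t+2*NTHREADS, ... (round robin) */
    for (off_t block = tid; block * BLOCK_SIZE < FILE_SIZE; block += NTHREADS) {
        off_t offset = block * BLOCK_SIZE;
        if (pwrite(fd, buf, BLOCK_SIZE, offset) != (ssize_t)BLOCK_SIZE)
            perror("pwrite");
    }

    free(buf);
    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    close(open(path, O_WRONLY | O_CREAT, 0644));   /* make sure the file exists */
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, writer, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Every thread uses its own fd and pwrite, so there is no application-level
locking anywhere in the test.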

And as I said before, if all threads write to their own separate file, the
resulting bandwidth is ~700Mbytes/sec.
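
The file-per-thread case that gets ~700 Mbytes/sec differs only in the writer:
each thread opens its own file and writes its share sequentially. A sketch
(the file names are made up, constants reused from the sketch above):

/* Sketch of the file-per-thread writer; reuses NTHREADS, BLOCK_SIZE,
 * FILE_SIZE and the includes from the sketch above. */
static void *writer_own_file(void *arg)
{
    long tid = (long)arg;
    char name[64];
    snprintf(name, sizeof(name), "/mnt/lustre/out_file.%ld", tid);  /* made-up name */

    int fd = open(name, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return NULL; }

    char *buf = malloc(BLOCK_SIZE);
    memset(buf, 'a' + (tid % 26), BLOCK_SIZE);

    /* Each thread writes FILE_SIZE / NTHREADS bytes sequentially into its own file. */
    for (off_t offset = 0; offset < (off_t)(FILE_SIZE / NTHREADS); offset += BLOCK_SIZE)
        if (pwrite(fd, buf, BLOCK_SIZE, offset) != (ssize_t)BLOCK_SIZE)
            perror("pwrite");

    free(buf);
    close(fd);
    return NULL;
}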

I have attached my C file (simple_io_test.c). Perhaps you could run it and
see where the bottleneck is. Comments and compilation instructions are
included in the file. Do let me know if you need any clarification.

Your help is appreciated,
Kshitij

> This is what my application does:
>
> Each thread has its own file descriptor to the file.
> I use pwrite to ensure non-overlapping regions, as follows:
>
> Thread 0, data_size: 1MB, offset: 0
> Thread 1, data_size: 1MB, offset: 1MB
> Thread 2, data_size: 1MB, offset: 2MB
> Thread 3, data_size: 1MB, offset: 3MB
>
> <repeat cycle>
> Thread 0, data_size: 1MB, offset: 4MB
> and so on. (This happens in parallel; I don't wait for one cycle to end
> before the next one begins.)
>
> I am going to try the following:
> a)
> Instead of a round-robin distribution of offsets, test with sequential
> offsets:
> Thread 0, data_size: 1MB, offset:0
> Thread 0, data_size: 1MB, offset:1MB
> Thread 0, data_size: 1MB, offset:2MB
> Thread 0, data_size: 1MB, offset:3MB
>
> Thread 1, data_size: 1MB, offset:4MB
> and so on. (I am going to keep these as separate pwrite I/O requests
> instead of merging them or using writev.)
>
> b)
> Map the threads to the no. of OSTs using some modulo, as suggested in the
> email below.
>
> c)
> Experiment with a smaller number of OSTs (I currently have 48).
>
> I shall report back with my findings.
>
> Thanks,
> Kshitij
>
>> [Moved to Lustre-discuss]
>>
>>
>> "However, if I spawn 8 threads such that all of them write to the same
>> file (non-overlapping locations), without explicitly synchronizing the
>> writes (i.e. I don't lock the file handle)"
>>
>>
>> How exactly does your multi-threaded application write the data?  Are
>> you using pwrite to ensure non-overlapping regions, or are the threads
>> all just doing unlocked write() operations on the same fd (each just
>> transferring size/8)?  If the application divides the file into N pieces,
>> and each thread does pwrite on its piece, then what each OST sees is
>> multiple streams at wide offsets to the same object, which could impact
>> performance.
>>
>> If on the other hand the file is written sequentially, where each thread
>> grabs the next piece to be written (locking normally used for the
>> current_offset value, so you know where each chunk is actually going),
>> then you get a more sequential pattern at the OST.
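
(To make sure I understand the "grab the next piece" idea: something like the
sketch below, where a shared cursor hands out consecutive chunks. I have used
a C11 atomic instead of a lock for the current_offset value; names and sizes
are illustrative.)

/* Sequential "grab the next piece" sketch: a shared cursor hands out
 * consecutive chunks, so the file (and each OST object) is written in a
 * mostly sequential order no matter which thread runs next. */
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_long next_offset;                 /* shared current_offset */

static void write_next_chunks(int fd, char *buf, long file_size, long chunk)
{
    for (;;) {
        /* atomically claim the next chunk; pwrite makes further locking unnecessary */
        long offset = atomic_fetch_add(&next_offset, chunk);
        if (offset >= file_size)
            break;
        if (pwrite(fd, buf, chunk, offset) != (ssize_t)chunk)
            perror("pwrite");
    }
}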
>>
>> If the number of threads maps to the number of OSTs (or some modulo,
>> like in your case 6 OSTs per thread), and each thread "owns" the piece
>> of the file that belongs to an OST (i.e. for (offset = thread_num * 6MB;
>> offset < size; offset += 48MB) pwrite(fd, buf, 6MB, offset); ), then
>> you've eliminated the need for application locks (assuming the use of
>> pwrite) and ensured each OST object is being written sequentially.
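
(For my case that would translate to something like the sketch below, assuming
8 threads, 48 OSTs and a 1 MB stripe size, so each thread owns a 6 MB slice of
every 48 MB cycle. The constants and names are illustrative.)

#define MB    (1UL << 20)
#define CHUNK (6 * MB)     /* 6 stripes = 6 OSTs worth of data per thread per cycle */
#define CYCLE (48 * MB)    /* full stripe width: 48 OSTs x 1 MB */

/* buf must hold at least CHUNK bytes; each thread calls this with its own fd. */
static void write_ost_aligned(int fd, long thread_num, char *buf, size_t size)
{
    for (off_t offset = thread_num * CHUNK; offset < (off_t)size; offset += CYCLE)
        if (pwrite(fd, buf, CHUNK, offset) != (ssize_t)CHUNK)
            perror("pwrite");
}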
>>
>> It's quite possible there is some bottleneck on the shared fd.  So
>> perhaps the question is not why you aren't scaling with more threads,
>> but why the single file is not able to saturate the client, or why the
>> file BW is not scaling with more OSTs.  It is somewhat common for
>> multiple processes (on different nodes) to write non-overlapping regions
>> of the same file; does performance improve if each thread opens its own
>> file descriptor?
>>
>> Kevin
>>
>>
>> Wojciech Turek wrote:
>>> OK, so it looks like you have 64 OSTs in total and your output file is
>>> striped across 48 of them. May I suggest that you limit the number of
>>> stripes; a good number to start with would be 8. For best results, also
>>> use the OST pools feature to arrange that each stripe goes to an OST
>>> owned by a different OSS.
>>>
>>> regards,
>>>
>>> Wojciech
>>>
>>> On 23 May 2011 23:09, <kmehta at cs.uh.edu> wrote:
>>>
>>>     Actually, 'lfs check servers' returns 64 entries as well, so I
>>>     presume the
>>>     system documentation is out of date.
>>>
>>>     Again, I am sorry the basic information had been incorrect.
>>>
>>>     - Kshitij
>>>
>>>     > Run lfs getstripe <your_output_file> and paste the output of that
>>>     > command to the mailing list.
>>>     > A stripe count of 48 is not possible if you have at most 11 OSTs
>>>     > (the max stripe count will be 11).
>>>     > If your striping is correct, the bottleneck may be your client
>>>     > network.
>>>     >
>>>     > regards,
>>>     >
>>>     > Wojciech
>>>     >
>>>     >
>>>     >
>>>     > On 23 May 2011 22:35, <kmehta at cs.uh.edu> wrote:
>>>     >
>>>     >> The stripe count is 48.
>>>     >>
>>>     >> Just FYI, this is what my application does:
>>>     >> A simple I/O test where threads continually write blocks of size
>>>     >> 64 Kbytes or 1 Mbyte (decided at compile time) until a large file
>>>     >> of, say, 16 Gbytes is created.
>>>     >>
>>>     >> Thanks,
>>>     >> Kshitij
>>>     >>
>>>     >> > What is your stripe count on the file?  If your default is 1, you
>>>     >> > are only writing to one of the OSTs.  You can check with the lfs
>>>     >> > getstripe command; you can set the stripe count bigger, and
>>>     >> > hopefully your wide-striped file with threaded writes will be
>>>     >> > faster.
>>>     >> >
>>>     >> > Evan
>>>     >> >
>>>     >> > -----Original Message-----
>>>     >> > From: lustre-community-bounces at lists.lustre.org
>>>     >> > [mailto:lustre-community-bounces at lists.lustre.org] On Behalf Of
>>>     >> > kmehta at cs.uh.edu
>>>     >> > Sent: Monday, May 23, 2011 2:28 PM
>>>     >> > To: lustre-community at lists.lustre.org
>>>     >> > Subject: [Lustre-community] Poor multithreaded I/O performance
>>>     >> >
>>>     >> > Hello,
>>>     >> > I am running a multithreaded application that writes to a common
>>>     >> > shared file on a Lustre fs, and this is what I see:
>>>     >> >
>>>     >> > If I have a single thread in my application, I get a bandwidth of
>>>     >> > approx. 250 MBytes/sec (11 OSTs, 1 MByte stripe size). However, if
>>>     >> > I spawn 8 threads such that all of them write to the same file
>>>     >> > (non-overlapping locations), without explicitly synchronizing the
>>>     >> > writes (i.e. I don't lock the file handle), I still get the same
>>>     >> > bandwidth.
>>>     >> >
>>>     >> > Now, instead of writing to a shared file, if these threads write
>>>     >> > to separate files, the bandwidth obtained is approx. 700
>>>     >> > Mbytes/sec.
>>>     >> >
>>>     >> > I would ideally like my multithreaded application to see similar
>>>     >> > scaling. Any ideas why the performance is limited, and any
>>>     >> > workarounds?
>>>     >> >
>>>     >> > Thank you,
>>>     >> > Kshitij
>>>     >> >
>>>     >> >
>>>
>>>
>>>
>>> _______________________________________________
>>> Lustre-community mailing list
>>> Lustre-community at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-community
>>>
>>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: simple_io_test.c
Type: text/x-csrc
Size: 9579 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20110526/16b2680f/attachment.c>

