[Lustre-discuss] Curious about iozone findings of new Lustre FS

Jagga Soorma jagga13 at gmail.com
Wed Mar 3 15:29:18 PST 2010


Or would it be better to increase the stripe count for my Lustre filesystem
to the max number of OSTs?
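(For illustration only -- a minimal sketch of how the stripe count could be
raised, assuming the standard lfs tools and a placeholder directory path:

  # stripe new files in this directory across all available OSTs (-1 = all)
  lfs setstripe -c -1 /mnt/lustre/testdir
  # verify the layout that new files will inherit
  lfs getstripe /mnt/lustre/testdir

New files created under that directory inherit the striping; existing files
keep the layout they were created with.)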

On Wed, Mar 3, 2010 at 3:27 PM, Jagga Soorma <jagga13 at gmail.com> wrote:

> On Wed, Mar 3, 2010 at 2:30 PM, Andreas Dilger <adilger at sun.com> wrote:
>
>> On 2010-03-03, at 12:50, Jagga Soorma wrote:
>>
>>> I have just deployed a new Lustre FS with 2 MDS servers, 2 active OSS
>>> servers (5x2TB OSTs per OSS) and 16 compute nodes.
>>>
>>
>> Does this mean you are using 5 2TB disks in a single RAID-5 OST per OSS
>> (i.e. total OST size is 8TB), or are you using 5 separate 2TB OSTs?
>
>
> No, I am using 5 independent 2TB OSTs per OSS.
>
>
>>
>>
>>> Attached are our findings from the iozone tests. The iozone throughput
>>> tests demonstrate almost linear scalability of Lustre except when
>>> writing files that exceed 128MB in size.  When multiple clients
>>> create/write files larger than 128MB, Lustre throughput levels off at
>>> approximately 1GB/s.  This behavior has been observed with almost all
>>> tested block sizes except 4KB.  I don't have any explanation as to why
>>> Lustre performs poorly when writing large files.
>>>
>>> Has anyone experienced this behaviour?  Any comments on our findings?
>>>
>>
>>
>> The default client tunable is max_dirty_mb=32MB per OSC (i.e. the maximum
>> amount of unwritten dirty data per OST before blocking the process
>> submitting IO).  If you have 2 OST/OSCs and you have a stripe count of 2
>> then you can cache up to 64MB on the client without having to wait for any
>> RPCs to complete.  That is why you see a performance cliff for writes beyond
>> 32MB.
>>
>
> So the true write performance should be measured using files larger than
> 128MB?  If we do expect a large number of large files to be created on the
> Lustre FS, is this something that can be tuned on the client side?  If so,
> where/how can I do this and what would be the recommended settings?
>
>
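(A rough sketch of how the per-OSC dirty-cache limit could be inspected and
raised on a client, assuming a Lustre 1.8-style lctl interface; the value
256 below is purely illustrative, not a recommendation from this thread:

  # show the current per-OSC limit (default is 32MB)
  lctl get_param osc.*.max_dirty_mb
  # raise the limit on this client; not persistent across remount/reboot
  lctl set_param osc.*.max_dirty_mb=256

Since the limit is per OSC, the total a client can cache before blocking is
roughly max_dirty_mb times the stripe count of the file being written.)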
>> It should be clear that the read graphs are meaningless, due to local
>> cache of the file.  I'd hazard a guess that you are not getting 100GB/s from
>> 2 OSS nodes.
>>
>
> Agreed.  Is there a way to find out the size of the local cache on the
> clients?
>
>
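(If it helps, the Lustre client read cache is normally capped per mount by
the llite max_cached_mb tunable, e.g.

  # per-mount limit on cached Lustre file data (client side)
  lctl get_param llite.*.max_cached_mb

though the exact parameter name can vary by release.  Beyond that, cached
pages live in the normal kernel page cache, so single-client re-reads of
files smaller than RAM come back at memory speed, which is what inflates
the read numbers.)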
>>
>> Also, what is the interconnect on the client?  If you are using a single
>> 10GigE then 1GB/s is as fast as you can possibly write large files to the
>> OSTs, regardless of the striping.
>>
>
> I am using Infiniband (QDR) interconnects for all nodes.
>
>
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
>>
>>
>