[Lustre-discuss] Curious about iozone findings of new Lustre FS

Andreas Dilger adilger at sun.com
Wed Mar 3 14:30:56 PST 2010


On 2010-03-03, at 12:50, Jagga Soorma wrote:
> I have just deployed a new Lustre FS with 2 MDS servers, 2 active  
> OSS servers (5x2TB OSTs per OSS) and 16 compute nodes.

Does this mean you are using five 2TB disks in a single RAID-5 OST per
OSS (i.e. a total OST size of 8TB), or are you using 5 separate 2TB
OSTs per OSS?

> Attached are our findings from the iozone tests and it looks like  
> the iozone throughput tests have demonstrated almost linear  
> scalability of Lustre except for when WRITING files that exceed  
> 128MB in size.  When multiple clients create/write files larger than  
> 128MB, Lustre throughput levels off at approximately 1GB/s. This
> behavior has been observed with almost all tested block size ranges  
> except for 4KB.  I don't have any explanation as to why Lustre  
> performs poorly when writing large files.
>
> Has anyone experienced this behaviour?  Any comments on our findings?


The default client tunable is max_dirty_mb=32MB per OSC (i.e. the
maximum amount of unwritten dirty data per OST before the process
submitting IO is blocked).  If you have 2 OSTs/OSCs and a stripe count
of 2, then you can cache up to 64MB on the client without having to
wait for any RPCs to complete.  That is why you see a performance
cliff once a write exceeds what the client can cache (32MB per OST
used by the file).
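If you want to confirm or raise that limit, the per-OSC value can be
inspected and changed with lctl.  A minimal sketch (parameter paths
vary between Lustre versions, and /mnt/lustre is just a placeholder
mount point):

    # show the current per-OSC dirty cache limit (in MB)
    lctl get_param osc.*.max_dirty_mb

    # raise it to e.g. 256MB per OSC to allow more cached dirty data
    lctl set_param osc.*.max_dirty_mb=256

    # check/widen the stripe count so more OSCs contribute cache
    lfs getstripe /mnt/lustre
    lfs setstripe -c -1 /mnt/lustre    # -1 = stripe over all OSTs

Note that lctl set_param changes are not persistent across a remount.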

It should be clear that the read graphs are meaningless, because the
file is being read back from the local client cache.  I'd hazard a
guess that you are not actually getting 100GB/s from 2 OSS nodes.
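To measure read throughput from the OSTs rather than from the client
page cache, one option is to defeat caching during the benchmark.  A
sketch, assuming iozone and a client mounted at /mnt/lustre
(placeholder path):

    # -I opens the file O_DIRECT so reads bypass the client cache;
    # -i 0 = write pass, -i 1 = read pass, 1MB records, 4GB file
    iozone -I -i 0 -i 1 -r 1m -s 4g -f /mnt/lustre/iozone.tmp

    # alternatively, drop client caches between the write and read runs
    sync; echo 3 > /proc/sys/vm/drop_caches

Using file sizes well beyond the client RAM size also helps.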

Also, what is the interconnect on the client?  If you are using a
single 10GigE link, then 1GB/s is as fast as you can possibly write
large files to the OSTs, regardless of the striping.
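For reference, a single 10GigE link carries at most 10Gbit/s, i.e.
about 1.25GB/s on the wire, so ~1GB/s of file data is already close to
the practical limit.  A quick way to see which LNet network the client
is configured on (a sketch; output format varies by version):

    # list this node's LNet NIDs; ...@tcp = Ethernet, ...@o2ib = InfiniBand
    lctl list_nids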

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



