[Lustre-discuss] high IOPS

Wed Dec 2 14:41:16 PST 2009

On 2009-12-02, at 12:15, Craig Tierney wrote:
> Andreas Dilger wrote:
>> On 2009-12-02, at 09:20, Francois Chassaing wrote:
>>> I have a big fundamental question:
>>> if the load that I'll put on the FS is more IOPS-intensive than
>>> throughput-intensive (because I'll access lots of medium-sized files
>>> ~5 MB from a small number of clients), would I be better off with
>>> Lustre or PVFS2?
>>
>> I don't think PVFS2 is necessarily better at IOPS than Lustre.  This
>> is mostly dependent upon the storage configuration.
>>
>>> Also, if the main load is IOPS, shouldn't I oversize the MDS/MDT in
>>> terms of CPU/RAM and storage performance (i.e. as many 15K SAS
>>> RAID10 spindles as possible)?
>>
>> The Lustre MDS/MDT is used only at file lookup/open/close, but is not
>> involved during actual IO operations.  Still, this means in your case
>> that the MDS is getting 2 RPCs (open + close, which can be done
>> asynchronously in memory) for every 5 OST RPCs (5MB read/write, which
>> happen synchronously), so the MDS will definitely need to scale but
>> not necessarily at 2/5 of the total OST size.
>>
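As a rough illustration of the RPC arithmetic above, here is a minimal sketch. It assumes one OST RPC per 1MB of data (implied by "5 OST RPCs" for a 5MB file) and two MDS RPCs per file (open + close). The function name and the 1000 files/sec figure are hypothetical, chosen only to make the ratio concrete.

```python
import math

def rpc_rates(files_per_sec, file_size_mb=5.0, ost_rpc_size_mb=1.0,
              mds_rpcs_per_file=2):
    """Return (mds_rpcs_per_sec, ost_rpcs_per_sec) for whole-file accesses.

    Assumptions (not measured): each file costs one open and one close on
    the MDS, and the data moves in fixed-size OST RPCs.
    """
    ost_rpcs_per_file = math.ceil(file_size_mb / ost_rpc_size_mb)
    return (files_per_sec * mds_rpcs_per_file,
            files_per_sec * ost_rpcs_per_file)

# Hypothetical aggregate load: 1000 whole-file accesses per second.
mds_rate, ost_rate = rpc_rates(files_per_sec=1000)
print(mds_rate, ost_rate)
```

The 2:5 ratio only counts RPCs. It says nothing about their relative cost, which is why the MDS does not necessarily need 2/5 of the OST capacity.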
>> Typical numbers for a high-end MDT node (16-core, 64GB of RAM, DDR IB)
>> are about 8-10k creates/sec, up to 20k lookups/sec from many clients.
>>
>> Depending on the number of files you are planning to have in the
>> filesystem, I would suggest SSDs for the MDT filesystem, especially if
>> you have a large working set and are doing read-mostly access.
>
> Has anyone reported results of an SSD based MDT?

We have done internal testing, and the performance for many workloads  
is somewhat faster, but not a TON faster.  This is because Lustre is  
already doing async IO on the MDS, unlike NFS, so decent streaming IO  
performance and lots of RAM meet many of the create/lookup performance  
targets.

If you have a huge filesystem that is doing a lot of random lookup,
create, and unlink operations, such that the working set is larger than
the MDS RAM (at about 4kB per file for random operations, that is roughly
16M files on a 64GB MDS), then the high IOPS rate of SSDs will definitely
make a huge difference (i.e. keeping 20k lookups/sec on DDR instead of
falling to mdt_disks * 100 lookups/sec).
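
To make the sizing arithmetic above concrete, here is a minimal sketch. The 4kB-per-file and 100-operations-per-disk figures come from the paragraph above. Treating all of the MDS RAM as metadata cache is a simplifying assumption, and the function names and the 8-disk MDT are hypothetical.

```python
GIB = 1024 ** 3

def cacheable_files(mds_ram_bytes, bytes_per_file=4 * 1024):
    """Files whose metadata fits in MDS RAM (assumes all RAM is cache)."""
    return mds_ram_bytes // bytes_per_file

def disk_bound_lookups_per_sec(mdt_disks, iops_per_disk=100):
    """Rough lookup rate once the working set no longer fits in RAM."""
    return mdt_disks * iops_per_disk

# 64GB MDS at about 4kB per file.
print(cacheable_files(64 * GIB))          # 16777216, i.e. about 16M files

# Hypothetical 8-spindle MDT once lookups become seek-bound.
print(disk_bound_lookups_per_sec(8))      # 800
```

With these assumptions, a spinning-disk MDT would need on the order of 200 spindles to sustain 20k random lookups/sec from disk alone, which is where SSDs become attractive.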

Since that isn't a common workload for our customers, we haven't done  
a lot of testing in that area, but it is definitely something I'm  
curious about.

>>> On the budget side, may I use asynchronous DRBD to mirror the MDT
>>> (internal storage), or should I only go with good shared storage
>>> (direct or iSCSI)?
>>
>> Some people on this list have used DRBD, but we haven't tested it
>> ourselves.  I _suspect_ (though have not necessarily tested this) that
>> if you are using DRBD it would be possible to have lower-performance
>> storage on the backup server without significantly impacting the
>> primary server performance, if you are willing to run slower in the
>> rare case when you are failed-over to the backup.
>>
>>> Today I'm leaning towards Lustre, because I've tested it against
>>> glusterfs, and gluster performed slightly worse than Lustre but
>>> failed badly on the bonnie++ create/delete tests. Also, I haven't
>>> given PVFS2 a shot yet...
>>
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>
>
> -- 
> Craig Tierney (craig.tierney at noaa.gov)

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.