[Lustre-discuss] Lustre requirements and tuning tricks

Kevin Van Maren Kevin.Van.Maren at oracle.com
Wed Sep 8 09:00:38 PDT 2010


On Sep 8, 2010, at 8:25 AM, Joe Landman  
<landman at scalableinformatics.com> wrote:

> Joan J. Piles wrote:
>
>> And then 2 MDS like these:
>>
>> - 2 x Intel 5520 (quad core) processor (or equivalent).
>> - 36Gb RAM.
>> - 2 x 64Gb SSD disks.
>> - 2 x10Gb Ethernet ports.
>
> Hmmm ....

In general there is not much gain from using SSDs for the MDT, and  
depending on the SSD, it could do much _worse_ than spinning rust.   
Many SSD controllers degrade horribly under a small random write  
workload.  (SSDs are best for sequential write and random read.)

Journals may receive some benefit, as the sequential write pattern  
works much better for SSDs, although SSDs are not normally needed there.


>
>> After having read the documentation, it seems to be a sensible
>> configuration, especially regarding the OSS. However, we are not so
>> sure about the MDS. We have seen recommendations to reserve 5% of the
>> total file system space in the MDS. Is this true, and should we then
>> go for 2x2Tb SAS disks for the MDS? Is SSD really worth it there?
>
> There is a nice formula for approximating your MDS needs on the wiki.
> Basically it is something to the effect of
>
>    Number-of-inodes-planned * 1kB = storage space required
>
> So, for 10 million inodes, you need ~10 GB of space.  I am not sure if
> this helps, but you might be able to estimate your likely usage
> scenario.  Updating MDSes isn't easy (e.g. you have to pre-plan)


It is 4KB/inode on the MDT.  (It can be set to 2KB if you need 4  
billion files on an 8TB MDT).
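
As a rough worked example of that arithmetic (a sketch only: the
function names are made up for illustration, and it assumes binary
units, i.e. 1KB = 1024 bytes and 1TB = 1024^4 bytes):

```python
KIB = 1024
TIB = 1024 ** 4


def mdt_bytes_needed(inodes, bytes_per_inode=4 * KIB):
    """MDT space consumed by `inodes` files at the given bytes-per-inode ratio."""
    return inodes * bytes_per_inode


def max_inodes(mdt_bytes, bytes_per_inode=4 * KIB):
    """Upper bound on the number of inodes an MDT of this size can hold."""
    return mdt_bytes // bytes_per_inode


if __name__ == "__main__":
    # Space for 10 million inodes at 4KB/inode, in GiB.
    print(mdt_bytes_needed(10_000_000) / 1024 ** 3)
    # Inode capacity of an 8TB MDT at 4KB/inode and at 2KB/inode.
    print(max_inodes(8 * TIB))
    print(max_inodes(8 * TIB, 2 * KIB))
```

So 10 million inodes works out to roughly 40GB rather than ~10GB, and
an 8TB MDT tops out at about 2 billion inodes at 4KB/inode, or about
4 billion at 2KB/inode.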

My sizing rule of thumb has been ~ one MDT drive in RAID10 for each  
OST, to ensure you scale IOPS.
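
That rule of thumb can be written down as a tiny helper.  This is only
a sketch: the function name is hypothetical, and rounding up to an even
drive count (minimum 2) is my own assumption, made because RAID10 is
built from mirrored pairs.

```python
def mdt_drives_for(ost_count):
    """Suggested number of MDT drives for a RAID10 MDT, at ~one drive per OST.

    Assumption: round up to an even count (minimum 2), since RAID10
    is assembled from mirrored pairs of drives.
    """
    if ost_count < 1:
        raise ValueError("ost_count must be at least 1")
    drives = max(ost_count, 2)
    # Add one drive if the count is odd, so every drive has a mirror partner.
    return drives + (drives % 2)


if __name__ == "__main__":
    for osts in (1, 7, 8, 24):
        print(osts, "OSTs ->", mdt_drives_for(osts), "MDT drives")
```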


>
>> And we have also read about having a separate storage for the OSTs'
>> journals. Is it really useful to get a pair of extra small (16Gb) SSD
>> disks for each OST to keep the journals and bitmaps?

It doesn't have to be SSD, and bitmaps are only applicable for  
software RAID.  But unless you use asynchronous journals, there is  
normally a big win from external journals -- even with HW RAID having  
non-volatile storage.  The big win is putting journals on RAID 1,  
rather than RAID 5/6.


>>
>> Finally, we have also read that it's important to have different
>> OSTs on different physical drives to avoid bottlenecks. Is this so
>> if we make a big RAID volume and then several logical volumes (done
>> with the hardware RAID card, so the operating system would just see
>> different block devices)?
>
> Yes, though this will be suboptimal in performance.  You want traffic
> to different LUNs not sharing the same physical disks.  Build smaller
> RAID containers, and single LUNs atop those.

You get the best performance with one HW RAID per OST.  And that RAID  
should be optimized for 1MB IO (i.e., not 6+P) for best performance  
without having to muck with a bunch of parameters.  If the OSTs are on  
the same drives, then there will be excessive head contention as  
different OST filesystems seek the same disks, greatly reducing  
throughput.
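
To illustrate why 6+P is awkward for 1MB IO, here is a small sketch
that checks whether a given number of data disks can make one full
stripe exactly 1MB.  The function name is hypothetical, and it assumes
the RAID chunk (per-disk segment) size has to be a power of two, which
is typical of RAID controllers but worth checking for yours.

```python
MIB = 1024 * 1024


def chunk_for_1mb_stripe(data_disks):
    """Per-disk chunk size in bytes that makes one full stripe exactly 1 MiB.

    Returns None when 1 MiB does not divide evenly across the data disks.
    Because 1 MiB is a power of two, an even split always yields a
    power-of-two chunk, so no separate power-of-two check is needed.
    """
    if data_disks < 1:
        raise ValueError("data_disks must be at least 1")
    if MIB % data_disks != 0:
        return None
    return MIB // data_disks


if __name__ == "__main__":
    for data_disks in (4, 6, 8):
        chunk = chunk_for_1mb_stripe(data_disks)
        if chunk is None:
            print(f"{data_disks}+P: no chunk size gives a 1MB full stripe")
        else:
            print(f"{data_disks}+P: {chunk // 1024}KB chunk")
```

With 4 or 8 data disks a 256KB or 128KB chunk gives a 1MB full-stripe
write; with 6 data disks nothing divides evenly, so 1MB writes turn
into partial-stripe updates.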


> -- 
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics Inc.
> email: landman at scalableinformatics.com
> web  : http://scalableinformatics.com
>        http://scalableinformatics.com/jackrabbit
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615


