[Lustre-discuss] OST size limitation

Andreas Dilger adilger at whamcloud.com
Thu Nov 3 11:40:14 PDT 2011


On 2011-11-02, at 2:09 PM, Kevin Van Maren wrote:
> On Nov 2, 2011, at 1:48 PM, Charland, Denis wrote:
>> I read in the Lustre Operations Manual that there is an OST size
>> limitation of 16 TB on RHEL and 8 TB on other distributions because
>> of the ext3 file system limitation. I have a few questions about that.
>>  
>> Why is the limitation 16 TB on RHEL?
> 
> 16TB is the maximum size RedHat supports.  See http://www.redhat.com/rhel/compare/
> Larger than that requires bigger changes.
> 
> Note that whamcloud's 1.8.6-wc1 claimed support for 24TB LUNs (but see http://jira.whamcloud.com/browse/LU-419 ).

That is just from not wanting to force ext4 formatting for users that do
not need it.  As discussed in that bug, using '--mkfsoptions="-t ext4"' allows
formatting LUNs over 16TB.
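
For anyone wondering where the 16TB boundary comes from, here is a rough
sketch of the arithmetic. It assumes 4 KiB blocks and 32-bit block numbers.
The 31-bit case is only my assumption about how the 8 TB figure for other
distributions arises (signed block numbers), not something stated in the
manual. The function name is made up for illustration.

```python
# Back-of-the-envelope filesystem size limits from block-number width.
# Illustrative only: BLOCK_SIZE and the bit widths are assumptions.
BLOCK_SIZE = 4096  # bytes; assumed 4 KiB block size
TIB = 2 ** 40

def max_fs_bytes(block_number_bits, block_size=BLOCK_SIZE):
    """Largest addressable filesystem size for a given block-number width."""
    return (2 ** block_number_bits) * block_size

if __name__ == "__main__":
    for bits in (31, 32):
        print(f"{bits}-bit block numbers: {max_fs_bytes(bits) // TIB} TiB")
```

With these assumptions, 32-bit block numbers give 16 TiB and 31-bit block
numbers give 8 TiB.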

This will be the default for 1.8.7-wc because all supported distros are
only using ext4-based ldiskfs.

> Whamcloud's Lustre 2.1 (not sure you'd want to use it) claims support for 128TB LUNs.

We tested LUNs this large (filling full and verifying all data), but I don't
expect they will be needed for some time yet.

>> I plan to use Lustre 1.8.5 on Fedora 12 for a new Lustre file system. What will be the OST size limitation?
>>  
>> What is the OST size limitation when using ext4?
> 
> 16TB with the Lustre-patched RHEL kernel.

You will have problems running the 1.8.5 RHEL5 kernel on Fedora 12 because
the init scripts are different.  Also, as Kevin writes, none of the >16TB
fixes are included in 1.8.5.  I would strongly recommend running 1.8.6 instead.

>> Is it preferable to use ext4 instead of ext3?
>>  
>> If the block device has more than 8 TB or 16 TB, it must be partitioned.
>> Is there a performance degradation when a device has multiple partitions
>> compared to a single partition? In other words, is it better to have
>> three 8 TB devices with one partition per device than to have one 24 TB
>> device with three partitions?
> 
> Better to have 3 separate 8TB LUNs.  Different OSTs forcing the same drive heads to move to opposite parts of the disk does degrade performance (with a single OST moving the drive heads, the block allocator tries to minimize movement).

Not only is the seeking evil (talk to Kevin if you want to run 24TB OSTs on
flash :-), but the 512-byte sector offset added by the partition table will
cause all IO to be misaligned to the underlying device.

Even with flash storage it is much better to align the IO on power-of-two
boundaries, since the erase blocks cause extra latency if there are
read-modify-write operations.
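
To make the misalignment concrete, here is a minimal sketch that counts how
many device chunks (RAID stripes or flash erase blocks) a single IO overlaps.
The 1 MiB chunk and IO sizes are assumptions picked for illustration, and
chunks_touched() is a hypothetical helper, not part of any Lustre tool.

```python
MIB = 1024 * 1024

def chunks_touched(io_offset, io_size, partition_offset, chunk_size):
    """Count device chunks (RAID stripes or erase blocks) one IO overlaps."""
    start = partition_offset + io_offset   # first byte on the raw device
    end = start + io_size - 1              # last byte on the raw device
    return end // chunk_size - start // chunk_size + 1

# Assumed example: a 1 MiB IO against a device with 1 MiB chunks.
whole_device = chunks_touched(0, MIB, 0, MIB)     # OST on the whole device
partitioned = chunks_touched(0, MIB, 512, MIB)    # OST starts 512 bytes in

if __name__ == "__main__":
    print(whole_device, partitioned)
```

Under these assumptions the aligned IO stays inside one chunk, while the
same IO shifted by 512 bytes straddles two, so every write touches a second
chunk that is only partially updated.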

Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.





