[Lustre-discuss] OST size limitation

Thu Aug 2 09:21:34 PDT 2012

On 2012-08-02, at 8:50, "Christopher J.Walker" <C.J.Walker at qmul.ac.uk> wrote:
> Picking up on an old message...
> 
> On 03/11/11 18:40, Andreas Dilger wrote:
>> On 2011-11-02, at 2:09 PM, Kevin Van Maren wrote:
>>> On Nov 2, 2011, at 1:48 PM, Charland, Denis wrote:
>>>> I read in the Lustre Operations Manual that there is an OST size
>>>> limitation of 16 TB on RHEL and 8 TB on other distributions because
>>>> of the ext3 file system limitation. I have a few questions about that.
>>>> 
>>>> Why is the limitation 16 TB on RHEL?
>>> 
>>> 16TB is the maximum size RedHat supports.  See http://www.redhat.com/rhel/compare/
>>> Larger than that requires bigger changes.
>>> 
>>> Note that whamcloud's 1.8.6-wc1 claimed support for 24TB LUNs (but see http://jira.whamcloud.com/browse/LU-419 ).
>> 
>> That is just from not wanting to force ext4 formatting for users that do
>> not need it.  As discussed in that bug, using '--mkfsopts="-t ext4"' allows
>> formatting LUNs over 16TB.
>> 
>> This will be the default for 1.8.7-wc because all supported distros are
>> only using ext4-based ldiskfs.
>> 
>>> Whamcloud's Lustre 2.1 (not sure you'd want to use it) claims support for 128TB LUNs.
>> 
>> We tested LUNs this large (filling full and verifying all data), but I don't
>> expect they will be needed for some time yet.
> 
> They would be useful to us with 1.8.8-wc1. We have disk servers where we
> want to use 30TB OSTs - this is annoyingly just over the 24TiB limit [1].
> 
> When I try to create a filesystem, it fails with:
> 
> mkfs.lustre: Unable to mount /dev/sdb: Invalid argument
> mkfs.lustre FATAL: failed to write local files
> mkfs.lustre: exiting with 22 (Invalid argument)
> 
> And I see the following in /var/log/messages [2]:
> 
> LDISKFS-fs does not support filesystems greater than 24TB and can cause
> data corruption.Use "force_over_24tb" mount option to override.
> 
> Is this warning just being cautious - or are there known issues? Has
> there been testing of this in the last 9 months?

It is about being cautious and only allowing what we have tested. There are no limits that I'm aware of that differentiate between 24TB and 32TB, but we never tested more than this.

At a very minimum, you need to be running e2fsprogs-1.42.3-wc1, since it fixes one critical bug for filesystems larger than 16TB (which was proportionally more likely to be hit for larger filesystems).

It would also be useful if you reported your success or failure at this size to the list, since I don't think many sites are using LUNs this large yet.
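
For reference, the sizes in this thread can be checked with a little arithmetic. This is only a rough sketch under stated assumptions: vendor "TB" is taken as 10^12 bytes, RAID 6 is assumed to lose two disks to parity, the 16TB figure is assumed to come from 32-bit block numbers with 4 KiB blocks, and the tested limit is treated as 24 TiB. The helper name is made up for illustration.

```python
# Capacity arithmetic for the OST sizes discussed in this thread.
# Assumptions (not from the thread): 1 TB = 10**12 bytes, RAID 6 costs
# two disks of parity, and the tested limit is interpreted as 24 TiB.

TB = 10**12   # decimal terabyte, as used by disk vendors
TiB = 2**40   # binary tebibyte

def raid6_usable_bytes(disks, disk_bytes):
    """Usable bytes of a RAID 6 array (hypothetical helper): two disks of parity."""
    return (disks - 2) * disk_bytes

new_servers = raid6_usable_bytes(12, 3 * TB)   # 12 x 3TB disks
old_servers = raid6_usable_bytes(12, 2 * TB)   # 12 x 2TB disks

# 32-bit block numbers with 4 KiB blocks give the 16 TiB ceiling.
limit_32bit_blocks = 2**32 * 4096
limit_tested = 24 * TiB

for name, size in (("12 x 3TB", new_servers), ("12 x 2TB", old_servers)):
    print(f"{name}: {size / TB:.0f} TB = {size / TiB:.2f} TiB, "
          f"over 16 TiB: {size > limit_32bit_blocks}, "
          f"over 24 TiB: {size > limit_tested}")
```

Under these assumptions the 12 x 3TB array lands between 24 TiB and 32 TiB, which is the untested range mentioned above.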

>>>> I plan to use Lustre 1.8.5 on Fedora 12 for a new Lustre file system. What will be the OST size limitation?
>>>> 
>>>> What is the OST size limitation when using ext4?
>>> 
>>> 16TB with the Lustre-patched RHEL kernel.
>> 
>> You will have problems running the 1.8.5 RHEL5 kernel on FC 12 because the
>> init scripts are different.  Also, as Kevin writes, none of the >16TB fixes
>> are included in 1.8.5.  I would strongly recommend running 1.8.6 instead.
>> 
>>>> Is it preferable to use ext4 instead of ext3?
>>>> 
>>>> If the block device has more than 8 TB or 16 TB, it must be partitioned.
>>>> Is there a performance degradation when a device has multiple partitions
>>>> compared to a single partition? In other words, is it better to have
>>>> three 8 TB devices with one partition per device than to have one 24 TB
>>>> device with three partitions?
>>> 
>>> Better to have 3 separate 8TB LUNs.  Different OSTs forcing the same drive heads to move to opposite parts of the disk does degrade performance (with a single OST moving the drive heads, the block allocator tries to minimize movement).
> 
> The advantage of 1 partition of 30TB is we avoid losing the space taken
> up by creating multiple LUNs and the performance degradation of
> different partitions.
> 
>> Not only is the seeking evil (talk to Kevin if you want to run 24TB OSTs on
>> flash :-), but the 512-byte sector offset added by the partition table will
>> cause all IO to be misaligned to the underlying device.
>> 
>> Even with flash storage it is much better to align the IO on power-of-two
>> boundaries, since the erase blocks cause extra latency if there are read-
>> modify-write operations.
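
To illustrate the alignment point, here is a minimal sketch that counts how many device chunks a single IO touches with and without a 512-byte partition offset. The 1 MiB chunk size and the function name are assumptions chosen for the example, not values taken from this thread; substitute the real RAID stripe or erase-block size.

```python
# Effect of a partition offset on IO alignment (illustrative only).
# CHUNK is an assumed 1 MiB RAID stripe / erase-block size.

SECTOR = 512
CHUNK = 1024 * 1024

def chunks_touched(device_offset, io_size, chunk=CHUNK):
    """Number of chunk-sized device regions covered by one IO (hypothetical helper)."""
    first = device_offset // chunk
    last = (device_offset + io_size - 1) // chunk
    return last - first + 1

# A 1 MiB write at filesystem offset 0 on a whole (unpartitioned) device:
whole_device = chunks_touched(0, CHUNK)

# The same write when the partition starts one 512-byte sector in:
partitioned = chunks_touched(SECTOR, CHUNK)

print("whole device:", whole_device, "chunk(s) touched")
print("partitioned: ", partitioned, "chunk(s) touched")
```

With the offset, every full-chunk write straddles two chunks, so the device has to do a read-modify-write on both instead of a single full-chunk write.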
> 
> 
> Chris
> 
> [1] We do appreciate that with 12*3TB disks as a RAID 6 array we may not
> get the performance of an 8+2 array, but we would like to keep the
> capacity (and the performance of older servers with 12*2TB disks is
> "good enough").
> 
> [2] It would be helpful if I saw this error on the terminal too.

It is not possible to print messages from within the kernel to the terminal.

> PS man mkfs.lustre is somewhat out of date - it says:
>     mkfs.lustre is part of the Lustre(7) filesystem package and  is
>       available from Sun Microsystems via
>       http://downloads.lustre.org/

I believe we've fixed this for newer 2.x releases, but not for 1.8.

Cheers, Andreas