[Lustre-discuss] OST size limitation
C.J.Walker at qmul.ac.uk
Thu Aug 2 08:50:41 PDT 2012
Picking up on an old message...
On 03/11/11 18:40, Andreas Dilger wrote:
> On 2011-11-02, at 2:09 PM, Kevin Van Maren wrote:
>> On Nov 2, 2011, at 1:48 PM, Charland, Denis wrote:
>>> I read in the Lustre Operations Manual that there is an OST size
>>> limitation of 16 TB on RHEL and 8 TB on other distributions because
>>> of the ext3 file system limitation. I have a few questions about that.
>>> Why is the limitation 16 TB on RHEL?
>> 16TB is the maximum size Red Hat supports. See http://www.redhat.com/rhel/compare/
>> Larger than that requires bigger changes.
>> Note that Whamcloud's 1.8.6-wc1 claimed support for 24TB LUNs (but see http://jira.whamcloud.com/browse/LU-419 ).
> That is just from not wanting to force ext4 formatting for users that do
> not need it. As discussed in that bug, using --mkfsoptions="-t ext4" allows
> formatting LUNs over 16TB.
> This will be the default for 1.8.7-wc because all supported distros are
> only using ext4-based ldiskfs.
>> Whamcloud's Lustre 2.1 (not sure you'd want to use it) claims support for 128TB LUNs.
> We tested LUNs this large (filling full and verifying all data), but I don't
> expect they will be needed for some time yet.
They would be useful to us with 1.8.8-wc1. We have disk servers where we
want to use 30TB OSTs; annoyingly, this is just over the 24TiB limit.
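To put numbers on these sizes, here is a small Python sketch of the capacity arithmetic. It assumes RAID 6 spends two disks' worth of space on parity, that drive sizes are decimal terabytes (10^12 bytes), and that the "24TB" in the ldiskfs warning means 24 TiB (2^40 bytes each). The 16 TiB figure is what 32-bit block numbers with 4 KiB blocks can address. The function names are illustrative only.

```python
TB = 10**12  # decimal terabyte, as used on drive labels
TiB = 2**40  # binary tebibyte

# ext3 (or ext4 without 64-bit block numbers): 2**32 blocks of 4 KiB = 16 TiB
EXT3_LIMIT = 2**32 * 4096
# Assumption: the "24TB" in the ldiskfs warning means 24 TiB
LDISKFS_LIMIT = 24 * TiB


def raid6_usable_bytes(n_disks: int, disk_tb: int) -> int:
    """Usable bytes of a RAID 6 array: two disks' worth of space holds parity."""
    if n_disks < 4:
        raise ValueError("RAID 6 needs at least 4 disks")
    return (n_disks - 2) * disk_tb * TB


def report(n_disks: int, disk_tb: int) -> str:
    """One-line summary of an array's size and which limits it exceeds."""
    size = raid6_usable_bytes(n_disks, disk_tb)
    return (f"{n_disks} x {disk_tb}TB RAID 6: {size / TB:.1f} TB = "
            f"{size / TiB:.2f} TiB, over 16 TiB: {size > EXT3_LIMIT}, "
            f"over 24 TiB: {size > LDISKFS_LIMIT}")


if __name__ == "__main__":
    print(report(12, 3))  # the new 30TB OSTs
    print(report(12, 2))  # the older 12*2TB servers
```

Under these assumptions, 12 x 3TB gives 27.28 TiB, above both limits, while 12 x 2TB (18.19 TiB) sits between them.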
When I try to create a filesystem, it fails with:
mkfs.lustre: Unable to mount /dev/sdb: Invalid argument
mkfs.lustre FATAL: failed to write local files
mkfs.lustre: exiting with 22 (Invalid argument)
And I see the following in /var/log/messages:
LDISKFS-fs does not support filesystems greater than 24TB and can cause
data corruption.Use "force_over_24tb" mount option to override.
Is this warning just being cautious, or are there known issues? Has
there been testing of this in the last 9 months?
>>> I plan to use Lustre 1.8.5 on Fedora 12 for a new Lustre file system. What will be the OST size limitation?
>>> What is the OST size limitation when using ext4?
>> 16TB with the Lustre-patched RHEL kernel.
> You will have problems running the 1.8.5 RHEL5 kernel on FC 12 because the
> init scripts are different. Also, as Kevin writes, none of the >16TB fixes
> are included in 1.8.5. I would strongly recommend running 1.8.6 instead.
>>> Is it preferable to use ext4 instead of ext3?
>>> If the block device has more than 8 TB or 16 TB, it must be partitioned.
>>> Is there a performance degradation when a device has multiple partitions
>>> compared to a single partition? In other words, is it better to have
>>> three 8 TB devices with one partition per device than to have one 24 TB
>>> device with three partitions?
>> Better to have 3 separate 8TB LUNs. Different OSTs forcing the same drive heads to move to opposite parts of the disk does degrade performance (with a single OST moving the drive heads, the block allocator tries to minimize movement).
The advantage of 1 partition of 30TB is we avoid losing the space taken
up by creating multiple LUNs and the performance degradation of
> Not only is the seeking evil (talk to Kevin if you want to run 24TB OSTs on
> flash :-), but the 512-byte sector offset added by the partition table will cause all IO to be misaligned to the underlying device.
> Even with flash storage it is much better to align the IO on power-of-two
> boundaries, since the erase blocks cause extra latency if there are
> read-modify-write operations.
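The alignment point can be checked with a few lines of arithmetic. This Python sketch assumes 512-byte logical sectors and a hypothetical RAID geometry of 8 data disks with a 128 KiB chunk (a 1 MiB full stripe). Start sector 63 is the classic DOS/fdisk default, 2048 is the 1 MiB-aligned start used by newer partitioning tools, and 0 stands for an unpartitioned whole-device OST.

```python
SECTOR = 512              # bytes per logical sector (assumed)
STRIPE = 8 * 128 * 1024   # hypothetical: 8 data disks x 128 KiB chunk = 1 MiB


def is_aligned(start_sector: int, boundary: int = STRIPE,
               sector_size: int = SECTOR) -> bool:
    """True if a partition starting at start_sector begins on a multiple of boundary bytes."""
    return (start_sector * sector_size) % boundary == 0


if __name__ == "__main__":
    # 0: whole device, no partition table; 63: classic DOS default; 2048: 1 MiB start
    for start in (0, 63, 2048):
        print(f"start sector {start:4d}: offset {start * SECTOR:7d} bytes, "
              f"stripe-aligned: {is_aligned(start)}")
```

Only the unpartitioned device and the 1 MiB-aligned start land on a stripe boundary; with the sector-63 layout every I/O is shifted 32256 bytes relative to the stripes.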
We do appreciate that with 12*3TB disks as a RAID 6 array we may not
get the performance of an 8+2 array, but we would like to keep the
capacity (and the performance of older servers with 12*2TB disks is
It would be helpful if I saw this error on the terminal too.
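The 8+2 versus 12-disk RAID 6 trade-off mentioned above largely comes down to whether a client write fills whole RAID stripes. A minimal Python sketch, assuming clients issue 1 MiB bulk writes and a hypothetical 128 KiB chunk size: with 8 data disks one write is exactly one full stripe, while the 10 data disks of a 12-disk RAID 6 give a 1.25 MiB stripe, so each 1 MiB write is a partial stripe that forces a parity read-modify-write.

```python
KiB = 1024
RPC_BYTES = 1024 * KiB  # assumption: clients write in 1 MiB bulk RPCs


def full_stripe_bytes(data_disks: int, chunk_kib: int) -> int:
    """Data bytes in one full RAID stripe (parity disks excluded)."""
    return data_disks * chunk_kib * KiB


def write_fills_stripes(data_disks: int, chunk_kib: int,
                        write_bytes: int = RPC_BYTES) -> bool:
    """True if a stripe-aligned write of write_bytes covers only whole stripes,
    so parity can be computed without reading old data (no read-modify-write)."""
    return write_bytes % full_stripe_bytes(data_disks, chunk_kib) == 0


if __name__ == "__main__":
    for data_disks in (8, 10):  # 8+2 RAID 6 vs 12-disk (10+2) RAID 6
        stripe_kib = full_stripe_bytes(data_disks, 128) // KiB
        print(f"{data_disks}+2, 128 KiB chunk: stripe {stripe_kib} KiB, "
              f"1 MiB write fills whole stripes: {write_fills_stripes(data_disks, 128)}")
```

Changing the chunk size does not rescue the 10+2 layout: any power-of-two chunk times 10 data disks is not a divisor of 1 MiB.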
PS: the mkfs.lustre man page is somewhat out of date; it says:
mkfs.lustre is part of the Lustre(7) filesystem package and is
available from Sun Microsystems via