[Lustre-discuss] Lustre Storage Sizing- How?

Peter Grandi pg_lus at lus.for.sabi.co.UK
Sun Jan 10 11:57:24 PST 2010


[ ... ]

>> I am considering a new storage of 30 TB usable space with a 2
>> GB/s sustained read write performance in clustered mode.

This spec is comically vague (see below) and anyhow it is going to be
quite challenging. Some people who usually know what they are doing
(CERN) currently expect around 20MB/s duplex transfer rate per TB,
and I know of one storage system that is getting 3-4GB/s duplex (on a
fresh install) with around 240 1TB drives (including overheads). If
you want to guarantee 2GB/s sustained duplex perhaps aiming for
3-4GB/s is a good idea. See at the end for a similar conclusion.

So getting to 2GB/s sustained duplex is going to require quite
careful consideration of the particular circumstances if one wants it
done with just 30TB worth of drives.

As to vague, one Lustre storage I know of was initially specified
with the same level of (un)detail, and with a bit of prodding it got
a more definite target performance envelope. That is vital.

>> But not able to figure out sizing part of it like what OSS, what
>> OST and what MDS.

Partitioning space between various types of Lustre data areas is the
least of your problems. The bigger issue is the structure of the
storage system on which Lustre runs.

>> Urgent help would be highly appreciable

People usually pay for urgent help, especially for difficult cases.
You should hire a good consultant (e.g. from Sun) who will ask you a
lot of questions.

> Hi Deval, Lustre storage sizing is largely driven by: * Capacity
> required * Performance required * Type of workload

Just about only on capacity required. Performance required, given the
type of workload (both static, the distribution of file sizes, and
dynamic, the patterns of access), drives storage structure more than
storage sizing, and indeed later you talk about structure without
considering the workload.

Storage and Lustre filesystem structure (not mere sizing) depends
greatly on things like size of files, sequentiality of access, size
of IO operations, number of files being concurrently worked on,
number of processes concurrently working on the same file, etc.; a list
of several of these is here:

  http://www.sabi.co.uk/blog/0804apr.html#080415

A pretty vital detail here is how many clients will make up that
target of 2GB/s duplex, and distinctly for reading and writing. It
could be 20 1Gb/s clients writing at 100MB/s each, and 1 analysis
client reading at the same time at 2GB/s over multiple 10Gb/s links,
for example.
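As a back-of-the-envelope check of that hypothetical mix (the client
counts and link speeds are just the illustrative numbers above, not
anything from the original question):

  # Back-of-the-envelope for the hypothetical client mix above
  writers=20; writer_mbs=100     # 20 clients on 1Gb/s links, ~100MB/s each
  reader_mbs=2000                # 1 analysis client over bonded 10Gb/s links
  echo "aggregate write: $(( writers * writer_mbs )) MB/s"   # 2000 MB/s
  echo "aggregate read:  ${reader_mbs} MB/s"                 # 2000 MB/s
  # i.e. the backend must sustain ~2GB/s in each direction at the same time.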

Another interesting dimension is whether a single storage pool is
necessary or not, or just a single namespace (to some extent Lustre
is in between) with multiple pools and suitable use of mountpoints:

  http://www.sabi.co.uk/blog/0906Jun.html#090614
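For the "multiple pools, single namespace" case, Lustre 1.8 has OST
pools; a minimal sketch, assuming a made-up filesystem name "lusfs"
and made-up OST indexes:

  # On the MGS: group OSTs into named pools (Lustre 1.8+ OST pools).
  lctl pool_new lusfs.fast
  lctl pool_add lusfs.fast lusfs-OST[0000-0003]
  lctl pool_new lusfs.bulk
  lctl pool_add lusfs.bulk lusfs-OST[0004-000b]
  # On a client: bind directories to a pool via their striping policy.
  lfs setstripe --pool fast /mnt/lusfs/scratch
  lfs setstripe --pool bulk /mnt/lusfs/archive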

It is also important to know the availability requirements for the
storage system. Does the "sustained" in "2 GB/s sustained" mean for a
stretch of time or 24x7?

Someone who asks for "Urgent help" should be nice enough to provide
all these interesting aspects of requirements to the storage
consultant they are going to hire.

> Lustre 1.8.1.1 has a limit of 8 TB for an individual OST. Lets say
> you are using SATA disks for OST. A Seagate enterprise 1TB SATA
> disk can do around 90 MB/Sec with 1 MB blocksize using dd (can go
> upto 110 MB/Sec if blocksize is really large).

Unfortunately only on the outer tracks and on a fresh filesystem. See
for example:

 https://www.rz.uni-karlsruhe.de/rz/docs/Lustre/ssck_sfs_isc2007

   «Performance degradation on xc2
          After 6 months of production we lost half of the file system
          performance
              Problem is under investigation by HP
              We had a similar problem on xc1 which was due to fragmentation
              Current solution for defragmentation is to recreate file systems»

Note that it is not just "due to fragmentation": even without it, as a
filesystem fills, blocks will (usually) start being allocated from the
inner tracks, and thus the raw transfer rate will eventually nearly halve:

  base# disktype /dev/sdd | grep 'device, size'
  Block device, size 931.5 GiB (1000203804160 bytes)
  base# dd bs=1M count=1000 iflag=direct if=/dev/sdd of=/dev/null skip=0
  1000+0 records in
  1000+0 records out
  1048576000 bytes (1.0 GB) copied, 9.53604 seconds, 110 MB/s
  base# dd bs=1M count=1000 iflag=direct if=/dev/sdd of=/dev/null skip=950000
  1000+0 records in
  1000+0 records out
  1048576000 bytes (1.0 GB) copied, 18.3843 seconds, 57.0 MB/s

Amazingly in the "outer tracks and on fresh filesystem" case a modern
1TB SATA disk with a reasonable file system type can do 110MB/s even
with smallish block sizes:

  base# fdisk -l /dev/sdd | grep sdd3
  /dev/sdd3   *           2        1769    14201460   17  Hidden HPFS/NTFS
  base# mkfs.ext4 -q /dev/sdd3
  base# mount -t ext4 /dev/sdd3 /mnt
  base# dd bs=64k count=100000 conv=fsync if=/dev/zero of=/mnt/TEST
  100000+0 records in
  100000+0 records out
  6553600000 bytes (6.6 GB) copied, 59.7662 seconds, 110 MB/s

(BTW I have used 'ext4' because this is about Lustre; I usually
prefer JFS for various reasons.)

I have been quite impressed that one can get 90MB/s "outer tracks and
on fresh filesystem" from a contemporary low power laptop drive:

  http://www.sabi.co.uk/blog/0906Jun.html#090605

> Assuming that you are looking for RAID6 protection for OST,
> you need 10 SATA disks to form a 8 TB lun.

Why would you assume that? (See below on formulaic approaches.) Why
use parity RAID, which is known to cause performance problems on
writes unless they are all stripe-aligned, when the only detail
provided is a throughput target for writing? Perhaps it is a DAQ or
other recording application given the 2GB/s goal, but perhaps it does
not do large writes.

> You will need 4 such OSTs to give you 32 TB unformatted space.
> Lets consider performance:

> Ideally, you should get 720 MB/Sec/OST [ 90 MB/sec/disk X 8 data
> disks in (8+2) RAID6 set]. But you have to cater for overhead of
> software/hardware RAID and limits of SAS PCIe HCA (or FC hardware
> RAID HCA). A 4gbps FC HCA tops out at 500 MB/Sec so you need 5-6
> FC HCAs to utilize storage bandwidth of 4 RAID6 OSTs [Total
> bandwidth = 4 X 720 MB/Sec/OST = 2.8 GB/Sec].

That "4 X 720MB/Sec" applies only if there are at least 4 stripes per
file and they get written in bulk, as you imply immediately below, or
there are at least 4 files being written at the same time and they
end up on different OSTs. The very vague requirement for "clustered
mode" does not quite make clear which one.

> So, now you have a storage system that delivers 32 TB unformatted
> space and 2.8 GB/Sec of performance for large sequential read/write
> workload.

The read and write performance may be quite different on many
workloads because of the RAID6 (stripe alignment), even if probably
"large sequential" is going to be fine, and again only if the
concurrency is just right and on outer tracks on a fresh filesystem.

> If you are planning to have mixed or small io workload and still
> want to achieve 2 GB/Sec throughput, you have to double the specs.

Why just double? Why not consider other storage systems like RAID10
or SSDs?

> Small, random IO (think of home directories) kills storage
> performance.

Depends on the storage system...

> Lets size MDS now.

> It is a good idea to use FC or SAS disks for MDS as they spin at
> higher rate and have better IOPS performance. For example, lets
> consider Seagate enterprise 15 K rpm 300 GB SAS disks. You can put
> 4 such SAS disks in RAID10 configuration for MDT which will give
> you 600 GB of unformatted space. [ ... ]

The MDS is another story indeed.

But I seem to detect here a formulaic approach: the Lustre "don't
need to think" formula seems to be SAS RAID10 for metadata and SATA
RAID6 for data, and this is what is being discussed here, straight
out of the 3-ring binder, without asking any further questions
despite the extreme vagueness of the target. Which is BTW mostly
better than what a site I know got from EMC, whose "don't need to
think" formula at the time seemed to be RAID3 of all things.

Fine (perhaps), but I have a different formulaic approach: without
knowing all the details, and even if in some cases parity RAID does
make sense:

  http://www.sabi.co.uk/blog/0709sep.html#070923b

my "generic" formula (and apparently shared by several academic sites
that use Lustre for HPC storage) is to get Sun X4500 "Thumper"s (or
their more recent equivalents) and RAID10 a bunch of disks inside them
(and then use JFS/XFS or Lustre on top, possibly with DRBD between).
With Lustre it is easy then to aggregate them by spreading the OSTs
across multiple "Thumper"s.

In this case the goal works out to roughly 70MB/s duplex sustained
per TB of storage, which is rather high, so I would use either SSDs
or lots of small fast SAS drives for data (or lots of large SATA ones
with the data partition only in the outer 1/3 of the disk, which some
people call "short stroking").

All of this depends on how big the writes are, how big the files are,
the degree of concurrency, the availability target, and all the other
important aspects of the requirements, most importantly the read and
write access patterns, as writing and reading at the same time implies
quite a bit of head movement.

If we assume the 20MB/s duplex rule per TB that CERN uses for bulk
storage, that translates to 100x SATA 1TB drives, or around 200x 1TB
with RAID10 (spread around 5-6 "Thumper"s). Or perhaps smaller but
higher IOP/s SAS 15k drives. Perhaps large SSDs would be a nice idea.
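The arithmetic behind those counts, as a quick sketch (the
20MB/s-per-TB figure is the CERN rule of thumb quoted above):

  target_mbs=2000     # 2GB/s sustained duplex
  per_tb_mbs=20       # CERN-style rule of thumb: MB/s of duplex per TB
  drives=$(( target_mbs / per_tb_mbs ))      # 1TB drives, no redundancy
  echo "plain:  ${drives} drives"            # -> 100
  echo "RAID10: $(( drives * 2 )) drives"    # -> 200 (mirrored pairs)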

But the details matter a great deal. Your mileage may vary.


