[Lustre-discuss] Lustre Storage Sizing- How?

Atul Vidwansa Atul.Vidwansa at Sun.COM
Sun Jan 10 15:12:04 PST 2010


Peter,

You have provided an excellent explanation. One thing that was clearly 
missing from my answer is "number of Lustre clients required to get 2 
GB/Sec sustained unidirectional throughput".

If the Lustre clients are connected via standard gigabit Ethernet, each
client tops out at about 110 MB/sec, which is almost impossible to
achieve in real life. If we assume a more modest 50 MB/sec per client
over GbE, you will need about 40 clients writing a single 4-way striped
file with a large blocksize to reach 2 GB/sec.

With 10 gigabit Ethernet, you will need at least 4-5 clients to get 2
GB/sec aggregate throughput while writing a single 4-way striped file
with a large blocksize.
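
For reference, a 4-way striped file with a large stripe size can be set
up along these lines (the exact option letters vary a little between
Lustre versions, so check "lfs help setstripe" on your release; paths
purely illustrative):

  client# lfs setstripe -c 4 -s 4M /mnt/lustre/bigfile  # 4 stripes, 4 MB each
  client# lfs getstripe /mnt/lustre/bigfile             # verify the OSTs used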

And, as Peter said, this performance is for a fresh Lustre filesystem 
with only the outer tracks of the disks in use.

On a different note, we have hacked the sgpdd-survey tool that comes 
with lustre-iokit to benchmark the whole disk. You can get details in 
Bugzilla Bug 17218. In my experience, IO performance drops by more than 
60% once the disks start using the inner tracks.
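
If anyone wants to reproduce the whole-disk numbers, the stock script
is driven by environment variables, roughly along these lines (check
the header of the sgpdd-survey script in your lustre-iokit release for
the exact names; device name purely illustrative):

  # NB: the write phase of sgpdd-survey is destructive to the target device
  oss# size=8388608 crghi=16 thrhi=32 scsidevs=/dev/sg0 ./sgpdd-survey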

Cheers,
_Atul

Peter Grandi wrote:
> [ ... ]
>
>   
>>> I am considering a new storage of 30 TB usable space with a 2
>>> GB/s sustained read write performance in clustered mode.
>>>       
>
> This spec is comically vague (see below) and anyhow it is going to be
> quite challenging. Some people who usually know what they are doing
> (CERN) currently expect around 20MB/s duplex transfer rate per TB,
> and I know of one storage system that is getting 3-4GB/s duplex (on a
> fresh install) with around 240 1TB drives (including overheads). If
> you want to guarantee 2GB/s sustained duplex perhaps aiming for
> 3-4GB/s is a good idea. See at the end for a similar conclusion.
>
> So getting to 2GB/s sustained, and duplex at that, is going to require
> quite careful consideration of the particular circumstances if one
> wants it done with just 30TB worth of drives.
>
> As to vagueness, one Lustre storage system I know of was initially
> specified with the same level of (un)detail, and with a bit of
> prodding it got a more definite target performance envelope. That is
> vital.
>
>   
>>> But not able to figure out sizing part of it like what OSS, what
>>> OST and what MDS.
>>>       
>
> Partitioning space between the various types of Lustre data areas is
> the least of your problems. The bigger issue is the structure of the
> storage system on which Lustre runs.
>
>   
>>> Urgent help would be highly appreciable
>>>       
>
> People usually pay for urgent help, especially for difficult cases.
> You should hire a good consultant (e.g. from Sun) who will ask you a
> lot of questions.
>
>   
>> Hi Deval, Lustre storage sizing is largely driven by:
>> * Capacity required
>> * Performance required
>> * Type of workload
>>     
>
> Just about only by the capacity required. The performance required,
> given the type of workload (both static, i.e. the distribution of
> file sizes, and dynamic, i.e. the patterns of access), drives storage
> structure more than storage sizing, and indeed later you talk about
> structure without really considering the workload.
>
> Storage and Lustre filesystem structure (not mere sizing) depends
> greatly on things like size of files, sequentiality of access, size
> of IO operations, number of files being concurrently worked on,
> number of processes concurrently working on the same file, etc.; a list
> of several of these is here:
>
>   http://www.sabi.co.uk/blog/0804apr.html#080415
>
> A pretty vital detail here is how many clients will be in that target
> of 2GB/s duplex, and distinctly for reading and writing. It could be
> 20 1Gb/s clients writing at 100MB/s, and 1 analysis client reading at
> the same time at 2GB/s over multiple 10Gb/s links, for example.
>
> Another interesting dimension is whether a single storage pool is
> necessary or not, or just a single namespace (to some extent Lustre
> is in between) with multiple pools and suitable use of mountpoints:
>
>   http://www.sabi.co.uk/blog/0906Jun.html#090614
>
> It is also important to know the availability requirements for the
> storage system. Does the "sustained" in "2 GB/s sustained" mean for a
> stretch of time or 24x7?
>
> Someone who asks for "Urgent help" should be nice enough to provide
> all these interesting aspects of requirements to the storage
> consultant they are going to hire.
>
>   
>> Lustre 1.8.1.1 has a limit of 8 TB for an individual OST. Let's say
>> you are using SATA disks for the OSTs. A Seagate enterprise 1TB SATA
>> disk can do around 90 MB/sec with a 1 MB blocksize using dd (it can
>> go up to 110 MB/sec if the blocksize is really large).
>>     
>
> Unfortunately only on the outer tracks and on a fresh filesystem. See
> for example:
>
>  https://www.rz.uni-karlsruhe.de/rz/docs/Lustre/ssck_sfs_isc2007
>
>    «Performance degradation on xc2
>           After 6 months of production we lost half of the file system
>           performance
>               Problem is under investigation by HP
>               We had a similar problem on xc1 which was due to fragmentation
>               Current solution for defragmentation is to recreate file systems»
>
> Note that it is not just "due to fragmentation": even without it, as
> a filesystem fills, blocks will (usually) start being allocated from
> the inner tracks and thus the raw transfer rate will eventually
> nearly halve:
>
>   base# disktype /dev/sdd | grep 'device, size'
>   Block device, size 931.5 GiB (1000203804160 bytes)
>   base# dd bs=1M count=1000 iflag=direct if=/dev/sdd of=/dev/null skip=0
>   1000+0 records in
>   1000+0 records out
>   1048576000 bytes (1.0 GB) copied, 9.53604 seconds, 110 MB/s
>   base# dd bs=1M count=1000 iflag=direct if=/dev/sdd of=/dev/null skip=950000
>   1000+0 records in
>   1000+0 records out
>   1048576000 bytes (1.0 GB) copied, 18.3843 seconds, 57.0 MB/s
>
> Amazingly in the "outer tracks and on fresh filesystem" case a modern
> 1TB SATA disk with a reasonable file system type can do 110MB/s even
> with smallish block sizes:
>
>   base# fdisk -l /dev/sdd | grep sdd3
>   /dev/sdd3   *           2        1769    14201460   17  Hidden HPFS/NTFS
>   base# mkfs.ext4 -q /dev/sdd3
>   base# mount -t ext4 /dev/sdd3 /mnt
>   base# dd bs=64k count=100000 conv=fsync if=/dev/zero of=/mnt/TEST
>   100000+0 records in
>   100000+0 records out
>   6553600000 bytes (6.6 GB) copied, 59.7662 seconds, 110 MB/s
>
> (BTW I have used 'ext4' because this is about Lustre; I usually
> prefer JFS for various reasons.)
>
> I have been quite impressed that one can get 90MB/s in the "outer
> tracks and on a fresh filesystem" case from a contemporary low-power
> laptop drive:
>
>   http://www.sabi.co.uk/blog/0906Jun.html#090605
>
>   
>> Assuming that you are looking for RAID6 protection for the OSTs,
>> you need 10 SATA disks to form an 8 TB LUN.
>>     
>
> Why would you assume that? (see below on formulaic approaches) Why
> use parity RAID, which is known to cause performance problems on
> writes unless they are all aligned, when the only detail provided is
> a target for writing? Perhaps it is a DAQ or other recording
> application, given the 2GB/s goal, but perhaps it does not do large
> writes.
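>
> If parity RAID is used anyway, the chunk size and the filesystem
> geometry should at the very least be aligned; a rough sketch for an
> (8+2) set with a 128KiB chunk (device names and mkfs.lustre
> parameters purely illustrative, and the exact mke2fs extended options
> depend on the e2fsprogs in use):
>
>   oss# mdadm --create /dev/md0 --level=6 --raid-devices=10 \
>            --chunk=128 /dev/sd[b-k]
>   # stride = 128KiB chunk / 4KiB block = 32; stripe-width = 32 x 8 data disks
>   oss# mkfs.lustre --ost --fsname=testfs --mgsnode=<mgs NID> \
>            --mkfsoptions='-E stride=32,stripe-width=256' /dev/md0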
>
>   
>> You will need 4 such OSTs to give you 32 TB unformatted space.
>> Let's consider performance:
>>     
>
>   
>> Ideally, you should get 720 MB/Sec/OST [ 90 MB/sec/disk X 8 data
>> disks in (8+2) RAID6 set]. But you have to cater for overhead of
>> software/hardware RAID and limits of SAS PCIe HCA (or FC hardware
>> RAID HCA). A 4gbps FC HCA tops out at 500 MB/Sec so you need 5-6
>> FC HCAs to utilize storage bandwidth of 4 RAID6 OSTs [Total
>> bandwidth = 4 X 720 MB/Sec/OST = 2.8 GB/Sec].
>>     
>
> That "4 X 720MB/Sec" applies only if there are at least 4 stripes per
> file and they get written in bulk, as you imply immediately below, or
> there are at least 4 files being written at the same time and they
> end up on different OSTs. The very vague requirement for "clustered
> mode" does not quite make clear which one.
>
>   
>> So, now you have a storage system that delivers 32 TB unformatted
>> space and 2.8 GB/Sec of performance for large sequential read/write
>> workload.
>>     
>
> The read and write performance may be quite different on many
> workloads because of the RAID6 (stripe alignment), even if "large
> sequential" is probably going to be fine, and again only if the
> concurrency is just right, on the outer tracks, on a fresh
> filesystem.
>
>   
>> If you are planning to have mixed or small io workload and still
>> want to achieve 2 GB/Sec throughput, you have to double the specs.
>>     
>
> Why just double? Why not consider other storage systems like RAID10
> or SSDs?
>
>   
>> Small, random IO (think of home directories) kills storage
>> performance.
>>     
>
> Depends on the storage system...
>
>   
>> Let's size the MDS now.
>>     
>
>   
>> It is a good idea to use FC or SAS disks for the MDS as they spin
>> at a higher rate and have better IOPS performance. For example,
>> let's consider Seagate enterprise 15K RPM 300 GB SAS disks. You can
>> put 4 such SAS disks in a RAID10 configuration for the MDT, which
>> will give you 600 GB of unformatted space. [ ... ]
>>     
>
> The MDS is another story indeed.
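>
> (Mechanically that part is simple enough; a minimal sketch with MD
> RAID10, device names purely illustrative, and a real deployment would
> put rather more thought into the controller and the journal:)
>
>   mds# mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/sd[b-e]
>   mds# mkfs.lustre --mgs --mdt --fsname=testfs /dev/md1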
>
> But I seem to detect here a formulaic approach: the Lustre "don't
> need to think" approach seems to be SAS RAID10 for metadata and SATA
> RAID6 for data, and this is what is being discussed here, straight
> out of the 3-ring binder, without asking any further questions
> despite the extreme vagueness of the target. Which is mostly better,
> BTW, than what a site I know got from EMC, whose "don't need to
> think" formula at the time seemed to be RAID3 of all things.
>
> Fine (perhaps), but I have a different formulaic approach: without
> knowing all the details, and even if in some cases parity RAID does
> make sense:
>
>   http://www.sabi.co.uk/blog/0709sep.html#070923b
>
> my "generic" formula (and apparently shared by several academic sites
> that use Lustre for HPC storage) is to get Sun X4500 "Thumper"s (or
> their more recent equivalents) and RAID10 a bunch of disks inside them
> (and then use JFS/XFS or Lustre on top, possibly with DRBD between).
> With Lustre it is easy then to aggregate them by spreading the OSTs
> across multiple "Thumper"s.
>
> In this case the goal is roughly 70MB/s duplex sustained per TB of
> storage, which is rather high, so I would use either SSDs or lots of
> small, fast SAS drives for data (or lots of large SATA ones with the
> data partition only in the outer 1/3 of the disk, which some people
> call "short stroking").
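>
> ("Short stroking" is just a partitioning exercise, e.g. keeping the
> data partition within the outer third of the drive; device name
> purely illustrative:)
>
>   base# parted -s /dev/sdd mklabel msdos
>   base# parted -s /dev/sdd mkpart primary 0% 33%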
>
> Which of these to choose depends on how big the writes are, how big
> the files are, the degree of concurrency, the availability target,
> and all the other important aspects of the requirements, most
> importantly the read and write access patterns, as writing and then
> reading at the same time implies quite a bit of head movement.
>
> If we assume the 20MB/s duplex rule per TB that CERN uses for bulk
> storage, that translates to 100x SATA 1TB drives, or around 200x 1TB
> with RAID10 (spread across 5-6 "Thumper"s). Or perhaps smaller but
> higher IOP/s SAS 15k drives. Perhaps large SSDs would be a nice idea.
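>
> (The arithmetic behind those figures, as a quick sanity check:)
>
>   base# echo '2048 / 20' | bc   # 2GB/s target over 20MB/s per TB => TB of drives
>   102
>   base# echo '102 * 2' | bc     # doubled again for RAID10 mirroring
>   204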
>
> But the details matter a great deal. Your mileage may vary.