[Lustre-discuss] How's this config for our Lustre setup?

Ashley Pittman apittman at ddn.com
Fri May 27 02:20:58 PDT 2011


On 27 May 2011, at 08:02, Kek wrote:

> Hi All
>  
> We're a research lab new to Lustre and are purchasing a small HPC cluster.  We wish to seek your comments and help on sizing the hardware.
> So far we plan to have the following:
> - 1 Head node, 2 x 6 core intel X5670, 32GB RAM, 2 x 300 GB SAS 10krpm HDDs
> - 1 MDS+MDT node, 2 x 6 core intel X5670, 24GB RAM, 2 x 500GB SATA 7.2krpm HDDs
> - 2 Storage nodes (OSS collocated with OST), each with: 2 x 4 core intel E5620, 24GB RAM, 14 x 600GB SAS 15krpm HDDs (raw 8.4TB)
> - 12 compute nodes, each with 2 x 6 core intel X5670.
> - Infiniband QDR connectivity
>  
> - Average file size: 20MB
> - Usage pattern: running parallel (MPI) models like WRF (weather model), POM (hydrodynamics model) etc
>  
> Questions:
> - Is the above hardware config all right?

It doesn't sound unreasonable for an entry-level system, though a few things stick out at me:

The T in MDT and OST stands for Target and refers to a single storage device; the S in MDS and OSS stands for Server and refers to a physical machine.  It's therefore correct to say that the MDS hosts the MDT and the OSSs host the OSTs.  Talking about an OSS collocated with an OST does not make much sense.

The OSSs will benefit from having lots of RAM, so 24GB is good; the MDS benefits less.  That said, RAM is cheap so there wouldn't be much of a saving in cutting it down.

Why do you have 2x HDDs for the MDT?  It will use a single device; having two only makes sense if you are using RAID 1, which you should be.  See below.

SAS is probably overkill for the OST drives at this scale; SATA will be cheaper and higher capacity.

Now for the main point arising from the above config.

Data will be striped over all OSTs, so if you get a disk failure then the data stored on that OST will be lost forever.  As files are likely to be striped across several OSTs, you will likely lose a considerable percentage of all your data for each and every disk failure (think 80% plus - you'll be able to recover only a subset of small files and parts of larger files).  If you assume that the MTBF for a hard drive is four years (48 months) and you have 30 drives, then you can expect at least one failure every two months.
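
As a rough sanity check on that failure rate (assuming independent failures and the 48-month MTBF above):

    expected failures per month = number of drives / MTBF
                                = 30 drives / 48 months
                                ~= 0.63 failures per month, i.e. one roughly every 7 weeks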

Lustre itself doesn't protect against this; it simply works at the device level.  To provide some resilience against disk failure you should use RAID at some level, and for a system this size software RAID 1 or similar would be acceptable - see the sketch below.  As above, if you use SATA disks rather than SAS you may find this is both cheaper and more resilient, and still gives you more capacity.
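
To make that concrete, here is a minimal sketch of one OST built on md software RAID 1 - the device names, fsname and MGS NID are all hypothetical, so adjust for your hardware:

    # mirror two disks with md RAID 1 (hypothetical devices)
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

    # format the mirror as an OST, point it at your MGS, and mount it
    mkfs.lustre --fsname=testfs --ost --index=0 --mgsnode=10.0.0.1@o2ib /dev/md0
    mount -t lustre /dev/md0 /mnt/ost0

Bear in mind RAID 1 halves your usable capacity - 14 drives per OSS gives 7 mirror pairs, i.e. 4.2TB usable per node with the SAS option - which is another argument for the cheaper, larger SATA drives.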

> - What is the impact of putting the 2nd (failover) MDS+MDT on the Head Node?

I'm not sure I understand what you mean here; unless the metadata is replicated between the MDT device and a backup device on the head node, this wouldn't be possible.  You could do this using DRBD, I believe, although the standard approach is to use external RAID controllers multiply connected to both machines.  A rough outline of the DRBD route is sketched below.
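
For reference, a rough sketch of the DRBD route, assuming a replicated resource named 'mdt' has already been defined in /etc/drbd.conf on both the MDS and the head node (all names here are hypothetical):

    # initialise and bring up the replicated device on both nodes
    drbdadm create-md mdt
    drbdadm up mdt

    # on whichever node is to be primary, promote it and build the MDT on it
    drbdadm primary mdt    # the first sync may need to be forced; syntax varies by DRBD version
    mkfs.lustre --fsname=testfs --mdt --mgs /dev/drbd0
    mount -t lustre /dev/drbd0 /mnt/mdt

Even then, multiply-connected external RAID remains the more common and better-tested approach.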

> - Is Lustre easy to maintain?

Generally yes, although the learning curve can be steep at the beginning.

> - Is Lustre reasonably stable and problem free?

Yes, it's a lot better than it used to be.

Ashley.

