[Lustre-discuss] small/inexpensive cluster design

Andreas Dilger adilger at whamcloud.com
Thu Apr 21 11:08:32 PDT 2011


On 2011-04-21, at 11:52 AM, Andrew Uselton wrote:
>   I had a question from a colleague and did not have a ready answer. 
> What is the community's experience with putting together a small and inexpensive cluster that serves Lustre from (some of) the compute nodes' local disks?
> 
> They have run some simple tests using a) just local disk, b) a simple NFS
> service mounted on the compute nodes, and c) Lustre with the OSS and MDS on
> the same node.
> 
> A typical workload for them is to compile the "Visit" visualization package.
> On a local disk this takes 2 to 3 hours. On NFS it was closer to 24 hours, and
> on the small Lustre example it was about 5 hours. Now they'd like to go a
> little further and try to find a Lustre solution that would improve performance
> as compared to local disk. Their workload will be mostly metadata intensive
> rather than bulk I/O intensive. Is there any experience like that out there?

Running client-on-OSS is a configuration that isn't "officially" supported, but I've done it for ages on my home system.  There are potential memory deadlocks if the client is consuming a lot of RAM and also doing heavy IO, but over time we've removed a lot of them.

That said, there is no harm in trying this, and it is relatively straightforward to set up.  One important factor is to ensure that the OSTs on a node are mounted before the client mountpoint (e.g. via ordering in fstab).
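A minimal sketch of that mount ordering follows; the device names, the MGS NID "head@tcp0", and the fsname "testfs" are all hypothetical placeholders for whatever the actual cluster uses:

```sh
# Mount the local OSTs first, then the client mountpoint.
# In /etc/fstab, the same ordering applies: list OST lines before the client line.
mount -t lustre /dev/sdb1 /mnt/ost0
mount -t lustre /dev/sdc1 /mnt/ost1
mount -t lustre head@tcp0:/testfs /mnt/lustre
```

The `_netdev` option is typically added to the fstab entries so the mounts wait for the network to come up at boot.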

Unfortunately, there is as yet no MDS policy that would preferentially allocate and store file objects on an OST local to the client.  That would be an interesting optimization, and not too hard for someone to implement. 

I also heard once about someone using Lustre OSTs backed by a ramdisk for truly "scratch" filesystems that were quite fast.  The filesystem would lose data whenever a node rebooted, but it is interesting for some limited use cases.
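A hedged sketch of such a ramdisk-backed OST, assuming the MGS/MDS is reachable at "head@tcp0" and the fsname is "scratch" (both placeholders); everything on the device is lost on reboot:

```sh
# Create a 4 GiB ramdisk and format it as an OST (rd_size is in KiB).
modprobe brd rd_size=4194304
mkfs.lustre --ost --fsname=scratch --index=0 --mgsnode=head@tcp0 /dev/ram0
mkdir -p /mnt/ost0
mount -t lustre /dev/ram0 /mnt/ost0
```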

That said, there are also parallel compilation tools, and ccache, that can speed up compiles dramatically, and they don't depend on shared storage at all.  That doesn't help if compilation isn't their only workload, but it is worth mentioning.
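A quick sketch of combining the two, assuming ccache and a GNU toolchain are installed (compiler names and job count are illustrative):

```sh
# Route compiler invocations through ccache, then build in parallel.
export CC="ccache gcc" CXX="ccache g++"
make -j"$(nproc)"                  # first build populates the cache
make clean && make -j"$(nproc)"    # rebuild is served largely from cache
ccache -s                          # show cache hit/miss statistics
```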

> Notes from the one asking the question:
> ----------------------------------------------------------
>  What I would like to do now is to develop the cheapest
>  small cluster possible that still has good I/O
>  performance. NetApp appliances raise the cost
>  significantly. Also, I think the whole system must come
>  out of the box with the application and all dependencies
>  built and good I/O.
> 
> 
>  So one possible way would be a system with a head node and
>  N compute nodes, each with multiple CPUs and cores, of
>  course. I can then imagine a Lustre file system with the
>  MDS on the head node and perhaps M OSSs on the compute
>  nodes, which then serve up their local disks. Of course,
>  now the compute nodes are running both the computational
>  application (on all cores likely) and 0 or 1 OSS.
> 
>  It sounds like from what you are saying that at a minimum
>  I would need two interfaces per node: one over which the
>  MPI communication goes for the apps, and one for serving
>  the Lustre file system on those nodes which are serving
>  that. Is this right?
> 
>  Is this a reasonable direction to go?  (Having both OSS and
>  computation on some nodes.)
> 
>  Are there examples of good systems designs out there?
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss


Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.





