[Lustre-discuss] best practice for lustre cluster startup

Thu Jul 1 10:52:41 PDT 2010

Hi Lisa;

We don't start the services automatically on our servers.  We don't have 
so many Lustre servers that this is a big problem (17 total), and it is 
pretty rare for one of them to go down unexpectedly.

If one of our Lustre server nodes does go down unexpectedly, we fsck the 
associated OSTs/MDT before starting up Lustre services again.  I think 
you will want to do the same.

We do the fsck from the command line and look at the output.  If there 
were no filesystem modifications (this is the usual case), we then start 
the Lustre services interactively.  If there were modifications from 
fsck, we'll generally fsck it again and verify there were no further 
modifications.  If 'fsck -f -p' fails, we'll fsck interactively or just 
go whole hog and 'fsck -f -y'.
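
As a rough sketch of how that decision could be written down: fsck reports 
what it did through a bitmask exit status, documented in fsck(8) and 
e2fsck(8). The function name and the action labels below are hypothetical, 
chosen only to mirror the steps described above.

```python
# Exit status bits as documented in fsck(8) / e2fsck(8).
FSCK_CORRECTED = 1      # filesystem errors were corrected
FSCK_REBOOT = 2         # system should be rebooted
FSCK_UNCORRECTED = 4    # filesystem errors left uncorrected
FSCK_OPERATIONAL = 8    # operational error
FSCK_USAGE = 16         # usage or syntax error
FSCK_CANCELLED = 32     # checking cancelled by user request
FSCK_LIBRARY = 128      # shared-library error


def next_step(exit_status):
    """Map an 'fsck -f -p' exit status to a next action.

    The returned labels are hypothetical names for the manual steps:
      "start-lustre" : no modifications, safe to start services
      "fsck-again"   : fsck modified the filesystem, run it once more
      "needs-admin"  : fsck could not finish the job, look at it by hand
    """
    if exit_status == 0:
        return "start-lustre"
    # Anything beyond "corrected" / "reboot suggested" means fsck gave up.
    if exit_status & ~(FSCK_CORRECTED | FSCK_REBOOT):
        return "needs-admin"
    return "fsck-again"
```

For example, next_step(1) asks for a second pass, while next_step(4) means 
it is time to fsck interactively or fall back to 'fsck -f -y'.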

I imagine you could achieve an "automated startup following failure" at 
least most of the time with an init script that does an 'fsck -f -p' on 
the associated OSTs/MDT if the node is coming back up from a crash or 
power outage.  If there aren't any modifications made by fsck, your init 
script could mount the storage.  If 'fsck -f -p' bails out, you might 
send out an "I need help" email or something.
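
A minimal sketch of that idea, not a tested init script. Assumptions: the 
function name, the device and mount point names, and the 'notify' callback 
(a stand-in for sending email) are all hypothetical; how the node decides 
it is recovering from a crash is left to the caller; and any non-zero fsck 
status is conservatively treated as "ask for help" rather than retried.

```python
import subprocess


def start_targets(targets, unclean_shutdown, run=subprocess.call, notify=print):
    """Check and mount Lustre targets; return the devices that were mounted.

    targets          : list of (device, mountpoint) pairs (hypothetical names)
    unclean_shutdown : True if the node is coming back from a crash or outage
    run              : callable taking an argv list and returning an exit status
    notify           : callable taking a message; stands in for an email alert
    """
    mounted = []
    for device, mountpoint in targets:
        if unclean_shutdown:
            status = run(["fsck", "-f", "-p", device])
            if status != 0:
                # fsck modified something or bailed out: leave it for a human.
                notify("fsck -f -p on %s exited with status %d; not mounting"
                       % (device, status))
                continue
        # Mounting a target with type 'lustre' starts its Lustre service.
        if run(["mount", "-t", "lustre", device, mountpoint]) == 0:
            mounted.append(device)
        else:
            notify("mount of %s on %s failed" % (device, mountpoint))
    return mounted
```

Passing 'run' and 'notify' as arguments makes it possible to dry-run the 
logic with stand-in functions before pointing it at real storage.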

Cheers,
Craig Prescott
UF HPC Center

We once ran a cluster with lustre
We bought from a guy named Buster
It ran for a year with nary a tear
A complaint we could not muster

Lisa Giacchetti wrote:
> Hello,
>  I have recently installed a lustre cluster which is in a test phase now 
> but will potentially be in 24x7 production if it's accepted.
>  I would like input from the list on what the recommendations/best 
> practices are for configuration of a lustre cluster startup.
>  Is it advisable to have lustre on the various server pieces 
> (mgs/mdt/oss's) start automatically? If not, why not?
>  If you try to start it and there is a very serious problem, will it 
> abort the startup or just continue on blindly?
> 
>  Again this is going to need to be a 24x7 service for a compute facility 
> that has global access (i.e., someone is always
>  up and running something). We'd like to be able to at least get the 
> service back up in an automated way if at all possible and then debug
>  problems when the support staff are awake/available.
> 
>  Lisa Giacchetti
> 