[Lustre-discuss] best practice for lustre cluster startup
Craig Prescott
prescott at hpc.ufl.edu
Thu Jul 1 10:52:41 PDT 2010
Hi Lisa;
We don't start the services automatically on our servers. We don't have
so many Lustre servers that this is a big problem (17 total), and it is
pretty rare for one of them to go down unexpectedly.
If one of our Lustre server nodes does go down unexpectedly, we fsck the
associated OSTs/MDT before starting up Lustre services again. I think
you will want to do the same.
We do the fsck from the command line and look at the output. If there
were no filesystem modifications (this is the usual case), we then start
the Lustre services interactively. If there were modifications from
fsck, we'll generally fsck it again and verify there were no further
modifications. If 'fsck -f -p' fails, we'll fsck interactively or just
go whole hog and 'fsck -f -y'.
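The three outcomes above can be told apart from the fsck exit status. Here
is a minimal sketch, assuming the e2fsck exit codes documented in its man
page (0 = no errors, 1 or 2 = errors corrected, 4 and above = errors left
uncorrected or fsck itself failed). The function name classify_fsck_status
is made up for illustration, and treating status 0 as "no modifications" is
an assumption.

```shell
# Classify an e2fsck exit status into the three cases described above.
# Assumed meanings (from the e2fsck man page):
#   0      no errors found
#   1, 2   errors were corrected (2 also advises a reboot); 3 is both bits
#   >= 4   errors left uncorrected, or fsck itself failed
classify_fsck_status() {
    case "$1" in
        0)     echo clean ;;     # nothing changed: start Lustre services
        1|2|3) echo modified ;;  # fsck changed something: run it again
        *)     echo failed ;;    # fsck interactively, or 'fsck -f -y'
    esac
}

# Hypothetical usage on one target; /dev/sdX is a placeholder device.
# e2fsck -f -p /dev/sdX
# classify_fsck_status $?
```

We still read the fsck output by hand; a helper like this only sorts the
result into "start", "run it again", and "needs a person".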
I imagine you could achieve an "automated startup following failure" at
least most of the time with an init script that does an 'fsck -f -p' on
the associated OSTs/MDT if the node is coming back up from a crash or
power outage. If there aren't any modifications made by fsck, your init
script could mount the storage. If 'fsck -f -p' bails out, you might
send out an "I need help" email or something.
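A rough sketch of what such an init script helper might look like follows.
This is not something we run. The names check_and_mount and notify_admin,
the overridable FSCK_CMD/MOUNT_CMD/NOTIFY_CMD variables, and the device and
mount point are all hypothetical. It assumes fsck exit status 0 means "no
modifications", and it takes the conservative route of asking for help on
any other status. It does not try to detect whether the node is coming back
from a crash; that check is left out.

```shell
# Commands are overridable so the logic can be dry-run with stubs.
FSCK_CMD=${FSCK_CMD:-e2fsck}
MOUNT_CMD=${MOUNT_CMD:-mount}
NOTIFY_CMD=${NOTIFY_CMD:-notify_admin}

# Hypothetical notifier; replace with whatever alerting the site uses
# (for example, piping the message to a mail command).
notify_admin() {
    echo "I need help: $*" >&2
}

# check_and_mount DEVICE MOUNTPOINT
# Runs 'fsck -f -p' on DEVICE and mounts it as a Lustre target only when
# fsck exits with status 0. Any other status leaves the target unmounted
# and sends a notification instead.
check_and_mount() {
    cm_dev=$1
    cm_mnt=$2

    $FSCK_CMD -f -p "$cm_dev"
    cm_rc=$?

    if [ "$cm_rc" -eq 0 ]; then
        $MOUNT_CMD -t lustre "$cm_dev" "$cm_mnt"
        return $?
    fi

    $NOTIFY_CMD "fsck -f -p on $cm_dev exited with status $cm_rc; not mounted"
    return 1
}

# Hypothetical call from an init script; both paths are placeholders.
# check_and_mount /dev/sdX /mnt/ost0
```

Keeping the "modified" case out of the automatic path means the target
stays down until someone has run fsck again and looked at the output.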
Cheers,
Craig Prescott
UF HPC Center
We once ran a cluster with lustre
We bought from a guy named Buster
It ran for a year with nary a tear
A complaint we could not muster
Lisa Giacchetti wrote:
> Hello,
> I have recently installed a lustre cluster which is in a test phase now
> but will potentially be in 24x7 production if it's accepted.
> I would like input from the list on what the recommendations/best
> practices are for configuration of a lustre cluster startup.
> Is it advisable to have lustre on the various server pieces
> (mgs/mdt/oss's) start automatically? If not why not?
> If you try to start it and there is a very serious problem will it
> abort the startup or just continue on blindly?
>
> Again this is going to need to be a 24x7 service for a compute facility
> that has global access (i.e. someone is always
> up and running something). We'd like to be able to at least get the
> service back up in an automated way if at all possible and then debug
> problems when the support staff are awake/available.
>
> Lisa Giacchetti
>