[Lustre-discuss] best practice for lustre cluster startup

Fri Jul 2 00:01:35 PDT 2010

On 2010-07-01, at 11:52, Craig Prescott <prescott at hpc.ufl.edu> wrote:
> We do the fsck from the command line and look at the output.  If there 
> were no filesystem modifications (this is the usual case), we then start 
> the Lustre services interactively.  

Note that if you are not running with writeback cache enabled on the disks, then you shouldn't have to run an fsck on the filesystems after a crash. That should only be needed if the storage is faulty, or if it is using writeback cache without mirroring and battery backup. 

> If there were modifications from 
> fsck, we'll generally fsck it again and verify there were no further 
> modifications.  If 'fsck -f -p' fails, we'll fsck interactively or just 
> go whole hog and 'fsck -f -y'.

It's always a good idea to run fsck in a manner that logs the output, either under 'script' or a similar tool.
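
As a rough illustration of the procedure quoted above combined with logging, here is a minimal Python sketch. It assumes the exit status bits documented in the e2fsck man page (1 and 2 mean errors were corrected, 4 and above mean something was left uncorrected or the run itself failed). The function names, the action labels, and the default command are made up for this example and are not part of any Lustre tool.

```python
import subprocess

# Exit status bits that mean e2fsck could not finish the job on its own
# (per the e2fsck man page): 4 = errors left uncorrected, 8 = operational
# error, 16 = usage error, 32 = cancelled, 128 = shared library error.
FSCK_FAILURE_BITS = 4 | 8 | 16 | 32 | 128


def next_step(status):
    """Map an fsck exit status to the next action (labels are hypothetical)."""
    if status < 0:
        return "manual"    # fsck was killed by a signal
    if status == 0:
        return "start"     # no modifications: start the Lustre services
    if status & FSCK_FAILURE_BITS:
        return "manual"    # fsck -p gave up: needs an interactive fsck
    return "recheck"       # only bits 1/2 set: fixed something, run it again


def logged_fsck(device, log_path, fsck_cmd=("e2fsck", "-f", "-p")):
    """Run fsck on device, appending stdout and stderr to log_path."""
    with open(log_path, "ab") as log:
        proc = subprocess.run(
            list(fsck_cmd) + [device],
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    return proc.returncode
```

A caller would do something like next_step(logged_fsck(device, logfile)) with its own device and log path, and only start services automatically on "start". Anything else is left for a person to look at, with the log file as the record of what fsck did.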

> I imagine you could achieve an "automated startup following failure" at 
> least most of the time with an init script that does an 'fsck -f -p' on 
> the associated OSTs/MDT if the node is coming back up from a crash or 
> power outage.  

Note that if you do this you should run fsck under the control of the HA manager, to avoid both nodes running fsck at the same time. The Lustre-patched e2fsck will refuse to run while the filesystem is in use on the other node if you have MMP (multiple mount protection) enabled. MMP is enabled automatically if the Lustre filesystems are formatted with failover enabled, but it can also be enabled manually afterward.
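
If you want a startup script to confirm that MMP is actually on before it relies on it, one option is to look for "mmp" in the "Filesystem features:" line printed by 'dumpe2fs -h'. A minimal Python sketch of that check follows. The sample header text and the volume name in it are invented for illustration, and the function name is hypothetical.

```python
# Invented sample of 'dumpe2fs -h' header output, for illustration only.
SAMPLE_HEADER = """\
Filesystem volume name:   testfs-OST0000
Filesystem features:      has_journal ext_attr dir_index filetype mmp sparse_super
"""


def has_mmp(dumpe2fs_header):
    """Return True if the 'Filesystem features:' line lists the mmp feature."""
    for line in dumpe2fs_header.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Filesystem features":
            # Features are whitespace-separated words after the colon.
            return "mmp" in value.split()
    return False
```

In real use the text would come from running 'dumpe2fs -h' on the OST or MDT device. If the feature is missing, 'tune2fs -O mmp' on the unmounted device is the usual way to turn it on, assuming your e2fsprogs build supports it.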

Also note that if you are using software RAID or LVM, it should likewise only be configured under the control of the HA manager.

> We once ran a cluster with lustre
> We bought from a guy named Buster
> It ran for a year
> with nary a tear
> A complaint we could not muster

Awesome. :-)

> Lisa Giacchetti wrote:
>> Hello,
>> I have recently installed a lustre cluster which is in a test phase now 
>> but will potentially be in 24x7 production if it's accepted.
>> I would like input from the list on what the recommendations/best 
>> practices are for configuration of a lustre cluster startup.
>> Is it advisable to have lustre on the various server pieces 
>> (mgs/mdt/oss's) start automatically? If not why not?
>> If you try to start it and there is a very serious problem will it 
>> abort the startup or just continue on blindly?
>> 
>> Again this is going to need to be a 24x7 service for a compute facility 
>> that has global access (i.e., someone is always
>> up and running something). We'd like to be able to at least get the 
>> service back up in an automated way if at all possible and then debug
>> problems when the support staff are awake/available.
>> 
>> Lisa Giacchetti
>> 
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss