[Lustre-discuss] best practice for lustre cluster startup

Sat Jul 3 14:02:24 PDT 2010

[ ... ]

>> We do the fsck from the command line and look at the output.
>> If there were no filesystem modifications (this is the usual
>> case), we then start the Lustre services interactively.

> Note that if you are not running with writeback cache enabled
> on the disks, then you shouldn't have to run an fsck on the
> filesystems after a crash.

This seems to me extremely bad advice, resting on these
extraordinarily optimistic assumptions:

> That should only be needed if the storage is faulty, or if it
> is using writeback cache without mirroring and battery backup.

This reminds me of the immortal statement "as far as we know in
our datacenter we never had an undetected error".

How do you know that neither "storage is faulty" nor any of the
many other reasons why metadata can get corrupted ever applied?

'fsck' does metadata auditing and garbage collection, and a full
scan, at least every now and then, is essential to give some
confidence that no hidden problem has been eating the metadata.
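
As an illustration only, a periodic audit could be wrapped along
these lines. This is a minimal sketch, not a tested procedure: it
assumes an ldiskfs/ext-style target checked with 'e2fsck' while
the target is unmounted, and the device path in the usage note is
hypothetical. The exit status bits are those listed in the
e2fsck(8) man page; 'decode_fsck_status' and 'audit' are names
made up for this example.

```python
import subprocess

# Exit status bits as listed in the e2fsck(8) man page.
FSCK_BITS = {
    1: "errors corrected",
    2: "errors corrected, system should be rebooted",
    4: "errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "canceled by user request",
    128: "shared library error",
}


def decode_fsck_status(status):
    """Return the conditions encoded in an fsck exit status, lowest bit first."""
    return [text for bit, text in sorted(FSCK_BITS.items()) if status & bit]


def audit(device):
    """Run a forced, read-only check; return (clean, list_of_conditions).

    Assumption: 'device' is an unmounted block device holding the target.
    """
    # -f forces a full check even if the filesystem is marked clean;
    # -n opens it read-only and answers "no" to every question.
    result = subprocess.run(["e2fsck", "-f", "-n", device])
    return result.returncode == 0, decode_fsck_status(result.returncode)
```

Something like audit("/dev/hypothetical_ost0") would then report
whether the scan came back clean, and anything other than a clean
result is a reason to look at the full 'e2fsck' output by hand.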

And if there is a way to at least sample check data integrity
(e.g. run 'gzip -t' on a subset of compressed files) I would run
that periodically too. Experience with storage systems induces
distrust, never mind CERN's findings:

  http://storagemojo.com/2007/09/19/cerns-data-corruption-research/
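
A sample check of that kind might look roughly like the following.
It is only a sketch under stated assumptions: it decompresses each
sampled file with Python's 'gzip' module and discards the output,
which approximates what 'gzip -t' does, and it assumes the
compressed files end in '.gz'. The names 'gzip_ok' and
'sample_check' and the directory in the usage note are
hypothetical.

```python
import gzip
import random
import zlib
from pathlib import Path


def gzip_ok(path, chunk_size=1 << 20):
    """Decompress the whole file and discard the output, like 'gzip -t'."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # Bad header or CRC mismatch (OSError), truncated stream (EOFError),
        # or a corrupt deflate stream (zlib.error).
        return False


def sample_check(root, sample_size, seed=None):
    """Test a random sample of *.gz files under root; return those that fail."""
    files = sorted(Path(root).rglob("*.gz"))
    rng = random.Random(seed)
    sample = rng.sample(files, min(sample_size, len(files)))
    return [p for p in sample if not gzip_ok(p)]
```

Running sample_check("/mnt/hypothetical_lustre", 500) from cron
and alerting on a non-empty result would give at least some
evidence about data, as opposed to metadata, integrity. It only
covers files that carry their own checksum, so it is a sample in
more than one sense.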

Admittedly "happy go lucky", as the investment banks have shown
in the past several years with derivatives, can be a profitable
strategy (until it blows up :->).

[ ... ]