[Lustre-discuss] Zero-admin Lustre?

Patrick J. LoPresti lopresti at gmail.com
Thu Dec 8 10:54:28 PST 2011

Hello.  I am considering using Lustre for my application, which is a
"black box" storage component of a larger system.  There could be tens
or hundreds of such systems in deployment, and their storage
components need to "just work" with minimal caretaking.

So my question is, once a Lustre cluster is set up and configured, how
much administration does it require in practice?

More precisely:  Apart from hardware failures, how often (and under
what circumstances) should I expect the file system itself to lose
integrity and require manual intervention?

For example, if someone hard resets a client, will the file system
always recover automatically?  How about if they reset a server (MDS
or OSS)?  (Obviously unwritten data at the time of such resets could
be lost or otherwise damaged; that is not what I mean.  I mean, should
I expect to need to manually run fsck, or to track down and release
locks, or to do anything else to restore the consistency of the
file system?)

For another example, if the file system runs out of space, can I
recover from that merely by deleting some files?  Or would additional
Lustre-specific action be needed to restore the cluster to a
consistent state?

In general, how "self-healing" is a Lustre cluster?

I am interested in both theoretical (i.e. design goals) and practical
(i.e. experience-based) answers to these questions.

Thank you.

 - Pat

