[Lustre-discuss] Zero-admin Lustre?

Tue Dec 20 14:07:07 PST 2011

From my experience, I'll try to answer the questions, and then the experts will probably correct me.

So my question is, once a Lustre cluster is set up and configured, how much administration does it require in practice?

	Not much, assuming users don't abuse it, fill it up, etc. Beyond that it is standard sysadmin work on the nodes: updates and so on.

More precisely:  Apart from hardware failures, how often (and under what circumstances) should I expect the file system itself to lose integrity and require manual intervention?
	By design, none. In reality there are times when a filesystem may need some help. Mostly these are caused by hardware issues.

For example, if someone hard resets a client, will the file system always recover automatically?  How about if they reset a server (MDS or OSS)?  (Obviously unwritten data at the time of such resets could be lost or otherwise damaged; that is not what I mean.  I mean, should I expect to need to manually run fsck, or to track down and release locks, or to do anything else to restore the consistency of the cluster?)

	Apart from the normal loss of in-flight data, once the servers are back online the cluster should recover into a usable state. Given some help, the filesystem can come back without an OSS and run until it can be repaired (with empty files for the missing OST data).

For another example, if the file system runs out of space, can I recover from that merely by deleting some files?  Or would additional Lustre-specific action be needed to restore the cluster to a consistent state?
	Yes, deleting files frees up space. But you have to consider that there are many places you can run out of space. Each OST has its own space limit, and if a file you are appending to resides on a full one, removing files from other OSTs will not help. New files will attempt to land on less-used OSTs.
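
	To make the per-OST point concrete, here is a rough Python sketch that picks out nearly full OSTs from `lfs df`-style output. The sample text, the helper names (parse_ost_usage, nearly_full) and the 90% threshold are all made up for illustration, and the column layout is an assumption, so check it against what `lfs df` prints on your own system before relying on it.

```python
def parse_ost_usage(lfs_df_text):
    """Return {target_name: percent_used} for the OST rows of `lfs df`-style text.

    Assumes (not verified against any particular Lustre release) that each OST
    row has the usual df columns and ends with a tag like /mnt/lustre[OST:0].
    """
    usage = {}
    for line in lfs_df_text.splitlines():
        fields = line.split()
        # Skip the header, MDT rows, blank lines and the summary line.
        if len(fields) < 6 or "[OST:" not in fields[-1]:
            continue
        usage[fields[0]] = int(fields[4].rstrip("%"))
    return usage


def nearly_full(usage, threshold=90):
    """Names of OSTs at or above `threshold` percent used (threshold is arbitrary)."""
    return sorted(name for name, pct in usage.items() if pct >= threshold)


# Illustrative numbers only, not output captured from a real cluster.
SAMPLE = """\
UUID                   1K-blocks        Used   Available Use% Mounted on
lustre-MDT0000_UUID      4399992      264016     3884488   6% /mnt/lustre[MDT:0]
lustre-OST0000_UUID     15498960    14723000      775960  95% /mnt/lustre[OST:0]
lustre-OST0001_UUID     15498960     1179652    14319308   8% /mnt/lustre[OST:1]

filesystem summary:     30997920    15902652    15095268  52% /mnt/lustre
"""

if __name__ == "__main__":
    print(nearly_full(parse_ost_usage(SAMPLE)))
```

	In this made-up case the filesystem as a whole is about half empty, yet an append to a file living on OST0000 can still fail. `lfs getstripe` on a file shows which OSTs it is striped over, which tells you where space actually has to be freed.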

In general, how "self-healing" is a Lustre cluster?
It does pretty well once you get past the hardware issues and it's configured correctly. Most of the time I feel very comfortable telling other admins to just reboot a node, or to fail over a node for maintenance, since Lustre recovers and users only see short pauses in I/O, if at all.

Evan Felix