[Lustre-discuss] NFS vs Lustre

Mon Aug 31 14:03:53 PDT 2009

On Mon, Aug 31, 2009 at 04:50:02PM -0400, Paul Nowoczynski wrote:
> Yes this is the case on server failure but I think the true similarity 
> between lustre and a locally mounted filesystem lies in the failure of a 
> client holding dirty pages.  Please correct me if I'm wrong but data 
> loss will occur should the client fail after close() but prior to the 
> set of dirty pages being committed on the OST.

The client will have DLM locks outstanding if it has dirty data, so the
client's death can be used to detect that its open, dirty files are now
potentially corrupted.

Client death with dirty data is not all that different from process
death with dirty data in user-land.  Think of an application that does
write(2), write(2), close(2), _exit(2), but dies between writes.
Compare that to a client that dies after flushing the first of those
writes but before flushing the second, though after the application
calls close(2).  Nothing special is usually done in the first case, even
though the OS could flag the affected file as potentially corrupted if
the process had byte-range locks outstanding.
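
To make the window concrete, here is a rough sketch (Python only for
brevity; the helper name write_records and the file layout are made up
for illustration).  A successful close(2) by itself does not promise
that dirty pages have reached stable storage; an fsync(2) before
close(2) asks for exactly that.  How long that takes, and what it costs,
on a given Lustre setup is not something this sketch claims to show.

```python
import os

def write_records(path, records, durable=True):
    """Append each record with write(2); optionally fsync(2) before close(2)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for rec in records:
            # os.write may write fewer bytes than requested, so loop
            # until the whole record has been handed to the kernel.
            view = memoryview(rec)
            while view:
                n = os.write(fd, view)
                view = view[n:]
        if durable:
            # Without this call, returning from close() says nothing
            # about whether the dirty pages have been committed.
            os.fsync(fd)
    finally:
        os.close(fd)
```

With durable=False the data may still be sitting in the client's cache
when the function returns, which is the case under discussion.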

I don't think Lustre actually does anything to mark files that it could
detect as potentially corrupted.  Some
applications can recover automatically -- think of databases, such as
SQLite3, or think of plain log files.  Other applications might well be
affected.  Since corruption detection in this case is heuristic, and
since the impact will vary by application, I don't think there's an easy
answer as to what Lustre ought to do about it.  Ideally we could track
the "potentially corrupt" status as an advisory metadata item that
could be fetched with a stat(2)-like system call, and have applications
reset it when they recover.
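
For the plain log file case, automatic recovery can be as simple as
discarding a torn final record.  A minimal sketch in Python, assuming
newline-terminated records and a hypothetical helper name (recover_log).
It is not how SQLite3 or any particular application does it, and it can
only spot a truncated tail, not damage in the middle of the file:

```python
import os

def recover_log(path):
    """Truncate a newline-delimited log after its last complete record.

    Returns the number of bytes dropped from the tail.
    """
    with open(path, "r+b") as f:
        # Reads the whole file; fine for a sketch, not for huge logs.
        data = f.read()
        # Offset just past the last newline; 0 if there is no newline.
        keep = data.rfind(b"\n") + 1
        dropped = len(data) - keep
        if dropped:
            f.truncate(keep)
            f.flush()
            os.fsync(f.fileno())
        return dropped
```

An application would run something like this when it reopens the log,
which is also the point at which it could reset the advisory flag.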

Nico
--