[Lustre-discuss] Aborting recovery

Brian J. Murrell Brian.Murrell at Sun.COM
Fri Mar 6 07:32:35 PST 2009


On Fri, 2009-03-06 at 10:45 +0100, Thomas Roth wrote:
> Thanks Brian.

NP.

> What I meant: the average batch job that wants to read from or write to
> Lustre will abort if a file cannot be accessed. The reason doesn't
> matter to the jobs or the user.

That may be so, but what I am saying is that when a Lustre client wants
to perform an I/O operation on behalf of an application running on that
machine and the target it wants to do the I/O with is down, the Lustre
client will wait and block the application's I/O indefinitely.

That means that unless the application has some kind of timer in it so
that it can abort the read(2)/write(2), the system call that it issued
will simply wait for the Lustre client to complete. If the target that
the Lustre client wanted to do the I/O with never comes back, that wait
lasts forever.
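
To illustrate what "some kind of timer" in the application could look
like, here is a minimal sketch that arms a SIGALRM before a read and
gives up when the alarm fires. The names read_with_timeout and IOTimeout
are made up for this example. It assumes a Unix system and that it runs
in the main thread, and it assumes the signal really does interrupt an
I/O that is blocked inside the Lustre client, which the sketch itself
does not verify.

```python
import os
import signal


class IOTimeout(Exception):
    """Raised when the alarm fires before the read returns (hypothetical name)."""


def _on_alarm(signum, frame):
    # Raising from the handler stops Python from retrying the interrupted call.
    raise IOTimeout("read did not complete in time")


def read_with_timeout(fd, nbytes, seconds):
    """Read up to nbytes from fd, giving up after `seconds` whole seconds."""
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)              # arm the timer
    try:
        return os.read(fd, nbytes)     # blocks until data arrives or the alarm fires
    finally:
        signal.alarm(0)                # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)
```

Without something along these lines, nothing in the application ever
regains control while the read(2) is blocked.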

> So the Lustre client may wait forever, but for the users that is
> irrelevant, they have to resubmit their jobs in any case.

But what signals them to resubmit?  A job waiting on I/O to a missing
target will just "hang" (the proper term is block) until the target
comes back.  Is there some kind of timer that aborts a job if it takes
too long?  If so, then that is pretty orthogonal to the discussion of
what happens to a Lustre client during (a failed) recovery.
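
By "some kind of timer that aborts a job" I mean something outside the
job, roughly like the sketch below. It is only an illustration: run_job
is a made-up name, the time limit is arbitrary, and a real batch system
would have its own mechanism for this. It also assumes the blocked
process can actually be killed; if it cannot, the wrapper will wait as
well.

```python
import subprocess
import sys


def run_job(cmd, limit_seconds):
    """Run cmd; return its exit code, or None if it exceeded the time limit."""
    try:
        done = subprocess.run(cmd, timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the child at this point.
        return None
    return done.returncode


if __name__ == "__main__":
    # Stand-in for a job stuck on I/O: it sleeps far past the limit.
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    if run_job(stuck, 2) is None:
        print("job exceeded its limit; resubmit")
```

The point is that the signal to resubmit comes from the wrapper, not
from the Lustre client.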

> I was wondering whether a client whose transactions have not been
> replayed may get into some zombie state.

No.  It should be evicted (that is why the transactions are not
replayed) and will reconnect once recovery has been aborted and the
target resumes its normal (FULL) state.

> Of course I see in the logs of
> MDS and clients what is supposed to happen, that remaining stuff on the
> client is discarded, inodes deleted etc. In some cases this will not
> work, I'm sure. But then reboot of the client will clean up.

A reboot of the client should never be necessary to return it to the
filesystem.

b.
