[Lustre-discuss] short writes

Christopher J. Morrone morrone2 at llnl.gov
Thu Jul 15 16:56:35 PDT 2010


On 07/08/2010 04:51 PM, John Hammond wrote:

>> How about a network file system waiting for server failover
>> (especially if it is not automatic)?
>
> That's not indefinite.  The FS is waiting for something which will
> eventually occur.  (Assuming it is correctly administered.)

That IS indefinite.  Indefinite just means that the limit is vague
and/or unknown, as opposed to having a clear and well-defined bound.

In the IO context, any operation that is unbounded (indefinite) may take
a very long time in human terms, and therefore should be interruptible.
It is just not reasonable to have a process stuck unkillable for days.
On the other hand, we don't want it timing out either if the data is
valuable and we are willing to wait a day or two for hardware repairs.

Even ignoring the fact that Lustre's behavior is allowed by the POSIX 
spec, I believe that Lustre is doing the Right Thing.

If a job hangs on a write because servers are unavailable, it should 
hang indefinitely until the server is restored, or until a human 
interrupts the process with a signal.

Only a human can determine how valuable that job's output is.  If the
human determines that the data is very important and the job must
finish, then they leave the job hanging and go fix the servers.  If they
decide that the data is easily reproducible and the compute cluster's
time is better spent running another job that doesn't require the down
filesystem, then they have the ability to abort the operation with a
signal.

Now all that said, there may be an argument to be made that SIGSTOP and 
SIGCONT should not be signals that interrupt Lustre client operations.
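
For an application on the receiving end of such an interrupted write,
the usual defense is to treat a short count as "keep going" rather than
as success or failure.  A minimal sketch in Python, assuming a plain
file descriptor; write_all is a hypothetical helper name, not part of
Lustre or of any library:

```python
import os

def write_all(fd, data):
    """Write every byte of data to fd, resuming after short writes.

    Hypothetical helper for illustration only.
    """
    view = memoryview(data)
    total = 0
    while total < len(view):
        # os.write may accept fewer bytes than requested, for example
        # when a signal arrives after part of the data was transferred.
        # Python 3.5+ retries a write that failed with EINTR on its own
        # (PEP 475), so only the short count needs handling here.
        n = os.write(fd, view[total:])
        total += n
    return total
```

A caller would use write_all(fd, buf) in place of a bare os.write(fd,
buf).  Note that this loop still blocks for as long as the underlying
write blocks; it only covers the case where the write returns early.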


