[Lustre-discuss] short writes
Christopher J. Morrone
morrone2 at llnl.gov
Thu Jul 15 16:56:35 PDT 2010
On 07/08/2010 04:51 PM, John Hammond wrote:
>> How about a network file system waiting for server failover
>> (especially if it is not automatic)?
>
> That's not indefinite. The FS is waiting for something which will
> eventually occur. (Assuming it is correctly administered.)
That IS indefinite. Indefinite just means that the limit is vague
and/or unknown, as opposed to having a clear and well-defined bound.
In the IO context, any operation that is unbounded (indefinite) may take
a very long time in human terms, and therefore should be interruptible.
It is just not very reasonable to have a process stuck unkillable for
days. But on the other hand, we don't want it timing out either if the
data is valuable and we are willing to wait a day or two for hardware
repairs.
Even ignoring the fact that Lustre's behavior is allowed by the POSIX
spec, I believe that Lustre is doing the Right Thing.
If a job hangs on a write because servers are unavailable, it should
hang indefinitely until the server is restored, or until a human
interrupts the process with a signal.
Only a human can determine how valuable that job's output is. If the
human determines that the data is very important, and must finish, then
they leave the job hanging and go fix the servers. If they decide that
the data is easily reproducible and the compute cluster is better spent
running another job that doesn't require the down filesystem, then they
have the ability to abort the operation with a signal.
Now all that said, there may be an argument to be made that SIGSTOP and
SIGCONT should not be signals that interrupt Lustre client operations.
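
As a footnote to the POSIX point above: the spec allows write() to return
fewer bytes than requested when a signal arrives after some data has
already been transferred, so an application that cares has to check the
return value and retry with the remainder. Below is a minimal sketch of
such a loop in Python. The helper name write_all is made up for
illustration and is not part of Lustre or any library. Also note that
Python 3.5 and later retries EINTR on its own unless the signal handler
raises, so this loop only deals with the short-count case.

```python
import os


def write_all(fd, data):
    """Write every byte of data to fd, retrying after short writes.

    Hypothetical helper for illustration only.
    """
    view = memoryview(data)
    total = 0
    while total < len(view):
        # os.write returns the number of bytes actually accepted, which
        # may be less than the number offered (a short write).
        written = os.write(fd, view[total:])
        if written == 0:
            # Assumption: treat a zero-byte write of a non-empty buffer
            # as an error rather than spinning forever.
            raise OSError("write returned 0 bytes")
        total += written
    return total
```

If the user has decided to abort the job with a signal, a handler that
raises will propagate out of os.write and end the loop, which is the
behavior argued for above.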