[Lustre-discuss] short writes

Peter Grandi pg_lus at lus.for.sabi.co.UK
Sat Jul 17 02:10:55 PDT 2010


>>>> [ ... whether apps can rely on the kernel always returning a
>>>> full read or write count on file IO except at EOF on read ... ]

As has been remarked, the answer is NO.

BTW this question is not the same as the "interruptible" one.
There is a difference between the kernel being allowed to return
less from read(2) or write(2) than the process requested and
those calls being apparently atomic.

The kernel may always return a count of bytes read or written
less than requested, for any reason whatsoever, even if a signal
has not interrupted the operation.
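
By way of illustration, here is a minimal sketch (in C; the
helper name 'write_all' is mine, not from any standard or from
Lustre) of the kind of loop an application needs if it really
wants all its bytes out: it retries on EINTR and simply carries
on after a short write.

    #include <errno.h>
    #include <unistd.h>

    /* Keep calling write(2) until every byte has been accepted
       or a real error occurs.  Returns 0 on success, -1 on
       error with 'errno' set. */
    static int write_all(int fd, const char *buf, size_t count)
    {
        while (count > 0) {
            ssize_t n = write(fd, buf, count);
            if (n < 0) {
                if (errno == EINTR)
                    continue;      /* interrupted: just retry */
                return -1;         /* genuine error */
            }
            /* A short write is not an error: continue from
               where the kernel stopped. */
            buf   += n;
            count -= (size_t) n;
        }
        return 0;
    }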

Applications have to deal with it. Most applications are written
wrong (and in many other ways: how many even check, rather than
just '(void)', the return code from 'close', never mind call
'flock' or 'fsync'?), and as many kernel writers say "userspace
sucks", but most application mistakes matter only in infrequent
cases, and when these happen users just shrug.
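
And a small sketch of the kind of checking just mentioned:
flushing and closing a file without throwing the return codes
away. Again only an illustration ('finish_file' is my own name),
and on a network filesystem even this only reports what the
client kernel knows.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Flush and close a descriptor, reporting errors instead of
       discarding them; deferred write errors are often reported
       only here, so '(void) close(fd)' silently loses them. */
    static int finish_file(int fd, const char *name)
    {
        int rc = 0;
        if (fsync(fd) < 0) {    /* ask for data on stable storage */
            fprintf(stderr, "fsync %s: %s\n", name, strerror(errno));
            rc = -1;
        }
        if (close(fd) < 0) {    /* close(2) can fail too */
            fprintf(stderr, "close %s: %s\n", name, strerror(errno));
            rc = -1;
        }
        return rc;
    }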

The reasons why the semantics are like that have been explained
very clearly by Gabriel in his paper "Worse is better".

>>> How about a network file system waiting for server failover
>>> (especially if it is not automatic)?

>> That's not indefinite. The FS is waiting for something which
>> will eventually occur.

Here "indefinite" as to a wait duration is used in two rather
different ways. One is to say that it is "unknowable", the other
is that it is "unknown at the moment and expected to be in some
relevant sense long".

There is a fundamental difference. If the outcome (success or
failure) of an operation may or may not become known, we have a
completely different class of models of computation from the
usual Turing or Church or Von Neumann one, with rather different
properties from the usual one.

In that class the halting problem does not exist, as all
computations must complete, but the outcome on completion can be
indeterminate, which is the opposite of the usual class of models
of computation. Once upon a time I even wrote a paper (in a very
obscure journal) on the difference between the two classes of
models of computation and why it matters a lot.

In the distributed filesystem case one is trying to simulate one
class of models of computation on another, which is simply not
possible in the edge cases (those which matter). Attempting
POSIX semantics in that case requires a lot of effort and a
considerable suspension of disbelief.

> (Assuming it is correctly administered).

That's the key statement -- here the hidden assumption is that
"correctly administered" means that there is a central agency
that ensures that all operations have a known outcome if they
complete. If there is no central agency, all operations complete
because they eventually time out, but whether they succeeded or
not is not always knowable.

> That IS indefinite. Indefinite just means that the limit is
> vague and/or unknown, as opposed to having a clear and well
> defined bound.

Actually that applies only to the non-distributed case. In the
distributed case it means that the outcome may be absolutely
unknowable.

Suppose for example that you write a log entry to a file on a
Lustre file server, and the kernel code receives confirmation
that the write request has been sent, but then all communication
with the file server ceases. Has the log entry been written to
the file server disk? Well, how can you figure that out? There
is no way (unless an admin looks at the file server and thereby
restores communications). That's "indefinite" in the stronger
sense: whether the operation succeeded or failed cannot be known.

[ ... ]

> If a job hangs on a write because servers are unavailable, it
> should hang indefinitely until the server is restored, or
> until a human interrupts the process with a signal.

That is, if one wants to preserve the illusion that a model of
computation of the centralized class is available when one of the
distributed class is the reality. The human interruption is then
the point at which the illusion goes away.

The better way to handle the distributed case is to design
programs knowingly for the class of distributed models of
computation, which requires completely different programming
strategies; of course almost nobody realizes that (even if some
programmers of distributed systems with high reliability
requirements rediscover those strategies).

> Only a human can determine how valuable that job's output is.
> If the human determines that the data is very important, and
> must finish, then they leave the job hanging and go fix the
> servers. If they decide that the data is easily reproducible
> and the compute cluster is better spent running another job
> that doesn't require the down filesystem, then they have the
> ability to abort the operation with a signal. [ ... ]

Here you are assuming, though, that the underlying model of
computation is in the centralized class, that is, that the
outcome (success or failure) of an operation is knowable, even if
only by the human component of the system.

For Lustre systems this is usually a good assumption, as most
Lustre installations are centrally managed and in a single
location, and hardware and software state can and will be
inspected to determine the outcome of operations.

But people are now using Lustre across wide geographical
networks and over mobs of thousands (or tens of thousands) of
clients and servers, and in such cases it is usually not
practical to assume that the outcome of an operation is knowable.
Eventually people will learn that this means a completely
different world. Or not, as the 'O_PONIES' story about 'fsync'
and barriers demonstrates.


