[Lustre-discuss] Aborting recovery

Fri Mar 6 12:43:09 PST 2009

Thomas Roth wrote:
> Brian J. Murrell wrote:
>   
>> On Fri, 2009-03-06 at 20:09 +0100, Thomas Roth wrote:
>>     
>>> But this is not what our users observe. Even on an otherwise perfectly
>>> working system, they report I/O errors on access to some files.
>>>       
>> EIO == eviction.

To be clear: while that is evidently true in this one case, there are
many other reasons why a client could get EIO, for example when the
server's path to the disk is down or returning errors.
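
A minimal sketch of what the application itself can tell from such an
error, in Python. The helper names (describe_io_error, read_file) are
made up for illustration and are not part of Lustre or any library. The
point is that errno only says "I/O error"; the cause has to come from
the client and server logs.

```python
import errno


def describe_io_error(exc):
    """Turn an OSError into a short hint for the user (hypothetical helper)."""
    if exc.errno == errno.EIO:
        # EIO alone cannot distinguish an eviction from a failing disk path.
        return "EIO: could be an eviction or a storage error, check the logs"
    return "other error: " + errno.errorcode.get(exc.errno, "unknown")


def read_file(path):
    """Read a file, printing a hint before re-raising any OS-level error."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError as exc:
        print(describe_io_error(exc))
        raise
```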

>>> I  can usually see something happening in the logs of OST and client:
>>> The OST starts with "timeout on bulk PUT after 6+0s", which the OST is
>>> first "ignoring bulk IO comm error" in the hope that "client will
>>> retry".
>>>       
>> Wait a minute.  This thread is about server recovery, not communications
>> failures.  You are mixing up errors and situations here.
>>
>> Communications failures will result in timeouts on the server and that
>> will result in evictions which will result in EIOs for your
>> applications.  This has got nothing to do with server recovery though.
>>     
>
> You are right, of course, this comes from a different situation. I just
> assumed that if a client cannot cope with a 1sec-interruption due to a
> communication failure, resulting in an EIO, how can it (resp. the
> application) survive an interruption of the entire system of several hours.
> Of course, if the client does react in a different manner during server
> recovery, then also the application will see things differently.
> I guess that's what I misunderstood. In fact the client's logs during
> yesterday's recovery don't look so bad at all ;-) Just a number of
> "Request xyz sent from MDT0000-mdc to NID MGS ... timed out", as expected.
> Thanks for pointing this out.
> Thomas
>   

As Brian said, there is a difference between the _server_ detecting that
a client is "down" and evicting it, and the _client_ continually
attempting to reconnect and "finish" its I/O (server failover).

Failover does not "solve" network stability problems.  Basically, 
clients try forever, servers do not.

With a successful recovery, the clients can retry any operations that
have not been acknowledged by the server. But if a client is "down"
(in reality or due to network issues), it is evicted so that the servers
can reclaim its locks, etc., and keep the filesystem from hanging
because of one client. Clients hanging because a server is down, on the
other hand, is the feature that allows failover.
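
To illustrate the asymmetry, here is a toy model in Python. It is not
Lustre code: ToyServer, client_request, the tick counter and the
evict_after value are all invented for this sketch, and a real client
reconnects after an eviction rather than staying evicted.

```python
class ToyServer:
    """Hypothetical server: down until a given tick, evicts silent clients."""

    def __init__(self, evict_after, down_until=0):
        self.evict_after = evict_after   # ticks of silence tolerated
        self.down_until = down_until     # server gives no reply before this tick
        self.last_seen = {}
        self.evicted = set()

    def handle(self, client, now):
        if now < self.down_until:
            return None                  # server down: the request times out
        if client in self.evicted:
            return "EIO"                 # evicted client: the operation fails
        self.last_seen[client] = now
        return "ok"

    def evict_silent_clients(self, now):
        """Servers do not wait forever: drop clients that stayed silent."""
        for client, seen in list(self.last_seen.items()):
            if now - seen > self.evict_after:
                self.evicted.add(client)
                del self.last_seen[client]


def client_request(server, client, start, give_up_after=10_000):
    """Clients keep retrying; the cap only keeps this toy model finite."""
    for now in range(start, start + give_up_after):
        reply = server.handle(client, now)
        if reply is not None:
            return reply, now
    raise RuntimeError("toy model ran out of ticks")


# Server failover: the client hangs for 50 ticks, then its request completes.
failover = ToyServer(evict_after=7, down_until=50)
print(client_request(failover, "c1", start=0))   # ('ok', 50)

# Network trouble: the server is up, but the client is silent for too long.
flaky = ToyServer(evict_after=7)
flaky.handle("c2", now=0)
flaky.evict_silent_clients(now=10)
print(client_request(flaky, "c2", start=11))     # ('EIO', 11)
```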

Kevin

>>> "Request ... has timed out
>>> (limit 7s)", "Connection to service was lost; in progress operations
>>> using this service will fail", finally "Connection restored to service".
>>>       
>> Yes.  This is a timeout and nothing to do with the subject of server
>> recovery.
>>
>> b.
>>
>>