[Lustre-discuss] Another server question.

Wed Feb 4 03:34:07 PST 2009

On Feb 4, 2009, at 4:33 AM, Andreas Dilger wrote:

> On Feb 03, 2009  12:21 -0500, Charles Taylor wrote:
>> In our experience, despite what has been said and what we have read,
>> if we lose or take down a single OSS, our clients lose access (i/o
>> seems blocked) to the file system until that OSS is back up and has
>> completed recovery.  That's just our experience and it has been very
>> consistent.  We've never seen otherwise, though we would like to.  :)
>
> To be clear - a client process will wait indefinitely until an OST
> is back alive, unless either the process is killed (this should be
> possible after the Lustre recovery timeout is exceeded, 100s by
> default), or the OST is explicitly marked "inactive" on the clients:
>
> 	lctl --device {failed OSC device on client} deactivate
>
> After the OSC is marked inactive, then all IO to that OST should
> immediately return with -EIO, and not hang.

Thanks Andreas, I think that clears things up and will help us  
understand what to expect going forward.
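
For anyone following along, here is a minimal sketch of how one might
find the device number to pass to the deactivate command quoted above.
It assumes the client's device listing (as printed by "lctl dl") has
the columns index, state, type, name; the helper name
osc_deactivate_cmd, the file system name "testfs" and the sample
listing are all made up for illustration. It only prints the command
and does not run it.

```shell
# Print (do not run) the deactivate command for the OSC serving a given OST.
# Reads a device listing on stdin; assumed columns: index state type name ...
osc_deactivate_cmd() {
    ost_name="$1"
    # Keep only "osc" devices whose name starts with the OST name,
    # then emit the lctl command using the device index in column 1.
    awk -v ost="$ost_name" '$3 == "osc" && index($4, ost) == 1 {
        print "lctl --device " $1 " deactivate"
    }'
}

# Hypothetical sample listing, not captured from a real client.
sample='  0 UP mgc MGC10.0.0.1@tcp example-uuid-0 5
  1 UP osc testfs-OST0000-osc example-uuid-1 5
  2 UP osc testfs-OST0001-osc example-uuid-2 5'

printf '%s\n' "$sample" | osc_deactivate_cmd testfs-OST0001
```

On a real client the listing would come from "lctl dl" piped into the
helper. Review the printed command before running it by hand.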

> If you have experiences other than this it is a bug.  If this isn't
> explained in the documentation it is a documentation bug.

If that is spelled out clearly in the documentation, I missed it  
(certainly possible).   I hope I indicated that this business has  
never been a show-stopper for us.   Typically, if we lose an OSS or  
OST our top priority is getting it back in service.   As you indicate,  
most clients wait and resume when recovery is complete and this is  
usually fine with us.  In fact, it's awesome and users understand it
since it is akin to what they were used to w/ NFS - back in the day.

We love you man!   :)

Charlie Taylor
UF HPC Center