[Lustre-discuss] Another server question.

Wed Feb 4 01:33:18 PST 2009

On Feb 03, 2009  12:21 -0500, Charles Taylor wrote:
> In our experience, despite what has been said and what we have read,  
> if we lose or take down a single OSS, our clients lose access (i/o  
> seems blocked) to the file system until that OSS is back up and has  
> completed recovery.  That's just our experience and it has been very
> consistent.   We've never seen otherwise, though we would like to.  :)

To be clear: a client process will wait indefinitely until an OST
is back up, unless either the process is killed (this should be
possible once the Lustre recovery timeout, 100s by default, has been
exceeded), or the OST is explicitly marked "inactive" on the clients:

	lctl --device {failed OSC device on client} deactivate

After the OSC is marked inactive, all I/O to that OST should
immediately return with -EIO rather than hang.
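
As a rough illustration of the step above, here is a minimal sketch of
how the device number for the failed OSC might be looked up on a client
before deactivating it.  It assumes that "lctl dl" prints one device per
line as "<index> <state> <type> <name> <uuid> <refcount>" and that the
OSC name starts with the OST name followed by "-osc"; the helper
osc_index and the OST name lustre-OST0001 are made up for the example,
so check the output on your own client first.

```shell
#!/bin/sh
# Sketch only, not a supported tool.
# ASSUMPTION: device list lines look like
#   "<index> <state> <type> <name> <uuid> <refcount>"
# ASSUMPTION: the OSC for OST "lustre-OST0001" (hypothetical name) is
#   named "lustre-OST0001-osc-...".

# osc_index OSTNAME
# Reads a device list on stdin and prints the index of every "osc"
# device whose name begins with "OSTNAME-osc".
osc_index() {
    awk -v ost="$1" '$3 == "osc" && index($4, ost "-osc") == 1 { print $1 }'
}

# Intended use on a client, as root (left commented out so that
# sourcing this file changes nothing):
#
#   idx=$(lctl dl | osc_index lustre-OST0001)
#   [ -n "$idx" ] && lctl --device "$idx" deactivate
```

Once the OST is back and has finished recovery, the same device number
can be handed to "lctl --device {device} activate" to resume I/O to it.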

If you see behavior other than this, it is a bug.  If this isn't
explained in the documentation, it is a documentation bug.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.