[Lustre-discuss] Recovery without end

Wed Feb 25 08:37:04 PST 2009

On Wed, 2009-02-25 at 11:22 -0500, Charles Taylor wrote:
> I know you will be tempted to tell us that our network must be flaky
> but it simply is not. We'd love to understand why we need such a
> large timeout value and why, if we don't use a large value, we see
> these transport end-point failures. However, after spending several
> days trying to understand and resolve the issue, we finally just
> accepted the long timeout as a suitable workaround.

I'd encourage you to upgrade to the latest version of Lustre (so that we
are not chasing old, already-fixed bugs), then re-evaluate your timeout
and report how it works out for you.  If you still see unreliability,
please file a bug.

I'd also suggest (if you have not already done so) that you use the
Lustre I/O kit (lustre-iokit) to make sure your OSSes are properly tuned
for the storage bandwidth available to them and are not tying up OST
processes for overly long periods while waiting for storage access.

> I wonder if there are others who have silently done the same. We'll
> be upgrading to 1.6.6 or 1.6.7 in the not-too-distant future. Maybe
> then we'll be able to do away with the long timeout value but until
> then, we need it.  :(

Sounds like a good idea.

b.
