[Lustre-discuss] Clients Unmounting Lustre

Tue Sep 1 16:46:35 PDT 2009

On Sep 01, 2009  15:13 -0400, Brian J. Murrell wrote:
> On Tue, 2009-09-01 at 11:34 -0700, Don Thorp wrote:
> > New hardware that will support the workload is on the way, but are
> > there some changes I can make now to 1.6.6 that would increase
> > reliability, even at the expense of performance?
> 
> With what you have given us to work with, my first suggestion would be
> to increase your obd_timeout.  You should not need to go higher than
> about 300 seconds, but should try to choose a value only high enough to
> stop the callback timeouts.  Higher obd_timeout values mean longer
> recoveries.
> 
> Additionally, you might look into tuning the number of OST threads on
> your OSSes if you are driving your disks too hard.  OST thread count,
> like obd_timeout, should be just high enough, but not more, to reach
> maximum throughput.  If you have not baselined your hardware with the
> iokit, you can simply start dropping the OST thread counts until you
> find that you are impacting throughput.  It's a bit more trial and error
> than using the iokit, but if you are in production already, it's
> probably the best you can do.
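
For illustration only, the trial-and-error approach described above could
be sketched roughly as follows. This is a minimal sketch, not a Lustre
tool: measure_throughput is a hypothetical callback that you would have to
supply yourself (for example, apply the thread count, run your usual
workload, and return the MB/s you observed), and the 5% tolerance is an
arbitrary assumption.

```python
from typing import Callable, Sequence


def lowest_full_throughput_threads(
    candidates: Sequence[int],
    measure_throughput: Callable[[int], float],
    tolerance: float = 0.05,
) -> int:
    """Step the thread count down and return the smallest count whose
    throughput is still within `tolerance` of the best value seen so far.

    `measure_throughput` is a hypothetical, user-supplied function; this
    sketch does not talk to Lustre at all.
    """
    if not candidates:
        raise ValueError("need at least one candidate thread count")

    ordered = sorted(candidates, reverse=True)  # try the highest count first
    peak = measure_throughput(ordered[0])
    best = ordered[0]

    for threads in ordered[1:]:
        throughput = measure_throughput(threads)
        if throughput < peak * (1.0 - tolerance):
            break  # throughput is now being impacted; keep the previous count
        peak = max(peak, throughput)
        best = threads
    return best


if __name__ == "__main__":
    # Made-up numbers, purely to show the shape of the search.
    fake_results = {512: 900.0, 256: 950.0, 128: 940.0, 64: 700.0, 32: 400.0}
    print(lowest_full_throughput_threads(list(fake_results), fake_results.__getitem__))
```

With the made-up numbers in the example, the search stops at 128 threads,
the last count before throughput falls away.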

Note that in 1.6 changing the OSS thread count is not dynamic; it
needs a server restart.  In 1.8.1 (IIRC) it is possible to increase
the thread count at runtime, though it can't yet be reduced.

As a very rough estimate, 4 OSS threads per spindle would not be a
terrible first approximation.
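
To make the arithmetic concrete, here is a small sketch of that rule of
thumb. The function name and the 24-spindle example are made up for
illustration; only the factor of 4 comes from the estimate above, and it
is a starting point to tune from, not a recommendation.

```python
def estimate_oss_threads(spindles: int, threads_per_spindle: int = 4) -> int:
    """Rough starting point for the OSS thread count on one OSS.

    `spindles` is the number of disks behind that OSS; the default of
    4 threads per spindle is the rough estimate quoted above.
    """
    if spindles < 1 or threads_per_spindle < 1:
        raise ValueError("spindles and threads_per_spindle must be positive")
    return spindles * threads_per_spindle


if __name__ == "__main__":
    # Hypothetical OSS with 24 disks: 24 * 4 = 96 threads.
    print(estimate_oss_threads(24))
```

On 1.6 whatever value you settle on only takes effect after the server
restart mentioned above.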

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.