[Lustre-discuss] o2ib possible network problems -- solved

Brian J. Murrell Brian.Murrell at Sun.COM
Mon Sep 22 13:27:52 PDT 2008


On Mon, 2008-09-22 at 16:17 -0400, Ms. Megan Larko wrote:
> Hello All,
> 
> I honestly do not know how it happened, but the value in
> /proc/sys/lustre/timeout on the OSS box was set to 100. All other
> systems were set to 1000.

FWIW, 1000 is way too high.  Our biggest production systems (thousands,
if not tens of thousands, of nodes) don't use values higher than 300
seconds.  You might want to try lowering that value to 300 seconds (on
all nodes, of course!) and see if things stay stable.  You might also
experiment with even lower values (100s is the default) to find the
lowest setting at which you remain stable.  The downside of a high
obd_timeout is long recovery times.
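
Since the original problem was one node carrying a different timeout from
the rest, a quick consistency check may help before changing anything.
The following is only a rough sketch in Python, not an official Lustre
tool: the path is the one quoted above, the host names are made up, and
how you gather the per-node values (ssh, pdsh, etc.) is left to you.

```python
from collections import Counter
from pathlib import Path

# Path as quoted in this thread; other Lustre versions may expose it elsewhere.
TIMEOUT_PATH = Path("/proc/sys/lustre/timeout")


def read_local_timeout(path=TIMEOUT_PATH):
    """Return the timeout (in seconds) configured on the local node."""
    return int(path.read_text().strip())


def find_mismatched(timeouts):
    """Given {hostname: timeout}, return the hosts that differ from the majority value."""
    if not timeouts:
        return {}
    majority, _ = Counter(timeouts.values()).most_common(1)[0]
    return {host: value for host, value in timeouts.items() if value != majority}


if __name__ == "__main__":
    # Hypothetical host names; the values mirror the situation reported above.
    collected = {"oss1": 100, "mds1": 1000, "client1": 1000, "client2": 1000}
    print(find_mismatched(collected))
```

Any host this reports is a candidate for the kind of mismatch described
above; whatever value you settle on should then be applied to every node.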

b.
