[Lustre-discuss] o2ib possible network problems -- solved

Brian Behlendorf behlendorf1 at llnl.gov
Mon Sep 22 16:34:45 PDT 2008


> FWIW, 1000 is waaaaay high.  Our biggest production systems (thousands
> if not 10s of thousands) nodes don't use values higher than 300 seconds.

Since I'm here at LLNL and we happen to have a few of these large systems,
maybe I should chime in.  It is true that our large systems (many thousands
of nodes) use a timeout value of 300s, but that value does not prevent all of
our timeouts.  Actual usage has simply shown that 300s prevents 99% of them
while still allowing reasonable recovery times.  To eliminate the rest, I
feel the only viable solution is to validate the new adaptive timeout feature
for our production use.

-- 
Thanks,
Brian