[Lustre-discuss] o2ib possible network problems -- solved

Mon Sep 22 13:17:27 PDT 2008

Hello All,

I honestly do not know how it happened, but /proc/sys/lustre/timeout on
the OSS box was set to 100, while every other system was set to 1000.
Once I changed the value on the OSS to 1000, the error messages stopped
on all of the related systems.   I got the idea to re-check from an
e-mail message sent by Brian Murrell, archived on os-dir, referring to
bug 16237.  Brian listed the timeout value as another thing to check.

Interestingly enough, the readahead (blockdev --report /dev/sdX) on
the same OSS was set to 672.   I have no idea where that came from
either.  All of the other systems have a reported readahead value of
256.   I had changed the readahead value on the OSS box first (blockdev
--setra 256 /dev/sdX), but the error messages did not stop until I
fixed the value in /proc/sys/lustre/timeout.
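
For anyone wanting to repeat the check on their own nodes, here is a
minimal sketch of the comparison described above. check_value is a
hypothetical helper (not part of Lustre or util-linux), and the expected
values 1000 and 256 are simply the ones reported on the other systems in
this thread.

```shell
#!/bin/sh
# Compare a tunable's current value against the value expected cluster-wide.
# check_value is a hypothetical helper, not part of Lustre or util-linux.
check_value() {
    name=$1; actual=$2; expected=$3
    if [ "$actual" = "$expected" ]; then
        echo "OK: $name = $actual"
    else
        echo "MISMATCH: $name = $actual (expected $expected)"
        return 1
    fi
}

# Example run on the OSS (blockdev needs root; /dev/sdX is a placeholder):
#   check_value timeout   "$(cat /proc/sys/lustre/timeout)" 1000
#   check_value readahead "$(blockdev --getra /dev/sdX)"    256
```

Running the same two checks on every server and client should make an
odd node stand out quickly.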

How could my /proc have such odd values in it?

I will see if the change holds for now.   I may have to do something
to make it persistent for future reboots.
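
One possible way to make the values survive a reboot is to re-apply
them from a late boot script. The sketch below is only an illustration:
set_tunable is a hypothetical helper, and it assumes the Lustre modules
are already loaded when it runs, so that the /proc entry exists.

```shell
#!/bin/sh
# Write a value into a /proc-style tunable only when it differs.
# set_tunable is a hypothetical helper; the path and value are arguments.
set_tunable() {
    path=$1; wanted=$2
    current=$(cat "$path") || return 1
    if [ "$current" != "$wanted" ]; then
        echo "$wanted" > "$path" || return 1
        echo "$path: $current -> $wanted"
    fi
}

# Possible lines for a boot script on the OSS (assumption: run as root,
# after the Lustre modules are loaded; /dev/sdX is a placeholder):
#   set_tunable /proc/sys/lustre/timeout 1000
#   blockdev --setra 256 /dev/sdX
```

The Lustre manual also describes setting the timeout for a whole
filesystem from the MGS with lctl conf_param; check the manual for your
version before relying on that instead.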

Cheers!
megan