[Lustre-discuss] Nodes claim error with files, then say everything is fine.

Wed Aug 6 10:08:01 PDT 2008

On Wed, Aug 6, 2008 at 10:45 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> On Wed, 2008-08-06 at 10:41 -0600, Chris Worley wrote:
>>
>> Is there anything in /proc or /sys I can look at to see whatever
>> "keepalive" parameters are set up?
>
> All timeouts are based on the obd_timeout in /proc/sys/lustre/timeout
> which MUST be the same on all nodes.
>
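
Since the value has to match everywhere, here is a minimal sketch of how
one might check it. It reads the local value from the path quoted above
and compares values gathered from several nodes. The node names and
numbers are made up for illustration, and how the per-node values get
collected (ssh, pdsh, etc.) is left out; only the /proc path comes from
this thread.

```python
from collections import Counter
from pathlib import Path

# Path quoted in this thread; everything else below is illustrative.
TIMEOUT_PATH = Path("/proc/sys/lustre/timeout")


def read_local_timeout(path=TIMEOUT_PATH):
    """Return this node's obd_timeout (in seconds) as an int."""
    return int(Path(path).read_text().strip())


def find_mismatches(timeouts):
    """Given {node_name: timeout}, return the nodes whose value
    differs from the most common value across all nodes."""
    if not timeouts:
        return {}
    majority, _count = Counter(timeouts.values()).most_common(1)[0]
    return {node: value for node, value in timeouts.items()
            if value != majority}


if __name__ == "__main__":
    # Hypothetical values, as if collected from each node by hand.
    collected = {"mds1": 100, "oss1": 100, "rhel5-client1": 25}
    print(find_mismatches(collected))
```

Any node that shows up in the result would be the first place to look.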

Would you suggest I increase or decrease this value?

Is there a way to inhibit the eviction, or is it necessary to keep
truly dead clients from locking out files?

>> The systems aren't dying.
>
> They are failing to communicate with the MDS for some reason.  Network
> problems perhaps?  You could try enabling +rpctrace debug and inspecting
> the debug file for RPCs to see if the client is indeed sending something
> (even if it's a ping) at regular intervals.
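
To check the "regular intervals" part, a rough sketch like the following
could flag gaps once the RPC timestamps have been pulled out of the
debug file. It assumes the timestamps are already extracted as seconds;
the debug file format is not parsed here, and the threshold is a
hypothetical parameter to be chosen relative to the configured timeout.

```python
def find_gaps(timestamps, max_interval):
    """Return (previous, current) timestamp pairs whose spacing
    exceeds max_interval seconds."""
    ordered = sorted(timestamps)
    return [(prev, cur) for prev, cur in zip(ordered, ordered[1:])
            if cur - prev > max_interval]


if __name__ == "__main__":
    # Hypothetical send times in seconds; one long silence in the middle.
    sent = [0.0, 25.0, 50.0, 140.0, 165.0]
    print(find_gaps(sent, max_interval=30.0))
```

A non-empty result would mean the client went quiet for longer than
expected between two RPCs.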

All the systems (RHEL4 and 5 clients, Lustre servers) are on the same
Ethernet and IB switches.  The RHEL5 nodes had no issues before the
1.6.5.1 upgrade.

Would a normal ping do it?  I can jury-rig all the RHEL5 nodes to ping the MDS.

Chris