[Lustre-discuss] Nodes claim error with files, then say everything is fine.
Chris Worley
worleys at gmail.com
Wed Aug 6 10:08:01 PDT 2008
On Wed, Aug 6, 2008 at 10:45 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
> On Wed, 2008-08-06 at 10:41 -0600, Chris Worley wrote:
>>
>> Is there anything in /proc or /sys I can look at to see whatever
>> "keepalive" parameters are set up?
>
> All timeouts are based on the obd_timeout in /proc/sys/lustre/timeout
> which MUST be the same on all nodes.
>
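As a rough illustration of the "same on all nodes" requirement, the sketch below reads the local value from /proc/sys/lustre/timeout and compares values gathered from several nodes. How the per-node values get collected (ssh, pdsh, and so on) is left out. The helper names and the node names in the sample data are hypothetical, and the file is assumed to hold a single integer number of seconds.

```python
from collections import Counter
from pathlib import Path

# Path quoted in the thread; assumed to contain one integer (seconds).
TIMEOUT_PATH = Path("/proc/sys/lustre/timeout")


def parse_timeout(text):
    """Parse the contents of the timeout file into an int."""
    return int(text.strip())


def read_local_timeout(path=TIMEOUT_PATH):
    """Read the timeout on the local node (only works where Lustre is loaded)."""
    return parse_timeout(path.read_text())


def find_mismatched_nodes(timeouts):
    """Given {node: timeout}, return (majority_value, {node: value}).

    The second element lists the nodes whose value differs from the
    most common value across all nodes.
    """
    if not timeouts:
        return None, {}
    majority, _ = Counter(timeouts.values()).most_common(1)[0]
    odd = {node: value for node, value in timeouts.items() if value != majority}
    return majority, odd


if __name__ == "__main__":
    # Hypothetical values, e.g. gathered by reading the file on each node.
    collected = {"mds1": 100, "oss1": 100, "client-rhel5": 300}
    majority, odd = find_mismatched_nodes(collected)
    print("most common timeout:", majority)
    print("nodes that differ:", odd)
```

Any node reported as differing would be the first place to look, since the value is supposed to match everywhere.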
Would you suggest I increase or decrease this value?
Is there a way to inhibit the eviction, or is it necessary to keep
truly dead clients from locking out files?
>> The systems aren't dying.
>
> They are failing to communicate with the MDS for some reason. Network
> problems perhaps? You could try enabling +rpctrace debug and inspecting
> the debug file for RPCs to see if the client is indeed sending something
> (even if it's a ping) at regular intervals.
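One way to check for "regular intervals" once the RPC send times have been pulled out of the debug file: the sketch below takes a list of timestamps in seconds and reports any gap longer than a chosen threshold. Extracting timestamps from the debug log itself is not shown, since the log format is not given in this thread. The function name `find_gaps`, the threshold, and the sample numbers are all hypothetical.

```python
def find_gaps(timestamps, max_interval):
    """Return (start, end, gap) tuples for consecutive timestamps
    that are more than max_interval seconds apart."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        delta = later - earlier
        if delta > max_interval:
            gaps.append((earlier, later, delta))
    return gaps


if __name__ == "__main__":
    # Hypothetical send times (seconds) for RPCs from one client.
    sent = [0.0, 25.0, 50.0, 140.0, 165.0]
    for start, end, gap in find_gaps(sent, max_interval=30.0):
        print(f"no RPC sent between {start} and {end} ({gap} s)")
```

A long gap on the client side would suggest the client stopped sending, while a steady stream would point more toward the network or the server side.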
All the systems (RHEL4 and 5 clients, Lustre servers) are on the same
Ethernet and IB switches. The RHEL5 nodes had no issues before the
1.6.5.1 upgrade.
Would a normal ping do it? I can jury-rig all the RHEL5 nodes to ping the MDS.
Chris