[Lustre-discuss] login nodes still hang with 1.6.6
Andreas Dilger
adilger at sun.com
Sun Dec 7 22:29:06 PST 2008
On Dec 07, 2008 09:09 -0500, Brock Palen wrote:
> We upgraded our clients to 1.6.6 (the servers are still 1.6.5), and we
> are still seeing the login nodes, much more often than the compute
> nodes, being evicted and never managing to recover:
>
> LustreError: 11-0: an error occurred while communicating with
> 10.164.3.246 at tcp. The mds_connect operation failed with -16
> LustreError: 15616:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got
> rc -11 from cancel RPC: canceling anyway
> LustreError: 167-0: This client was evicted by nobackup-MDT0000; in
> progress operations using this service will fail.
Having the error messages from the servers is critical to figuring out
what is going on.
> Is this the same bug? The compute nodes look mostly ok, but the
> above still happens every few days. I don't notice any mention of
> statahead but should I go ahead and set it to 0 again?
At worst it would only slow down the "ls -l" performance, and would
tell us whether statahead is the culprit or not.
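Disabling statahead on a 1.6.x client is usually done through the llite tunable in /proc; a minimal sketch, assuming the standard proc layout on these clients (paths may differ on your installation):

```shell
# Disable statahead on every mounted Lustre filesystem (1.6.x client).
# The statahead_max=0 setting turns the feature off entirely; this is a
# sketch assuming the usual /proc/fs/lustre/llite layout.
for f in /proc/fs/lustre/llite/*/statahead_max; do
    echo 0 > "$f"
done

# Confirm the setting took effect on each filesystem:
cat /proc/fs/lustre/llite/*/statahead_max
```

The change is not persistent across remounts, so it would need to be reapplied (or scripted at mount time) if the test runs for several days.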
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.