[Lustre-discuss] login nodes still hang with 1.6.6

Andreas Dilger adilger at sun.com
Sun Dec 7 22:29:06 PST 2008


On Dec 07, 2008  09:09 -0500, Brock Palen wrote:
> We upgraded our clients to 1.6.6 and the servers are still on 1.6.5.
> We are still seeing the login nodes, much more than the compute
> nodes, being evicted and then never able to recover:
> 
> LustreError: 11-0: an error occurred while communicating with  
> 10.164.3.246 at tcp. The mds_connect operation failed with -16
> LustreError: 15616:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got  
> rc -11 from cancel RPC: canceling anyway
> LustreError: 167-0: This client was evicted by nobackup-MDT0000; in  
> progress operations using this service will fail.

Having the error messages from the servers is critical to figuring out
what is going on.  (For reference, -16 is -EBUSY and -11 is -EAGAIN;
-EBUSY on mds_connect usually means the MDS still considered the client
connected, or was busy in recovery, when the client tried to reconnect.)
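
A quick way to grab them is to pull the kernel messages on the MDS
(10.164.3.246 here) from around the time of the eviction; assuming a
standard syslog setup, something like:

    # on the MDS, around the time of the eviction:
    dmesg | grep -i lustre
    grep -i lustre /var/log/messages

The MDS typically logs its reason for an eviction (e.g. a lock
callback timer expiring) at the moment it happens, and that is the
piece we are missing from the client-side messages above.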

> Is this the same bug?  The compute nodes look mostly OK, but the
> above still happens every few days.  I don't notice any mention of
> statahead, but should I go ahead and set it to 0 again?

At worst, disabling statahead would only slow down "ls -l"
performance, and it would tell us whether or not statahead is the
culprit.
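
If you do want to turn it off, statahead is a per-client tunable, so
it needs to be set on each login node.  A sketch, assuming your lctl
supports set_param (added during the 1.6 series):

    # disable statahead on this client:
    lctl set_param llite.*.statahead_max=0

    # or directly via /proc, one Lustre mount at a time:
    for f in /proc/fs/lustre/llite/*/statahead_max; do echo 0 > $f; done

Note that this does not persist across a remount, so if it helps you
will want to put it in an init script on the login nodes.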

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



