[Lustre-discuss] Lustre clients failing, and can't reconnect

Brian J. Murrell Brian.Murrell at Sun.COM
Thu Sep 4 20:19:04 PDT 2008


On Thu, 2008-09-04 at 22:58 -0400, Brock Palen wrote:
> I am having clients lose their connection to the MDS.  Messages on  
> the clients look like this:
> 
> Sep  4 19:51:30 nyx-login2 kernel: Lustre: nobackup-MDT0000-mdc-00000101fc44e800:
> Connection to service nobackup-MDT0000 via nid 10.164.3.246@tcp was lost;
> in progress operations using this service will wait for recovery to complete.
> Sep  4 19:51:30 nyx-login2 kernel: LustreError: 11-0: an error occurred
> while communicating with 10.164.3.246@tcp. The mds_connect operation
> failed with -16
> 
> It keeps doing this, trying to connect and spitting out "mds_connect
> failed -16".  The clients never recover.
> 
> On the MDS, all I see is:
> 
> Lustre: 7653:0:(ldlm_lib.c:760:target_handle_connect()) nobackup-MDT0000:
> refuse reconnection from 618cf36e-a7a6-a7d9-077c-7cbaee1e80b3@141.212.31.43@tcp
> to 0x000001037c109000; still busy with 3 active RPCs
> 
> I get this RPC message for many hosts.
> 
> Clients and servers are all using TCP.
> 
> Is this enough information?

Probably.  If you are running 1.6.5, try disabling statahead on all of
your clients...

# echo 0 > /proc/fs/lustre/.../statahead_max

Of course, this setting goes back to its default of 32 on a reboot.
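
If a client has more than one statahead_max file, or you want to reapply
the setting after a reboot, something along these lines might help. It is
only a rough sketch: set_statahead_max is a made-up helper name, and the
base directory is passed in by the caller because the exact location
under /proc/fs/lustre is elided above and is not guessed here.

```shell
# Hypothetical helper (not part of Lustre): write a value, 0 by default,
# to every file named statahead_max found under the given base directory.
set_statahead_max() {
    base_dir=$1
    value=${2:-0}
    # find prints nothing when no file matches, so the loop simply does not run
    find "$base_dir" -name statahead_max 2>/dev/null | while IFS= read -r f; do
        if echo "$value" > "$f"; then
            echo "set $f to $value"
        else
            echo "could not write $f" >&2
        fi
    done
}
```

Run as root, "set_statahead_max /proc/fs/lustre" should have the same
effect as the echo above on each matching file. Calling it from a local
startup script after the filesystem is mounted is one way to survive
reboots, assuming your clients have such a hook.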

b.
