[Lustre-discuss] Lustre clients failing, and cant reconnect
Brian J. Murrell
Brian.Murrell at Sun.COM
Thu Sep 4 20:19:04 PDT 2008
On Thu, 2008-09-04 at 22:58 -0400, Brock Palen wrote:
> I am having clients lose their connection to the MDS. Messages on
> the clients look like this:
>
> Sep 4 19:51:30 nyx-login2 kernel: Lustre:
> nobackup-MDT0000-mdc-00000101fc44e800: Connection to service
> nobackup-MDT0000 via nid 10.164.3.246@tcp was lost; in progress
> operations using this service will wait for recovery to complete.
> Sep 4 19:51:30 nyx-login2 kernel: LustreError: 11-0: an error
> occurred while communicating with 10.164.3.246@tcp. The mds_connect
> operation failed with -16
>
> It will keep doing this, trying to connect and spitting out mds_connect
> failed -16. The clients never recover.
>
> On the MDS, all I see is:
>
> Lustre: 7653:0:(ldlm_lib.c:760:target_handle_connect())
> nobackup-MDT0000: refuse reconnection from
> 618cf36e-a7a6-a7d9-077c-7cbaee1e80b3@141.212.31.43@tcp to
> 0x000001037c109000; still busy with 3 active RPCs
>
> Many hosts are getting this same RPC message.
>
> Clients and servers are all using TCP.
>
> Is this enough information?
Probably. If you are running 1.6.5, try disabling statahead on all of
your clients...
# echo 0 > /proc/fs/lustre/.../statahead_max
Of course, this setting goes back to its default of 32 on a reboot.
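For more than a handful of clients, the same write can be wrapped in a
small shell function. This is only a sketch: `set_statahead_max` is a
hypothetical helper name, not a Lustre tool, and because the /proc path
above is elided, the function takes the statahead_max file(s) as
arguments instead of guessing the path.

```shell
#!/bin/sh
# Hypothetical helper: write one value into each statahead_max file given.
# The real /proc path is elided in the message above, so the files are
# supplied by the caller rather than hard-coded here.
set_statahead_max() {
    value=$1
    shift
    rc=0
    for f in "$@"; do
        if [ -w "$f" ]; then
            # Same operation as the one-line echo above, once per file.
            echo "$value" > "$f" || rc=1
        else
            echo "skipping $f: not writable" >&2
            rc=1
        fi
    done
    # Non-zero if any file could not be updated.
    return $rc
}
```

Run as root on each client, for example `set_statahead_max 0` followed
by the statahead_max path(s) for that client's mounts. Since the value
reverts on reboot, it would presumably need to be re-run from a boot
script after the filesystem is mounted.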
b.