[Lustre-discuss] MDS refuses connections (no visible reason)

Thu Mar 5 09:56:20 PST 2009

Hi all,

After running for days without any problems, our MDS has been refusing
to cooperate for two hours now.
The log files show nothing until
>Mar  5 16:46:24 mds1 kernel: Lustre: 17841:0:(ldlm_lib.c:525:target_handle_reconnect()) MDT0000: 481fa70b-590d-31b6-f621-c6125a54bfff reconnecting
>Mar  5 16:46:24 mds1 kernel: Lustre: 17841:0:(ldlm_lib.c:760:target_handle_connect()) MDT0000: refuse reconnection from 481fa70b-590d-31b6-f621-c6125a54bfff at 1.2.3.4@tcp to 0xffff8107ef44a000; still busy with 2 active RPCs

I would have expected this to be a matter between the MDT and this one
client only. However, the log continues in the same way for many other
clients.
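
To get a feeling for how many distinct clients are affected, something like the following sketch could tally the refusals per client UUID. It assumes the syslog messages are unwrapped (one message per line) and look like the "refuse reconnection" line quoted above; the function name `count_refusals` and the regular expression are just my illustration, not anything provided by Lustre.

```python
import re
from collections import Counter

# Pattern modelled on the "refuse reconnection" message quoted above.
# Assumption: each syslog message sits on a single, unwrapped line and
# the client UUID has the usual 36-character 8-4-4-4-12 form.
REFUSE_RE = re.compile(r"refuse reconnection from (?P<uuid>[0-9a-f-]{36})")

def count_refusals(lines):
    """Return a Counter mapping client UUID -> number of refused reconnects."""
    counts = Counter()
    for line in lines:
        match = REFUSE_RE.search(line)
        if match:
            counts[match.group("uuid")] += 1
    return counts
```

Feeding it the lines of the MDS syslog file (wherever that lives on your system) and printing `count_refusals(...).most_common()` would show whether a few clients dominate or whether everybody is being refused.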

Now the MDS is refusing all connections, bringing the system to a
standstill.

The situation also caused about 130 log dumps to be written to /tmp.
Most of these are small and contain just
>Watchdog triggered for pid 17866: it was inactive for 12000s
>Unable to dump stack because of missing export

A few are larger and contain more complaints about lengthy requests and
possible timeouts:
>ptlrpc_server_handle_request   Request x75091039 took longer than estimated (42+4208s); client may timeout.
or
>ptlrpc_server_handle_request   Dropping timed-out request from 12345-140.181.114.222 at tcp: deadline 1000+923s ago
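
With around 130 dump files it is tedious to read each one, so a rough sketch like the one below could pull the numbers out of the two message types quoted above. It assumes the messages are on unwrapped lines; the function name `summarize_dump` and the result keys are made up for illustration, and reading the second number in each pair as the number of seconds past the estimate or deadline is my interpretation of the messages, not something I have confirmed.

```python
import re

# Patterns modelled on the two ptlrpc messages quoted above.
# Assumption: each message is on one unwrapped line.
SLOW_RE = re.compile(r"took longer than estimated \((\d+)\+(\d+)s\)")
DROP_RE = re.compile(r"Dropping timed-out request .*?deadline (\d+)\+(\d+)s ago")

def summarize_dump(text):
    """Count slow and dropped requests and report the largest second number."""
    # Assumption: the second number of each pair is the overrun in seconds.
    slow = [int(over) for _, over in SLOW_RE.findall(text)]
    dropped = [int(late) for _, late in DROP_RE.findall(text)]
    return {
        "slow_requests": len(slow),
        "max_overrun_s": max(slow, default=0),
        "dropped_requests": len(dropped),
        "max_late_s": max(dropped, default=0),
    }
```

Running this over the text of each dump file would at least show whether the delays are all of the same order or whether a few requests stand out.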

None of these seem critical?
Maybe all clients have timed out for some reason?
Even so, I would expect the MDS to remain responsive, say to a mount
request from a fresh client that cannot have any leftover transactions
pending on it?

Right now the only option I see is to reboot the server, which is of
course not a nice procedure on a system we advertised as stable and
reliable to our users...

So any help will be much appreciated.
Regards,
Thomas