[Lustre-discuss] Recovery without end

Wed Feb 25 04:25:57 PST 2009

Hi all,

we have a problem with our production system (v. 1.6.5.1). It is in
recovery, but recovery never finishes.
The background is a series of unexplained problems with the MDT and
several attempts to restart the MDS. Each time, the MDT would start
recovery, at some point during recovery lose the connection to its OSTs,
restart recovery, and so on.

I then moved the service to a partner machine, where recovery started with
>> 11:37:07: ... in recovery for at least 5:00, or until 415 clients
reconnect.

(I always understood these numbers as minutes, but
/proc/.../recovery_status usually starts at 3000 sec, and 5 min would
be a little less than that...)

The countdown went on until
>> 12:03:32:  ...227 clients in recovery for 1457s

Four minutes later, there were
>> 12:07:21: ...133 recoverable clients remain

Then something bad must have happened, because
>> 12:07:42:  ...121 clients in recovery for 20721s

Most of these clients apparently caused no problem, because only 4 minutes later
>> 12:11:52:  ...1 clients in recovery for 20471s

So far, the countdown continues, but of course these are extremely long
recovery times.
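
As a sanity check on the numbers quoted above, here is a small Python
sketch. It assumes that the figure after "for" in each log line is the
remaining time on a countdown in seconds (my reading, not something the
logs confirm) and that all timestamps are from the same day. The names
samples, elapsed_seconds and countdown_jumps are made up for this
illustration.

```python
from datetime import datetime

# (wall-clock time, seconds shown in the log line), copied from the
# messages quoted above. Treating the value as "seconds remaining" is
# an assumption.
samples = [
    ("12:03:32", 1457),
    ("12:07:42", 20721),
    ("12:11:52", 20471),
]


def elapsed_seconds(start, end):
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def countdown_jumps(samples):
    """For each consecutive pair of samples, return how far the reported
    value is from what a steady one-second-per-second countdown gives."""
    jumps = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        expected = s0 - elapsed_seconds(t0, t1)
        jumps.append((t1, s1 - expected))
    return jumps


for when, jump in countdown_jumps(samples):
    print(f"{when}: reported value is {jump:+d}s off a steady countdown")

# Starting value implied by the first sample, counted back to 11:37:07.
initial = samples[0][1] + elapsed_seconds("11:37:07", samples[0][0])
print(f"implied starting value at 11:37:07: {initial}s")
```

Under these assumptions the timer was reset upwards by 19514 s (about
5.4 hours) somewhere between 12:03:32 and 12:07:42, and has counted down
steadily since then (20721 s minus 250 s elapsed gives exactly 20471 s).
The implied starting value of 3042 s also fits the roughly 3000 sec that
recovery_status usually starts at, rather than 5 minutes.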

My questions:
Where might I have misconfigured the system to wait that long for a client?
Is there a command to abort the recovery?

All the OSTs seem to be connected and happy. I therefore guess that the
remaining client is just one client in the usual sense - a batch node
or similar machine that still has the system mounted. Of course I would
not hesitate to kick out that client - or many of them if necessary -
but I don't know which one it is. So another question: how can I find
out the identities of the clients, and which of them are recoverable,
in recovery, without problems, or gone for good?

Many thanks,
Thomas