[Lustre-discuss] Recovery without end

Wed Feb 25 07:09:54 PST 2009

OK, at an ETA of 8100 sec we lost patience and ran
> lctl --device MDS-Name abort_recovery

This obviously did the trick,
>>  recovery period over; 1 clients never reconnected after 14483s (414
clients did)

Access to the system seems to work as expected.

Still we are not satisfied at all. One thing we would like to know,
urgently, is how to find out which client caused that delay.
As indicated before, we have no problem nuking a silly client, tearing
it apart, ripping out its memory banks or whatever violent action might
be needed.
Most probably, though, the fault lies within our configuration, not this
single client (perhaps this is a machine that had a Lustre mount some
time ago and is now switched off; batch nodes tend to die every now and
then).

Our /proc/sys/lustre/timeout is 1000. There has been some debate here
about this large value, but most other installations will not run in a
network environment with a setup as crazy as ours. Setting the timeout
to 100 immediately results in "Transport endpoint" errors, which makes it
impossible to run Lustre like this.

Since this is a 1.6.5.1 system, I activated the adaptive timeouts and
set them to equally large values:
/sys/module/ptlrpc/parameters/at_max = 6000
/sys/module/ptlrpc/parameters/at_history = 6000
/sys/module/ptlrpc/parameters/at_early_margin = 50
/sys/module/ptlrpc/parameters/at_extra = 30

Reading the manual, I understood that at_max is a maximum value. I
learned from an earlier question I posted on this list that with the
static timeout from /proc/sys/lustre/timeout, recovery will be 2.5 times
this value. Assuming the worst, 2.5 times at_max (15000 sec), I still don't
arrive at 21000 sec!
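
The arithmetic behind that remark can be sketched as follows. This is only an illustration: the factor 2.5 is the one quoted above for the static timeout, applying it to at_max is this post's own worst-case guess rather than a documented rule, and 20721 is the largest countdown value in the quoted log below. The function name `recovery_window` is made up for the example.

```python
# Factor quoted in this thread for the static timeout. Applying it to at_max
# is the post's worst-case assumption, not a documented Lustre rule.
RECOVERY_FACTOR = 2.5


def recovery_window(timeout_s: float, factor: float = RECOVERY_FACTOR) -> float:
    """Return the recovery window implied by a timeout, in seconds."""
    return factor * timeout_s


static_timeout = 1000  # /proc/sys/lustre/timeout
at_max = 6000          # /sys/module/ptlrpc/parameters/at_max
observed = 20721       # largest countdown value seen in the MDS log

static_window = recovery_window(static_timeout)
worst_case = recovery_window(at_max)
implied_factor = observed / at_max  # factor that would be needed to explain the log

print(f"window from static timeout: {static_window:.0f} s")
print(f"worst case from at_max:     {worst_case:.0f} s")
print(f"observed countdown:         {observed} s")
print(f"implied factor on at_max:   {implied_factor:.2f}")
```

Neither window reaches the observed countdown, so something other than a plain 2.5 multiple of either setting must have been at work.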

So I'm quite clueless as to what mistakes I have made here.

Btw, when trying to find out about connected/disconnected clients, I ran
"lctl conn_list", which gave me a very long listing (how do you do
"| less" in the lctl shell?), with all entries marked as "nonagle".
What does that mean?

Oh, one last remark for the record: to run this "lctl abort_recovery"
command, you have to find out the right device number or name. "lctl dl"
gives me five entries on my MGS/MDT server: "mgs", "mgc", "mdt", "lov" and
"mds". The correct device name for the lctl command is the one listed after
"mds".

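For anyone scripting the lookup, here is a minimal sketch of picking the device name out of "lctl dl" output. It assumes each line has whitespace-separated columns with the device type third and the device name fourth, which matches what I see here but may differ on other versions. The sample listing and the helper `device_name` are made up for illustration, and the script only builds the command, it does not run it.

```python
from typing import Optional

# Hypothetical "lctl dl" listing; device names and UUIDs are invented.
SAMPLE_DL = """\
  0 UP mgs MGS MGS 5
  1 UP mgc MGC10.0.0.1@tcp example-mgc-uuid 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4
  4 UP mds lustre-MDT0000 lustre-MDT0000_UUID 417
"""


def device_name(dl_output: str, dev_type: str = "mds") -> Optional[str]:
    """Return the name listed after the given device type, or None if absent."""
    for line in dl_output.splitlines():
        fields = line.split()
        # Assumed layout: index, status, type, name, uuid, refcount
        if len(fields) >= 4 and fields[2] == dev_type:
            return fields[3]
    return None


name = device_name(SAMPLE_DL)
if name is not None:
    # Same form as the command used at the top of this mail; printed, not executed.
    command = ["lctl", "--device", name, "abort_recovery"]
    print(" ".join(command))
```

On a real server the listing would come from running "lctl dl" on the MDS instead of from the sample string.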
Regards,
Thomas

Thomas Roth wrote:
> Hi all,
> 
> we have a problem with our production system (v. 1.6.5.1). It is in
> recovery, but recovery never finishes.
> The background is a series of unexplained problems with the MDT, attempts to
> restart the MDS, etc. The MDT would start recovery, at some point during
> recovery lose connection to its OSTs, restart recovery, and so on.
> 
> I then moved the service to a partner machine, where recovery started with
>>> 11:37:07: ... in recovery for at least 5:00, or until 415 clients
> reconnect.
> 
> (I always understood these numbers as minutes; the
> /proc/.../recovery_status usually starts at 3000 sec, though, and 5 min
> would be a little less...)
> 
> The countdown went on until
>>> 12:03:32:  ...227 clients in recovery for 1457s
> 
> Four minutes later, there were
>>> 12:07:21: ...133 recoverable clients remain
> 
> Then something bad must have happened, because
>>> 12:07:42:  ...121 clients in recovery for 20721s
> 
> Most of these clients seemed to be no problem, because only 4 minutes later
>>> 12:11:52:  ...1 clients in recovery for 20471s
> 
> So far, the countdown continues, but of course these are extremely long
> recovery times.
> 
> My questions:
> Where might I have misconfigured the system to wait that much for a client?
> Is there a command to abort the recovery?
> 
> All the OSTs seem to be connected and happy. I therefore guess that the
> remaining client is just one client in the usual sense - a batch node
> or similar machine that still has the system mounted. Of course I would
> not hesitate to kick out that client - or many of these if necessary -
> but I don't know which it is. So another question: How can I find out
> the identities of clients that are recoverable/in recovery/without
> problems/gone for good?
> 
> 
> Many thanks,
> Thomas
> 
> 

-- 
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528

Geschäftsführer: Professor Dr. Horst Stöcker

Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt