[Lustre-discuss] Failover & recovery issues / questions

Adam Gandelman gandelman.a at gmail.com
Mon Mar 30 16:38:17 PDT 2009


Hi-

I'm new to Lustre and am running into some issues with failover and 
recovery that I can't seem to find answers to in the Lustre manual 
(v1.14).  If anyone can fill me in on what is going on (or not going 
on), or point me toward some documentation that goes into more detail, 
it would be greatly appreciated.

It's a simple cluster at the moment:

MDT/MGS data is collocated on node LUS-MDT

LUS-OSS0 and LUS-OSS1 are set up as an active/active failover pair: 
LUS-OSS0 is primary for /dev/drbd1 and backup for /dev/drbd2, while 
LUS-OSS1 is primary for /dev/drbd2 and backup for /dev/drbd1.  I have 
heartbeat configured to monitor the nodes and handle failover; however, 
I run into the same problems when testing failover manually.
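For reference, my manual failover test amounts to roughly the following 
(the DRBD resource name and mount point are placeholders, not the real 
ones from my config):

```shell
# Manual failover of one OST from LUS-OSS0 to LUS-OSS1.
# "r1" and /mnt/ost1 are assumed names; adjust to the actual layout.

# On LUS-OSS0 (current primary): stop serving the OST, demote DRBD.
umount /mnt/ost1
drbdadm secondary r1

# On LUS-OSS1 (backup): promote DRBD and mount the OST as a Lustre
# target, which should put it into recovery until clients reconnect.
drbdadm primary r1
mount -t lustre /dev/drbd1 /mnt/ost1
```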

When heartbeat is killed on either OSS and the resources fail over to 
the backup, or when the filesystem is manually unmounted and remounted 
on the backup node, the migrated OST either (1) goes into a state of 
endless recovery, or (2) doesn't seem to enter recovery at all and 
becomes inactive on the cluster entirely.  If I bring the OST's primary 
node back up and fail the resources back, the OST goes into recovery, 
completes it, and comes back online as it should.

For example, if I take down OSS0, the OST fails over to its backup; 
however, it never makes it past this state and never recovers:

[root@lus-oss0 ~]# cat /proc/fs/lustre/obdfilter/lustre-OST0000/recovery_status
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/4
completed_clients: 0/4
replayed_requests: 0/??
queued_requests: 0
next_transno: 2002
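To keep an eye on this while testing, I've been summarizing the status 
file with a small awk one-liner (shown here against a sample copy of 
the file so the parsing is visible; on the OSS the input would be the 
recovery_status path above):

```shell
# Summarize Lustre recovery progress from a recovery_status file.
# A local sample stands in for
# /proc/fs/lustre/obdfilter/lustre-OST0000/recovery_status.
status_file=$(mktemp)
cat > "$status_file" <<'EOF'
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/4
completed_clients: 0/4
EOF

# Split on spaces and "/" so "0/4" becomes two fields.
awk -F'[ /]+' '
  /^status:/            { printf "state=%s ",        $2 }
  /^connected_clients:/ { printf "connected=%s/%s ", $2, $3 }
  /^completed_clients:/ { printf "completed=%s/%s\n", $2, $3 }
' "$status_file"
# prints: state=RECOVERING connected=0/4 completed=0/4
rm -f "$status_file"
```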

In some instances, /proc/fs/lustre/obdfilter/lustre-OST0000/ is empty.  
As I said, when the primary node comes back online and the resources 
are migrated back, the OST goes into recovery fine, completes it, and 
comes back online.

Here is the log output on the secondary node after failover:

Lustre: 13290:0:(filter.c:867:filter_init_server_data()) RECOVERY: 
service lustre-OST0000, 4 recoverable clients, last_rcvd 2001
Lustre: lustre-OST0000: underlying device drbd2 should be tuned for 
larger I/O requests: max_sectors = 64 could be up to max_hw_sectors=255
Lustre: OST lustre-OST0000 now serving dev 
(lustre-OST0000/1ff44d23-d13a-b0c6-48e1-36c104ea6752), but will be in 
recovery for at least 5:00, or until 4 clients reconnect. During this 
time new clients will not be allowed to connect. Recovery progress can 
be monitored by watching 
/proc/fs/lustre/obdfilter/lustre-OST0000/recovery_status.
Lustre: Server lustre-OST0000 on device /dev/drbd2 has started
Lustre: Request x8184 sent from lustre-OST0000-osc-c6cedc00 to NID 
192.168.10.23@tcp 100s ago has timed out (limit 100s).
Lustre: lustre-OST0000-osc-c6cedc00: Connection to service 
lustre-OST0000 via nid 192.168.10.23@tcp was lost; in progress 
operations using this service will wait for recovery to complete.
Lustre: 3983:0:(import.c:410:import_select_connection()) 
lustre-OST0000-osc-c6cedc00: tried all connections, increasing latency to 6s
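As an aside, I assume the "should be tuned for larger I/O requests" 
warning above can be addressed by raising the request-size limit on the 
backing block device before mounting; the path and value below are my 
guesses based on the max_hw_sectors=255 the log reports, not something 
I've confirmed:

```shell
# Check the hardware ceiling and raise the current limit toward it.
# Device name and sysfs path are assumptions; the tunable is in KB
# (255 sectors of 512 bytes is roughly 127 KB).
cat /sys/block/drbd2/queue/max_hw_sectors_kb
echo 127 > /sys/block/drbd2/queue/max_sectors_kb
```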

More information about the lustre-discuss mailing list