[Lustre-discuss] Failover & recovery issues / questions

Jeffrey Bennett jab at sdsc.edu
Mon Mar 30 17:16:30 PDT 2009


Hi, I am not familiar with using heartbeat on the OSS; I have only used it on the MDS for failover, since you can't have an active/active configuration on the MDS. You can, however, have active/active on the OSS, so I don't understand why you would want heartbeat to unmount the OSTs on one system if you can have them mounted on both.
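
For what it's worth, the MDS pair here is a plain active/passive heartbeat resource; in v1-style haresources terms it boils down to a single line along these lines (the device and mount point are only placeholders, not real values):

  lus-mdt1 Filesystem::/dev/sdb1::/mnt/mdt::lustre

Only the node named in that line mounts the MDT; the peer takes it over when heartbeat decides the first node is gone.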

Also, when you say you "kill" heartbeat, what do you mean exactly? You can't test heartbeat failover just by killing the daemon; you have to use the tools it provides to fail resources over to the other node. Which tool and which parameters depend on the heartbeat version you are running.
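
If you are in v1/R1 (haresources) mode, those tools are the hb_standby and hb_takeover scripts shipped with heartbeat; the exact path varies by version and distro, but it is usually something like:

  /usr/lib/heartbeat/hb_standby    # hand the local resources to the peer
  /usr/lib/heartbeat/hb_takeover   # pull the peer's resources onto this node

Newer releases also accept an optional all/foreign/local argument to limit which resource group moves.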

Do you have a serial connection or a crossover cable between these machines for the heartbeat traffic, or does it run over the regular network?
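
The reason I ask is that the communication path is declared in ha.cf; a dedicated link usually looks something like this (interface and device names here are just examples):

  serial /dev/ttyS0    # null-modem cable between the nodes
  baud 19200
  # or, for a crossover ethernet cable:
  bcast eth1
  keepalive 2
  deadtime 30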

jab  

> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org 
> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of 
> Adam Gandelman
> Sent: Monday, March 30, 2009 4:38 PM
> To: lustre-discuss at lists.lustre.org
> Subject: [Lustre-discuss] Failover & recovery issues / questions
> 
> Hi-
> 
> I'm new to Lustre and am running into some issues with failover 
> and recovery that I can't seem to find answers to in the 
> Lustre manual (v1.14).  If anyone can fill me in as to what 
> is going on (or not going on), or point me toward some 
> documentation that goes into more detail it would be greatly 
> appreciated. 
> 
> It's a simple cluster at the moment:
> 
> MDT/MGS data is collocated on node LUS-MDT
> 
> LUS-OSS0 and LUS-OSS1 are set up in an active/active failover setup. 
> LUS-OSS0 is primary for /dev/drbd1 and backup for /dev/drbd2, 
> LUS-OSS1 is primary for /dev/drbd2 and backup for /dev/drbd1. 
> I have heartbeat configured to monitor and handle failover; 
> however, I run into the same problems when manually testing failover.
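> 
> In haresources terms the layout is roughly this (the DRBD resource 
> names and mount points below are placeholders for the real ones):
> 
> lus-oss0 drbddisk::r1 Filesystem::/dev/drbd1::/mnt/ost0::lustre
> lus-oss1 drbddisk::r2 Filesystem::/dev/drbd2::/mnt/ost1::lustre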
> 
> When heartbeat is killed on either OSS and its resources fail 
> over to the backup, or when the filesystem is manually 
> unmounted and remounted on the backup node, the migrated OST 
> either (1) goes into a state of endless recovery, or (2) doesn't 
> seem to go into recovery at all and becomes inactive on the 
> cluster entirely.  If I bring the OST's primary back up and 
> fail back the resources, the OST goes into recovery, 
> completes, and comes back online as it should.
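> 
> The manual test is nothing fancy; with placeholder DRBD resource and 
> mount point names, it is roughly:
> 
> # on lus-oss0: stop serving the OST backed by drbd1
> umount /mnt/ost0
> drbdadm secondary r1
> # on lus-oss1: promote the backup copy and mount it
> drbdadm primary r1
> mount -t lustre /dev/drbd1 /mnt/ost0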
> 
> For example, if I take down OSS0, the OST fails over to its 
> backup; however, it never makes it past this point and never recovers:
> 
> [root@lus-oss0 ~]# cat
> /proc/fs/lustre/obdfilter/lustre-OST0000/recovery_status
> status: RECOVERING
> recovery_start: 0
> time_remaining: 0
> connected_clients: 0/4
> completed_clients: 0/4
> replayed_requests: 0/??
> queued_requests: 0
> next_transno: 2002
> 
> In some instances, /proc/fs/lustre/obdfilter/lustre-OST0000/ 
> is empty.  
> Like I said, when the primary node comes back online and 
> resources are migrated back, the OST goes into recovery fine, 
> completes and comes back up online.
> 
> Here is the log output on the secondary node after failover:
> 
> Lustre: 13290:0:(filter.c:867:filter_init_server_data()) RECOVERY: 
> service lustre-OST0000, 4 recoverable clients, last_rcvd 2001
> Lustre: lustre-OST0000: underlying device drbd2 should be 
> tuned for larger I/O requests: max_sectors = 64 could be up 
> to max_hw_sectors=255
> Lustre: OST lustre-OST0000 now serving dev 
> (lustre-OST0000/1ff44d23-d13a-b0c6-48e1-36c104ea6752), but 
> will be in recovery for at least 5:00, or until 4 clients 
> reconnect. During this time new clients will not be allowed 
> to connect. Recovery progress can be monitored by watching 
> /proc/fs/lustre/obdfilter/lustre-OST0000/recovery_status.
> Lustre: Server lustre-OST0000 on device /dev/drbd2 has started
> Lustre: Request x8184 sent from lustre-OST0000-osc-c6cedc00 
> to NID 192.168.10.23@tcp 100s ago has timed out (limit 100s).
> Lustre: lustre-OST0000-osc-c6cedc00: Connection to service 
> lustre-OST0000 via nid 192.168.10.23@tcp was lost; in 
> progress operations using this service will wait for recovery 
> to complete.
> Lustre: 3983:0:(import.c:410:import_select_connection())
> lustre-OST0000-osc-c6cedc00: tried all connections, 
> increasing latency to 6s
> 
> 
> 
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
> 

