<HTML>

<HEAD>

<TITLE>Re: [Lustre-discuss] Recovery fails if clients not connected</TITLE>

</HEAD>

<BODY>

<FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:12.0px'><BR>

Hi Roger,<BR>

<BR>

I believe you can connect the OSSs once the MDS has booted; in fact, I’m pretty sure the five in ‘connected_clients: 0/5’ are your OSS nodes. Each OST maintains a connection to the MDS while the file system is mounted, so the OSTs are included in the connection count on the MDS.<BR>

<BR>

Regardless of the recovery state, if your MDS is online and the MDT is mounted, you can start the OSS nodes and their OSTs at any time. Clients attempting I/O will block (or fail, depending on the MDS config) until the missing nodes come back online.<BR>
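<BR>
If you want to watch whether the count moves as the OSTs mount, here is a minimal sketch, assuming Python is available on the MDS. It parses the key/value lines of a recovery_status file and returns the connected_clients counts. The function names are hypothetical, and the sample text is copied from the output quoted below.<BR>
<PRE>
```python
def parse_recovery_status(text):
    """Parse 'key: value' lines from recovery_status text into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip blank lines and anything without a colon
            fields[key.strip()] = value.strip()
    return fields


def connected_counts(fields):
    """Return (connected, expected) from a value such as '0/5'."""
    connected, expected = fields["connected_clients"].split("/")
    return int(connected), int(expected)


# Sample copied from the quoted recovery_status output.
SAMPLE = """status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/5
completed_clients: 0/5
replayed_requests: 0/??
queued_requests: 0
next_transno: 17768
"""

fields = parse_recovery_status(SAMPLE)
print(fields["status"], connected_counts(fields))
```
</PRE>
On a live system the text would come from reading /proc/fs/lustre/mds/tacc-MDT0000/recovery_status instead of SAMPLE.<BR>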

<BR>

hth,<BR>

Klaus<BR>

<BR>

<BR>

On 1/20/09 3:05 PM, "Roger Spellman" &lt;roger@terascala.com&gt; etched on stone tablets:<BR>

<BR>

</SPAN></FONT><BLOCKQUOTE><FONT SIZE="4"><FONT FACE="Courier New"><SPAN STYLE='font-size:13.0px'>I have 2 MDS, configured as an active/standby pair.  I have 5 OSTs that are NOT active/standby.  I have 5 clients.<BR>

 <BR>

I am using Lustre 1.6.5 because of bug 18232 (https://bugzilla.lustre.org/show_bug.cgi?id=18232), which only affects 1.6.6.  Using Lustre 1.6.5, when I reset my active node, the standby takes over.  This is quite reliable.<BR>

 <BR>

Today, I did the following in this order:<BR>

  Unmounted all the clients<BR>

  Rebooted all the clients<BR>

  Stopped Linux HA from running<BR>

  Unmounted the OSTs<BR>

  Unmounted the MDS<BR>

  Rebooted the OSTs<BR>

  Rebooted both MDSes<BR>

 <BR>

When the MDSes started up, Linux HA chose one to be active.  That system mounted the MDT.<BR>

 <BR>

I looked at the file  /proc/fs/lustre/mds/tacc-MDT0000/recovery_status, and it showed:<BR>

 <BR>

[root@ts-tacc-01 ~]# cat /proc/fs/lustre/mds/tacc-MDT0000/recovery_status <BR>

status: RECOVERING<BR>

recovery_start: 0<BR>

time_remaining: 0<BR>

connected_clients: 0/5<BR>

completed_clients: 0/5<BR>

replayed_requests: 0/??<BR>

queued_requests: 0<BR>

next_transno: 17768<BR>

 <BR>

 <BR>

***** Note that recovery_start and time_remaining are both zero. *****<BR>

 <BR>

I waited several minutes, and the file was unchanged.<BR>

 <BR>

I was waiting for recovery to complete before trying to mount the OSTs.  However, it appears that this would never occur!<BR>

 <BR>

Does this look like a bug? <BR>

 <BR>

---------------------------<BR>

 <BR>

I format my MDT using the following command.  The command is run from 10.2.43.1, and the failnode is 10.2.43.2:<BR>

 <BR>

mkfs.lustre --reformat --fsname tacc --mdt --mgs --device-size=10000000 --mkfsoptions=' -m 0 -O mmp' --failnode=10.2.43.2@o2ib0 /dev/sdb<BR>

 <BR>

I format the OSTs using the following command:<BR>

 <BR>

/usr/bin/time -p mkfs.lustre --reformat --ost --mkfsoptions='-J device=/dev/sdc1 -m 0' --fsname tacc --device-size=400000000 --mgsnode=10.2.43.1@o2ib0 --mgsnode=10.2.43.2@o2ib0 /dev/sdb<BR>

 <BR>

I mount the clients using:<BR>

 <BR>

mount -t lustre 10.2.43.1@o2ib:10.2.43.2@o2ib:/tacc /mnt/lustre<BR>

</SPAN></FONT><SPAN STYLE='font-size:13.0px'><FONT FACE="Arial"> <BR>

</FONT></SPAN></FONT><FONT FACE="Verdana, Helvetica, Arial"><SPAN STYLE='font-size:12.0px'><BR>

</SPAN></FONT><FONT SIZE="2"><FONT FACE="Monaco, Courier New"><SPAN STYLE='font-size:10.0px'>

</SPAN></FONT></FONT></BLOCKQUOTE><FONT SIZE="2"><FONT FACE="Monaco, Courier New"><SPAN STYLE='font-size:10.0px'><BR>

</SPAN></FONT></FONT>

</BODY>

</HTML>