[Lustre-discuss] Recovery fails if clients not connected

Tue Jan 20 17:13:14 PST 2009

Hi Roger,

I believe you can connect the OSSs once the MDS has booted, and in fact, I'm
pretty sure that the five in the 'connected_clients: 0/5' are your OSS nodes.
Each OST maintains a connection to the MDS while the file system is mounted,
so they will be included in the connection count on the MDS.

However, regardless of the recovery state, if your MDS is online and the MDT is
mounted, you can start up the OSS nodes and corresponding OSTs at any time;
clients attempting to make transactions will have their I/O operations block
(or fail, depending on the MDS config) until the missing nodes come back
online.
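
As a rough sketch of how to keep an eye on this while the OSS nodes come up,
the helper below pulls the status and connection count out of a
recovery_status file. The function name recovery_summary is hypothetical; the
field names (status, connected_clients) and the /proc path are taken from the
output quoted below, and other Lustre versions may lay the file out
differently.

```shell
# Print "<status> <connected>/<expected>" from a recovery_status file.
# The path is an argument so the function can also be run on a saved copy.
# Assumes "key: value" lines as in the 1.6.5 output quoted in this thread.
recovery_summary() {
    awk -F': *' '
        $1 == "status"            { status = $2 }
        $1 == "connected_clients" { clients = $2 }
        END { print status, clients }
    ' "$1"
}
```

On the MDS this could be called as
recovery_summary /proc/fs/lustre/mds/tacc-MDT0000/recovery_status
before and after mounting the OSTs, to see whether the connection count moves.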

hth,
Klaus

On 1/20/09 3:05 PM, "Roger Spellman" <roger at terascala.com> etched on stone
tablets:

> I have 2 MDS, configured as an active/standby pair.  I have 5 OSTs that are
> NOT active/standby.  I have 5 clients.
>  
> I am using Lustre 1.6.5, due to bug 18232
> <https://bugzilla.lustre.org/show_bug.cgi?id=18232>, which only affects 1.6.6.
> Using Lustre 1.6.5, when I reset my active node, the standby takes over.
> This is quite reliable.
>  
> Today, I did the following in this order:
>   Unmounted all the clients
>   Rebooted all the clients
>   Stopped Linux HA from running
>   Unmounted the OSTs
>   Unmounted the MDS
>   Rebooted the OSTs
>   Rebooted both MDSes
>  
> When the MDSes started up, Linux HA chose one to be active.  That system
> mounted the MDT.
>  
> I looked at the file  /proc/fs/lustre/mds/tacc-MDT0000/recovery_status, and it
> showed:
>  
> [root@ts-tacc-01 ~]# cat /proc/fs/lustre/mds/tacc-MDT0000/recovery_status
> status: RECOVERING
> recovery_start: 0
> time_remaining: 0
> connected_clients: 0/5
> completed_clients: 0/5
> replayed_requests: 0/??
> queued_requests: 0
> next_transno: 17768
>  
>  
> ***** Note that recovery_start and time_remaining are both zero. *****
>  
> I waited several minutes, and the file was unchanged.
>  
> I was waiting for recovery to complete before trying to mount the OSTs.
> However, it appears that this would never occur!
>  
> Does this look like a bug?
>  
> ---------------------------
>  
> I format my MDT using the following command.  The command is run from
> 10.2.43.1, and the failnode is 10.2.43.2:
>  
> mkfs.lustre --reformat --fsname tacc --mdt --mgs --device-size=10000000 \
>   --mkfsoptions=' -m 0 -O mmp' --failnode=10.2.43.2@o2ib0 /dev/sdb
>  
> I format the OSTs using the following command:
>  
> /usr/bin/time -p mkfs.lustre --reformat --ost \
>   --mkfsoptions='-J device=/dev/sdc1 -m 0' --fsname tacc \
>   --device-size=400000000 --mgsnode=10.2.43.1@o2ib0 \
>   --mgsnode=10.2.43.2@o2ib0 /dev/sdb
>  
> I mount the clients using:
>  
> mount -t lustre 10.2.43.1@o2ib:10.2.43.2@o2ib:/tacc /mnt/lustre
>  
