[Lustre-discuss] Recovery fails if clients not connected
klaus.steden at technicolor.com
Tue Jan 20 17:13:14 PST 2009
I believe you can connect the OSSs once the MDS has booted, and in fact, I'm
pretty sure that the five in 'connected_clients: 0/5' are in fact your
OSS nodes. Each OST maintains a connection to the MDS while the file system
is mounted, so they will be included in the connection count on the MDS.
In any case, if your MDS is online and the MDT is
mounted, you can start up the OSS nodes and corresponding OSTs at any time;
clients attempting to make transactions will have their I/O operations block
(or fail, depending on the MDS config) until the missing nodes come back online.
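For reference, the individual counters can be pulled out of recovery_status with a little awk. This is only a sketch: the here-doc sample reproduces the numbers quoted in the report below, whereas against a live MDS you would cat the real /proc path instead.

```shell
# Sketch: extract fields from a recovery_status dump. The sample below is
# copied from the posted output, not read from a live system.
sample='status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/5
completed_clients: 0/5'

status=$(printf '%s\n' "$sample" | awk '/^status:/ {print $2}')
clients=$(printf '%s\n' "$sample" | awk '/^connected_clients:/ {print $2}')
echo "status=$status connected=$clients"
# prints: status=RECOVERING connected=0/5
```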
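A dry-run sketch of that startup ordering. The device paths and mount points here are placeholders, not taken from any particular setup, and the run() helper only echoes each command rather than executing it, so nothing is actually mounted:

```shell
# Illustrative startup order only; /dev/sdb, /mnt/mdt, and /mnt/ost0 are
# assumed paths. run() echoes instead of executing, so this is a safe dry run.
run() { echo "+ $*"; }

# 1. Mount the MDT on the active MDS first.
run mount -t lustre /dev/sdb /mnt/mdt
# 2. Bring up each OST on its OSS; this can happen any time after the MDT.
run mount -t lustre /dev/sdb /mnt/ost0
# 3. Mount the clients last; their I/O blocks (or fails) until all targets
#    are back.
run mount -t lustre 10.2.43.1@o2ib:10.2.43.2@o2ib:/tacc /mnt/lustre
```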
On 1/20/09 3:05 PM, "Roger Spellman" <roger at terascala.com> etched on stone
> I have 2 MDS, configured as an active/standby pair. I have 5 OSTs that are
> NOT active/standby. I
> have 5 clients.
> I am using Lustre 1.6.5, due to bug 18232
> <https://bugzilla.lustre.org/show_bug.cgi?id=18232> which only affects 1.6.6.
> Using Lustre 1.6.5, when I
> reset my active node, the standby takes over. This is quite reliable.
> Today, I did the following in this order:
> Unmounted all the clients
> Rebooted all the clients
> Stopped Linux HA from running
> Unmounted the OSTs
> Unmounted the MDS
> Rebooted the OSTs
> Rebooted both MDSes
> When the MDSes started up, Linux HA chose one to be active. That system
> mounted the MDT.
> I looked at the file /proc/fs/lustre/mds/tacc-MDT0000/recovery_status, and it showed:
> [root at ts-tacc-01 ~]# cat /proc/fs/lustre/mds/tacc-MDT0000/recovery_status
> status: RECOVERING
> recovery_start: 0
> time_remaining: 0
> connected_clients: 0/5
> completed_clients: 0/5
> replayed_requests: 0/??
> queued_requests: 0
> next_transno: 17768
> ***** Note that recovery_start and time_remaining are both zero. *****
> I waited several minutes, and this file was unchanged.
> I was waiting for recovery to complete before trying to mount the OSTs.
> However, it appears that this would never occur!
> Does this look like a bug?
> I format my MDT using the following command. The command is run from
> 10.2.43.1, and the failnode is 10.2.43.2:
> mkfs.lustre --reformat --fsname tacc --mdt --mgs --device-size=10000000
>   --mkfsoptions='-m 0 -O mmp' --failnode=10.2.43.2@o2ib0 /dev/sdb
> I format the OSTs using the following command:
> /usr/bin/time -p mkfs.lustre --reformat --ost --mkfsoptions='-J device=/dev/sdc1 -m 0'
>   --fsname tacc --device-size=400000000 --mgsnode=10.2.43.1@o2ib0
>   --mgsnode=10.2.43.2@o2ib0 /dev/sdb
> I mount the clients using:
> mount -t lustre 10.2.43.1@o2ib:10.2.43.2@o2ib:/tacc /mnt/lustre