[Lustre-discuss] Mount takes long time after abnormal shutdown of MDS/OSS

Mon May 27 17:38:47 PDT 2013

On 2013-05-27, at 9:00, "Chan Ching Yu, Patrick" <cychan at clustertech.com> wrote:

In my testing environment, there is one MDS/OSS server and one Lustre client, running CentOS 6.3. Lustre 2.1.5 is used.

I powered off the MDS/OSS server abnormally while the Lustre filesystem was still mounted on the Lustre client.
Then I powered off the Lustre client and started the MDS/OSS server and the Lustre client again. However, the Lustre client takes a long time to mount.

This is expected behavior if the clients are shut down after the servers. Since Lustre clients may have recovery state after a server crash, the servers wait after restart for the clients to reconnect and perform recovery.  This can happen relatively quickly if all of the clients are available.

If the clients have been rebooted, the servers will still wait for the old clients to reconnect, but this never happens. New clients are prevented from connecting during recovery so that they do not modify the filesystem in a way that is incompatible with what the old clients previously did.

If you know the old clients are not available, you can mount the servers with "-o abort_recov" to skip this delay.
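
As a rough sketch of what that mount invocation looks like, the Python snippet below only builds and prints the command line; it does not run it. The option is spelled "abort_recov" in mount.lustre. The device path /dev/sdb1 and the mount point /mnt/mdt are hypothetical placeholders, not values from this thread, so substitute your own target device and mount point.

```python
import shlex


def abort_recovery_mount_cmd(device, mount_point):
    """Build the argv for mounting a Lustre target with recovery aborted.

    Nothing is executed here; the caller decides whether to run it.
    """
    return ["mount", "-t", "lustre", "-o", "abort_recov", device, mount_point]


# Hypothetical device and mount point, for illustration only.
cmd = abort_recovery_mount_cmd("/dev/sdb1", "/mnt/mdt")

# Print a shell-safe rendering of the command for review.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Aborting recovery discards whatever state the old clients might have replayed, so it only makes sense when you are sure those clients are gone.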

The following repeated messages are generated on console:
May 27 22:06:33 node1 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.8.1@tcp. The mds_connect operation failed with -16

-16 is -EBUSY, which means the servers are busy during recovery, and are blocking new clients from reconnecting until recovery is finished.
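
If you want to decode such codes yourself, here is a minimal Python sketch that pulls the number out of the console message and looks it up in the standard errno table. It assumes Linux errno numbering (where 16 is EBUSY) and that the message ends with "operation failed with -N" as in the line quoted above; the helper name decode_failure is made up for this example.

```python
import errno
import os
import re

# Console message from this thread, without the syslog prefix.
LOG_LINE = (
    "LustreError: 11-0: an error occurred while communicating with "
    "192.168.8.1@tcp. The mds_connect operation failed with -16"
)


def decode_failure(line):
    """Return (code, symbolic name, description) for the failure in line.

    Returns None if the line does not contain a negative error code
    in the expected "operation failed with -N" form.
    """
    match = re.search(r"operation failed with -(\d+)", line)
    if match is None:
        return None
    code = int(match.group(1))
    name = errno.errorcode.get(code, "UNKNOWN")
    return code, name, os.strerror(code)


print(decode_failure(LOG_LINE))
```

On a Linux host the symbolic name comes back as EBUSY, matching the explanation above.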

Thanks.
CY
