[Lustre-discuss] no failover with failover MDS

Thomas Roth t.roth at gsi.de
Sat Sep 18 11:51:30 PDT 2010

Hi all,

we have two servers A, B as a failover MGS/MDT pair, with IPs 
A=  and B= over tcp.
When server B crashes, MGS and MDT are mounted on A. Recovery times out 
with only one out of 445 clients recovered.
Afterwards, the MDT lists all its OSTs as UP and in the logs of the OSTs 
I see:

Lustre: MGC10.12.112.28 at tcp: Connection restored to service MGS using 
nid at tcp.
Lustre: lustre-OST008d: received MDS connection from at tcp

So far so good.

However, no client will reconnect, nor will a client connect to server A 
when freshly mounted!

I do "mount -t lustre /mp"

and get:

Lustre:     Lustre Version: 1.8.4
Lustre:     Build Version: 1.8.4-19700101010000-PRISTINE-2.6.26-2-amd64
Lustre: Added LNI at tcp [8/256/0/180]
Lustre: Accept secure, port 988
Lustre: Lustre Client File System; http://www.lustre.org/
Lustre: MGC10.12.112.28 at tcp: Reactivating import
Lustre: 14530:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request 
x1347247522447397 sent from gsilust-MDT0000-mdc-ffff81033d489400 to NID at tcp 5s ago has timed out (5s prior to deadline).
    req at ffff8103312da400 x1347247522447397/t0 
o38->gsilust-MDT0000_UUID at lens 368/584 e 0 to 1 
dl 1284835365 ref 1 fl Rpc:N/0/0 rc 0/0

Obviously the clients stubbornly keep trying to connect to the failed server.

I'm sure the failover has worked before, since server A had its problems
last January, when the MDT was moved to B, which has served the fs ever
since. No apparent changes were introduced in the meantime, so now I am
at a loss.


