[Lustre-discuss] no failover with failover MDS

Thomas Roth t.roth at gsi.de
Sat Sep 18 11:51:30 PDT 2010


Hi all,

We have two servers, A and B, as a failover MGS/MDT pair, with IPs 
A=10.12.112.28 and B=10.12.115.120 over tcp.
When server B crashes, MGS and MDT are mounted on A. Recovery times out 
with only one out of 445 clients recovered.
Afterwards, the MDT lists all its OSTs as UP and in the logs of the OSTs 
I see:

Lustre: MGC10.12.112.28@tcp: Connection restored to service MGS using 
nid 10.12.112.28@tcp.
Lustre: lustre-OST008d: received MDS connection from 10.12.112.28@tcp

So far so good.

However, no client will reconnect, nor will a client connect to server A 
when freshly mounted!

I do "mount -t lustre 10.12.112.28:10.12.115.120:/lustre /mp"

and get:

Lustre:     Lustre Version: 1.8.4
Lustre:     Build Version: 1.8.4-19700101010000-PRISTINE-2.6.26-2-amd64
Lustre: Added LNI 10.12.68.195@tcp [8/256/0/180]
Lustre: Accept secure, port 988
Lustre: Lustre Client File System; http://www.lustre.org/
Lustre: MGC10.12.112.28@tcp: Reactivating import
Lustre: 14530:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request 
x1347247522447397 sent from gsilust-MDT0000-mdc-ffff81033d489400 to NID 
10.12.115.120@tcp 5s ago has timed out (5s prior to deadline).
    req@ffff8103312da400 x1347247522447397/t0 
o38->gsilust-MDT0000_UUID@10.12.115.120@tcp:12/10 lens 368/584 e 0 to 1 
dl 1284835365 ref 1 fl Rpc:N/0/0 rc 0/0

Obviously the clients stubbornly try to connect to the failed server, 
10.12.115.120.
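
To check whether this holds across more than one client log, the timed-out 
requests can be tallied per target NID. The following is only a rough sketch: 
the function name timed_out_nids and the regular expression are illustrative 
assumptions, not part of any Lustre tool, and the sample assumes the usual 
"@tcp" NID notation as logged by the kernel.

```python
import re
from collections import Counter

# Hypothetical pattern: matches the "sent from <import> to NID <nid> ... has
# timed out" part of a ptlrpc_expire_one_request() message. It is derived from
# the single message quoted above and may not cover other variants.
TIMEOUT_RE = re.compile(
    r"sent from (?P<imp>\S+) to NID (?P<nid>\S+) .*?has timed out"
)

def timed_out_nids(log_text):
    """Count timed-out requests per (import, NID) pair in a client log."""
    # Collapse all whitespace so messages wrapped across lines match again.
    flat = " ".join(log_text.split())
    return Counter(
        (m.group("imp"), m.group("nid")) for m in TIMEOUT_RE.finditer(flat)
    )

# Sample taken from the client log quoted in this message.
sample = """Lustre: 14530:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1347247522447397 sent from gsilust-MDT0000-mdc-ffff81033d489400 to NID
10.12.115.120@tcp 5s ago has timed out (5s prior to deadline)."""

for (imp, nid), count in timed_out_nids(sample).items():
    print(f"{imp} -> {nid}: {count} timeout(s)")
```

If every entry points at 10.12.115.120@tcp and none at 10.12.112.28@tcp, the 
clients are never trying the failover NID for the MDT at all.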

I'm sure failover has worked before: server A had its problems last 
January, the MDT was moved to B, and B has served the fs ever since.
No apparent changes were introduced in the meantime, so now I am at a loss.

Yours,
Thomas



