[Lustre-discuss] 1.8 client loses contact to 1.6 router

Michael Kluge Michael.Kluge at tu-dresden.de
Fri Feb 3 05:34:18 PST 2012


Hi list,

We have a 1.6.7 file system that still works nicely. One node exports this FS
(via 10GE) to another cluster that has some 1.8.5 patchless clients. At
seemingly random points, these clients mark the router as down (as shown by
"lctl show_route"). It is a different client each time, and usually a few
clients per week are affected. Although we configured the clients to ping the
router again from time to time, the route never comes back. On these clients
I can still "ping" the IP of the router, but "lctl ping" gives me an
Input/Output error. If I do something like:

lctl --net o2ib set_route 172.30.128.241@tcp1 down
sleep 45
lctl --net o2ib del_route 172.30.128.241@tcp1
sleep 45
lctl --net o2ib add_route 172.30.128.241@tcp1
sleep 45
lctl --net o2ib set_route 172.30.128.241@tcp1 up

the route comes back. Sometimes the client then works again, but sometimes it
reports an "unexpected aliveness of peer .." and needs a reboot.
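
The manual sequence above could be wrapped in a small shell function so it can
be run the same way on every affected client. This is only a sketch: LCTL and
DELAY are names made up for this example (they allow a dry run with "echo"),
and the function uses only the lctl subcommands already shown above.

```shell
#!/bin/sh
# Sketch: re-create an LNET route using the same four steps as the manual
# sequence. LCTL and DELAY are hypothetical knobs, not part of Lustre.
LCTL="${LCTL:-lctl}"      # command to run; set LCTL=echo for a dry run
DELAY="${DELAY:-45}"      # seconds to wait between steps

reset_route() {
    net="$1"   # remote network, e.g. o2ib
    gw="$2"    # router NID, e.g. 172.30.128.241@tcp1

    # Stop at the first failing step instead of continuing blindly.
    $LCTL --net "$net" set_route "$gw" down || return 1
    sleep "$DELAY"
    $LCTL --net "$net" del_route "$gw" || return 1
    sleep "$DELAY"
    $LCTL --net "$net" add_route "$gw" || return 1
    sleep "$DELAY"
    $LCTL --net "$net" set_route "$gw" up
}
```

A dry run with "LCTL=echo DELAY=0" followed by
"reset_route o2ib 172.30.128.241@tcp1" prints the four lctl argument lists
without touching the routing table.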

I looked around and could not find a note on whether 1.8 clients and 1.6
routers will work together as expected. Does anyone have experience with this
kind of setup, or an idea for further debugging?
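
To catch the moment a client gets into this state, the symptom described above
(ICMP ping works, "lctl ping" fails) could be checked periodically, e.g. from
cron. A rough sketch, assuming Linux iputils ping options; PING and LCTL are
hypothetical variables introduced here so the commands can be swapped out:

```shell
#!/bin/sh
# Sketch: classify the state of a router as seen from a client.
# PING and LCTL are hypothetical knobs, not part of Lustre.
PING="${PING:-ping -c 1 -W 2}"   # one ICMP echo, 2 s timeout (iputils)
LCTL="${LCTL:-lctl}"

check_router() {
    ip="$1"    # router IP address, e.g. 172.30.128.241
    nid="$2"   # router NID, e.g. 172.30.128.241@tcp1

    if ! $PING "$ip" >/dev/null 2>&1; then
        echo "ip-unreachable"      # plain network problem
    elif ! $LCTL ping "$nid" >/dev/null 2>&1; then
        echo "lnet-unreachable"    # IP works, LNET does not
    else
        echo "ok"
    fi
}
```

Logging the result of "check_router 172.30.128.241 172.30.128.241@tcp1" with a
timestamp would at least show when a client flips to "lnet-unreachable".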


Regards, Michael

modprobe.d/lustre.conf on the 1.8.5 clients:
-----------------------------------------8<------------------------------
options lnet networks=tcp1(eth0)
options lnet routes="o2ib 172.30.128.241@tcp1;"
options lnet dead_router_check_interval=60 router_ping_timeout=30
-----------------------------------------8<------------------------------



-- 

Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:    (+49) 351 463-37773
e-mail: michael.kluge at tu-dresden.de
WWW:    http://www.tu-dresden.de/zih


