[Lustre-discuss] 1.8 client loses contact to 1.6 router
Michael Kluge
Michael.Kluge at tu-dresden.de
Fri Feb 3 05:34:18 PST 2012
Hi list,
we have a 1.6.7 fs running which still works nicely. One node exports this FS
(via 10GE) to another cluster that has some 1.8.5 patchless clients. These
clients at some point (randomly, I think) mark the router as down (as shown by
"lctl show_route"). It is a different client each time, and usually a few
clients per week are affected. Although we configured the clients to ping the
router again from time to time, the route never comes back. On these clients I
can still "ping" the IP of the router, but "lctl ping" gives me an Input/Output
error. If I do something like:
lctl --net o2ib set_route 172.30.128.241@tcp1 down
sleep 45
lctl --net o2ib del_route 172.30.128.241@tcp1
sleep 45
lctl --net o2ib add_route 172.30.128.241@tcp1
sleep 45
lctl --net o2ib set_route 172.30.128.241@tcp1 up
the route comes back. Sometimes the client then works again, but sometimes the
clients log an "unexpected aliveness of peer .." message and need a reboot.
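For reference, the manual sequence above can be wrapped in a small shell
function. This is only a sketch: the function name reset_route and the LCTL and
SLEEP variables are made up here for illustration (they allow a dry run without
touching LNET), and the default 45-second pause is simply the value from the
manual steps.

```shell
#!/bin/sh
# Sketch of the manual route reset described above.
# LCTL and SLEEP are hypothetical override hooks (not part of Lustre), so the
# sequence can be dry-run with LCTL=echo SLEEP=:
LCTL=${LCTL:-lctl}
SLEEP=${SLEEP:-sleep}

# reset_route NET GATEWAY [PAUSE_SECONDS]
# Marks the route down, deletes it, re-adds it and marks it up again,
# pausing between the steps. Stops at the first failing lctl call.
reset_route() {
    net=$1
    gw=$2
    pause=${3:-45}

    $LCTL --net "$net" set_route "$gw" down || return 1
    $SLEEP "$pause"
    $LCTL --net "$net" del_route "$gw" || return 1
    $SLEEP "$pause"
    $LCTL --net "$net" add_route "$gw" || return 1
    $SLEEP "$pause"
    $LCTL --net "$net" set_route "$gw" up
}
```

On an affected client this would be called as
"reset_route o2ib 172.30.128.241@tcp1". It does not address the "unexpected
aliveness of peer .." case, which still needed a reboot.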
I looked around and could not find a note on whether 1.8 clients and 1.6
routers will work together as expected. Does anyone have experience with this
kind of setup or an idea for further debugging?
Regards, Michael
modprobe.d/lustre.conf on the 1.8.5 clients:
-----------------------------------------8<------------------------------
options lnet networks=tcp1(eth0)
options lnet routes="o2ib 172.30.128.241@tcp1;"
options lnet dead_router_check_interval=60 router_ping_timeout=30
-----------------------------------------8<------------------------------
--
Dr.-Ing. Michael Kluge
Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany
Contact:
Willersbau, Room A 208
Phone: (+49) 351 463-34217
Fax: (+49) 351 463-37773
e-mail: michael.kluge at tu-dresden.de
WWW: http://www.tu-dresden.de/zih