[Lustre-discuss] lustre using wrong network

Michael Di Domenico mdidomenico4 at gmail.com
Thu Jun 18 18:11:50 PDT 2009


I cannot figure out what exactly has happened here and how to recover from it.

Jun 18 21:02:52 node0-eth1 kernel: LustreError:
2722:0:(socklnd_cb.c:2156:ksocknal_recv_hello()) Error -104 reading
HELLO from 192.168.0.248
Jun 18 21:02:52 node0-eth1 kernel: LustreError: 11b-b: Connection to
192.168.0.248@tcp at host 192.168.0.248 on port 988 was reset: is it
running a compatible version of Lustre and is 192.168.0.248@tcp one of
its NIDs?

For some reason, when I mount the OST on the above node, it tries to
connect to itself on eth0, even though I have networks=tcp0(eth1) in
my modprobe.conf and the NID is set to 192.168.1.248.
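
The setting described above is normally a single "options lnet" line in
modprobe.conf. As a minimal sketch for double-checking what that line binds
LNet to, the Python below parses it into a network-to-interface mapping. The
helper name parse_lnet_networks is hypothetical (it is not part of Lustre),
and it only handles the simple net(iface) form, not ip2nets or routes.

```python
import re

def parse_lnet_networks(line):
    """Return {network: [interfaces]} from an 'options lnet networks=...' line.

    Hypothetical helper for illustration only; it handles the simple
    'net(iface[,iface])' form and ignores ip2nets and routes.
    """
    match = re.search(r"\bnetworks=(\S+)", line)
    if match is None:
        return {}
    value = match.group(1).strip("\"'")
    result = {}
    # Each entry looks like tcp0(eth1); the interface list is optional.
    for net, ifaces in re.findall(r"([A-Za-z0-9]+)(?:\(([^)]*)\))?", value):
        result[net] = [i for i in ifaces.split(",") if i]
    return result

print(parse_lnet_networks("options lnet networks=tcp0(eth1)"))
```

A line without a networks= option yields an empty dict, so this also catches
a misspelled or missing option name.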

Jun 18 21:02:52 node0-eth1 kernel: Lustre: Client data1-client has started
Jun 18 21:02:52 node7-eth0 kernel: LustreError: 120-3: Refusing
connection from 192.168.0.50 for 192.168.0.248@tcp: No matching NI

I'm trying to mount a client now, and for some reason it's using eth0,
even though modprobe.conf says network=tcp0(eth1)
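
To see at a glance which NIDs in the logs sit on the wrong network, a small
filter like the one below can help. It is only a sketch: the function name
nids_outside is hypothetical, and the 192.168.1.0/24 subnet for eth1 is an
assumption based on the 192.168.1.248 NID (the actual netmask is not shown
in this post).

```python
import ipaddress
import re

# Matches a NID such as 192.168.0.248@tcp; the " at " alternative covers
# list archives that rewrite "@" as " at ".
NID_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})(?:@| at )(tcp\d*)\b")

def nids_outside(lines, expected_subnet):
    """Return the sorted, unique NIDs whose address is not in expected_subnet."""
    subnet = ipaddress.ip_network(expected_subnet)
    found = set()
    for line in lines:
        for addr, net in NID_RE.findall(line):
            if ipaddress.ip_address(addr) not in subnet:
                found.add(f"{addr}@{net}")
    return sorted(found)

# Two messages taken from the logs in this post.
sample = [
    "LustreError: 120-3: Refusing connection from 192.168.0.50 "
    "for 192.168.0.248@tcp: No matching NI",
    "Lustre: Request x1002438670 sent from data1-OST0005-osc-f4070600 "
    "to NID 192.168.0.248 at tcp 5s ago has timed out (limit 5s).",
]
# Assumed eth1 subnet; adjust to the real netmask.
print(nids_outside(sample, "192.168.1.0/24"))
```

Both sample messages refer to the same eth0-side NID, so it is reported once.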

Jun 18 21:02:57 node0-eth1 kernel: Lustre: Request x1002438670 sent
from data1-OST0005-osc-f4070600 to NID 192.168.0.248@tcp 5s ago has
timed out (limit 5s).
Jun 18 21:03:05 node7-eth0 kernel: LustreError: 120-3: Refusing
connection from 192.168.0.254 for 192.168.0.248@tcp: No matching NI
Jun 18 21:03:05 node1-eth0 kernel: Lustre:
2527:0:(import.c:508:import_select_connection()) data1-OST0005-osc:
tried all connections, increasing latency to 30s
Jun 18 21:03:05 node1-eth0 kernel: Lustre:
2527:0:(import.c:508:import_select_connection()) Skipped 2 previous
similar messages
Jun 18 21:03:05 node1-eth0 kernel: LustreError:
2520:0:(socklnd_cb.c:2156:ksocknal_recv_hello()) Error -104 reading
HELLO from 192.168.0.248
Jun 18 21:03:05 node1-eth0 kernel: LustreError: 11b-b: Connection to
192.168.0.248@tcp at host 192.168.0.248 on port 988 was reset: is it
running a compatible version of Lustre and is 192.168.0.248@tcp one of
its NIDs?
Jun 18 21:03:10 node0-eth1 ntpd[2321]: synchronized to 204.9.136.253, stratum 2
Jun 18 21:03:17 node0-eth1 kernel: Lustre:
2727:0:(import.c:508:import_select_connection())
data1-OST0005-osc-f4070600: tried all connections, increasing latency
to 5s
Jun 18 21:03:17 node0-eth1 kernel: LustreError:
2719:0:(socklnd_cb.c:2156:ksocknal_recv_hello()) Error -104 reading
HELLO from 192.168.0.248
Jun 18 21:03:17 node0-eth1 kernel: LustreError: 11b-b: Connection to
192.168.0.248@tcp at host 192.168.0.248 on port 988 was reset: is it
running a compatible version of Lustre and is 192.168.0.248@tcp one of
its NIDs?
Jun 18 21:03:17 node7-eth0 kernel: LustreError: 120-3: Refusing
connection from 192.168.0.50 for 192.168.0.248@tcp: No matching NI
Jun 18 21:03:27 node0-eth1 kernel: Lustre: Request x1002438682 sent
from data1-OST0005-osc-f4070600 to NID 192.168.0.248@tcp 10s ago has
timed out (limit 10s).
Jun 18 21:03:33 node3-eth0 ntpd[2315]: synchronized to 192.168.0.50, stratum 3

Then I get into this vicious cycle where the filesystem will mount,
but I'm unable to df or ls it.


