[Lustre-discuss] Dual NICs issue -- How to enforce Lustre to use the second NIC

Daneil Goodman daneil.goodman at gmail.com
Wed Nov 11 14:07:39 PST 2009


Hello list,

Searching the archive, I found a similar thread from January 2008, "How do
you make an MGS/OSS listen on 2 NICs?" It looks like no final solution was
posted. I am facing a similar situation and need your help.

I am running CentOS 5 on both the server (MGS, MDS and OSS are on the same
node) and the clients, with kernel 2.6.18-128.1.6.el5_lustre.1.8.0.1smp. To
simplify the issue, suppose the network consists of one Lustre server node
and two Lustre client nodes. The server node has two NICs, eth0 (100Mb) and
eth1 (1Gb); each client node has only one NIC, eth0. The network layout is
as follows.

Server node eth0: 72.203.10.1 (Public network)    <==> Switch1 <==> Public node eth0:  72.203.10.2 (Public network)
Server node eth1: 192.168.10.1 (Internal network) <==> Switch2 <==> Private node eth0: 192.168.10.2 (Internal network)

Both SELinux and the firewall are turned off. The public node does not know
about the private node, but the private node does know about the public node.

The modprobe.conf settings are as follows:

On server: options lnet networks="tcp0(eth0),tcp1(eth1)"
On clients: options lnet networks=tcp  <--- since there is only one NIC, I
did not specify it as tcp(eth0).
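
The network(interface) syntax used on the server line can be used on a client line as well. The fragment below is only a sketch of what that would look like for the private client. Naming its network tcp1 is an assumption (that a client should sit on the same LNET network as the server NIC it talks to); it has not been tested in this setup, and the lnet module would have to be reloaded for any change to take effect.

```
# /etc/modprobe.conf on the private client only (sketch, untested)
# Assumption: putting the client's single NIC on tcp1 matches the
# network name the server assigns to its eth1.
options lnet networks="tcp1(eth0)"
```

If this assumption holds, the public client would keep networks=tcp, since the server's eth0 is on tcp0.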

The procedure used to build the MGS, MDT and OST on the server node is as follows:

[root@server ~]# mkfs.lustre --mgs /dev/sdb1
[root@server ~]# mkdir /mgs
[root@server ~]# mount -t lustre /dev/sdb1 /mgs
[root@server ~]# mkfs.lustre --mdt --fsname=data --mgsnode="72.203.10.1@tcp0,192.168.10.1@tcp1" /dev/sdb2
[root@server ~]# mkdir /data
[root@server ~]# mount -t lustre /dev/sdb2 /data
[root@server ~]# mkfs.lustre --ost --fsname=data --mgsnode="72.203.10.1@tcp0,192.168.10.1@tcp1" /dev/sdc1
[root@server ~]# mkdir /mnt/data
[root@server ~]# mount -t lustre /dev/sdc1 /mnt/data

On server node:
[root@server ~]# lctl list_nids
72.203.10.1@tcp
192.168.10.1@tcp1

[root@server ~]# lctl ping 72.203.10.1@tcp
12345-0@lo
12345-72.203.10.1@tcp
12345-192.168.10.1@tcp1

[root@server ~]# lctl ping 192.168.10.1@tcp1
12345-0@lo
12345-72.203.10.1@tcp
12345-192.168.10.1@tcp1

Therefore, everything on the server node looks good, and I can mount /data
on the public node:
[root@public ~]# mount -t lustre 72.203.10.1@tcp:/data /data

I can also mount /data on the private node using the server's eth0 (public NIC):
[root@private ~]# mount -t lustre 72.203.10.1@tcp:/data /data

But I cannot mount /data on the private node using the server's eth1 (private NIC):

[root@private ~]# mount -t lustre 192.168.10.1@tcp1:/data /data
mount.lustre: /data inaccessible: No such file or directory

On the private node:
[root@private ~]# lctl which_nid 192.168.10.1@tcp1
No reachable NID

[root@private ~]# lctl which_nid 192.168.10.1@tcp
192.168.10.1@tcp

[root@private ~]# lctl which_nid 72.203.10.1@tcp
72.203.10.1@tcp

[root@private ~]# lctl ping 72.203.10.1@tcp
12345-0@lo
12345-72.203.10.1@tcp
12345-192.168.10.1@tcp1

which_nid reports 192.168.10.1@tcp as reachable from the private node, but I
cannot ping it:
[root@private ~]# lctl ping 192.168.10.1@tcp1
failed to ping 192.168.10.1@tcp1: Input/output error

[root@private ~]# lctl ping 192.168.10.1@tcp
failed to ping 192.168.10.1@tcp: Input/output error

The error messages in /var/log/messages are:
LustreError: 27435:0:(lib-move.c:1265:lnet_send()) No route to 12345-192.168.10.1@tcp1
LustreError: 27435:0:(lib-move.c:2450:LNetGet()) error sending GET to 12345-192.168.10.1@tcp1: -113

But I can ping 192.168.10.1 using regular ping.
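
One possible reading of the "No route" error, stated here as an assumption rather than a confirmed diagnosis: LNET routes by network name, so a client configured with networks=tcp (that is, tcp0) has no local interface on tcp1, even though plain IP ping works. The sketch below only compares network names; nid_net and on_same_net are hypothetical helpers, not Lustre commands, and a name match is a necessary condition, not proof of reachability.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Lustre): they compare LNET network
# names only and never touch the network.

# Print the LNET network of a NID, e.g. "192.168.10.1@tcp1" -> "tcp1".
# A bare "tcp" is normalized to "tcp0", since both name the same network.
nid_net() {
    net=${1##*@}
    case "$net" in
        *[0-9]) ;;              # already carries a network number
        *) net="${net}0" ;;     # no number given, so network 0
    esac
    printf '%s\n' "$net"
}

# Succeed if the network of NID $2 is listed in the lnet "networks"
# option string $1, e.g. 'tcp0(eth0),tcp1(eth1)' or 'tcp'.
on_same_net() {
    target=$(nid_net "$2")
    old_ifs=$IFS
    IFS=,
    for entry in $1; do
        IFS=$old_ifs
        name=${entry%%\(*}      # drop the "(ethN)" interface part
        if [ "$(nid_net "x@$name")" = "$target" ]; then
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}
```

With the settings from this post, on_same_net 'tcp' '72.203.10.1@tcp' succeeds while on_same_net 'tcp' '192.168.10.1@tcp1' fails, which is at least consistent with the mounts that work and the one that does not.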

According to the response of Marc:
"As far as I know, LNET will use the shortest path on the network, so if you
have two equivalent tcp networks, tcp0 and tcp1, LNET will just  use the
first one.  If it fails, it should use the second one. If both NICs are in
the same tcp network, LNET should use both.....".

Therefore, I can understand why I can mount /data on the private node using
the server's eth0. But it looks like LNET is not smart enough to pick the
"closer" NIC when mounting /data.

My questions are:
1. Is something wrong with the eth1 configuration?
2. How can I force LNET to use the second NIC (eth1) to mount /data on the
private node? I prefer eth1 over eth0 because eth1 connects directly to
Switch2 and is faster (1Gb) than eth0 (100Mb).

Thanks for reading the detailed procedure. I appreciate your help in
advance,

Goodman