[Lustre-discuss] Network name o2ib0 collision in two discrete filesystems

James Robnett jrobnett at aoc.nrao.edu
Tue Sep 9 04:04:58 PDT 2014


I'm having difficulty figuring out a solution to an LNET issue.

We have two Lustre filesystems separated by about 60 miles, both of 
which have o2ib0(ib0) and tcp(eth0) networks defined.  Both have IB and 
TCP clients which work just fine.

I'll call them FS1 and FS2.

FS1-mds at ib0  192.168.1.11
FS1-mds at eth0 10.1.1.11

FS2-mds at ib0  192.168.2.11
FS2-mds at eth0 10.1.2.11

We have a need for a client physically at site-1 to mount the 
filesystems from both sites.  The intent is to mount the local FS1 via 
IB0 and the remote FS2 via TCP0 (accessible over gbit).

The mount commands for the client are:
mount -t lustre 192.168.1.11@o2ib0:/lustre /lustre/FS1
mount -t lustre 10.1.2.11@tcp0:/lustre /lustre/FS2

If I set this client's modprobe.conf line as

options lnet networks="o2ib0(ib0),tcp0(eth0)"

then it mounts FS1 without issue but then fails on FS2, since it tries to 
communicate via o2ib0 despite the mount command specifying tcp0. 
Presumably this is because the client asserts it knows about both o2ib0 
and tcp0 without realizing that o2ib0 at site1 is functionally different 
from o2ib0 at site2.

If I set the client's modprobe.conf line as
options lnet networks="tcp0(eth0),o2ib0(ib0)"

then it mounts FS1 just fine but actually communicates via TCP0 (visible 
through /proc/sys/lnet/peers) since there's a network path that works 
and it's first in the list.  It also mounts FS2 just fine as expected.

So I can mount one or the other but not both, or at least not both in 
the way that we need (i.e. IB for site1 and TCP for site2).

I'd begun looking into setting up an LNET router at site2 but I'm 
suspicious that won't actually help or it will help but only if I set it 
up in such a way that it disturbs existing IB0 and TCP0 clients there.

I tried briefly to set up an LNET router at site 1 that only knew about 
tcp0.  I put a routes line on the client pointing tcp0 at <lnetIP>@tcp0.
The LNET router can see and lctl ping the FS2 MDS but the client throws 
an error on startup and doesn't seem to believe there's a route.

I'm beginning to sense that the only real option is to get rid of the IB 
name collision and do a tunefs at site2 and change the servers and 
clients to use o2ib1 rather than o2ib0, or other permutations of 
renaming networks, but maybe (hopefully) I'm missing something with lnet 
routing.
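
A rough sketch of what the renaming option might look like on the lnet 
side, assuming the site2 IB network becomes o2ib1 as described above and 
that the interface names stay ib0/eth0.  The server-side tunefs/writeconf 
steps needed to re-register the new NIDs are not shown, and this should 
be treated as an untested illustration rather than a known-good config.

```
# site2 servers and site2 IB clients
# (assumption: only the IB network name changes, o2ib0 -> o2ib1)
options lnet networks="o2ib1(ib0),tcp0(eth0)"

# site1 client that mounts both filesystems (unchanged; it has no o2ib1,
# so the hope is that FS2 is then only reachable via tcp0)
options lnet networks="o2ib0(ib0),tcp0(eth0)"
```

The mount commands on the site1 client would stay as they are, since FS1 
is still addressed as 192.168.1.11@o2ib0 and FS2 as 10.1.2.11@tcp0.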

On a side note, it's mildly confusing that the ordering of the lnet 
options networks= line takes precedence over the mount command.  If that weren't 
the case then either modprobe.conf line ordering above would work rather 
than neither but maybe there's a case I'm missing that requires that 
lnet option ordering takes precedence over the mount syntax.

Of course there's the very real possibility I'm missing an obvious 
simple solution.

James Robnett
NRAO/NM




