[Lustre-discuss] Network name o2ib0 collision in two discrete filesystems
James Robnett
jrobnett at aoc.nrao.edu
Tue Sep 9 04:04:58 PDT 2014
I'm having difficulty figuring out a solution to an LNET issue.
We have two Lustre filesystems separated by about 60 miles, both of
which have o2ib0(ib0) and tcp(eth0) networks defined. Both have IB and
TCP clients which work just fine.
I'll call them FS1 and FS2.
FS1-mds@ib0 192.168.1.11
FS1-mds@eth0 10.1.1.11
FS2-mds@ib0 192.168.2.11
FS2-mds@eth0 10.1.2.11
We have a need for a client physically at site-1 to mount the
filesystems from both sites. The intent is to mount the local FS1 via
IB0 and the remote FS2 via TCP0 (accessible over gbit).
The mount commands for the client are:
mount -t lustre 192.168.1.11@o2ib0:/lustre /lustre/FS1
mount -t lustre 10.1.2.11@tcp0:/lustre /lustre/FS2
If I set this client's modprobe.conf line as
options lnet networks=o2ib0(ib0),tcp0(eth0)
then it mounts FS1 without issue but then fails on FS2 since it tries to
communicate via o2ib0 despite the mount command specifying tcp0.
Presumably this is because the client asserts it knows about both o2ib0
and tcp0 without realizing that o2ib0 at site1 is functionally different
from o2ib0 at site2.
If I set the client's modprobe.conf line as
options lnet networks=tcp0(eth0),o2ib0(ib0)
then it mounts FS1 just fine but actually communicates via TCP0 (visible
through /proc/sys/lnet/peers) since there's a network path that works
and it's first in the list. It also mounts FS2 just fine as expected.
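As a rough sketch of how to check which network each peer is actually on, the helper below lists peers from /proc/sys/lnet/peers grouped by LNET network. The function name peer_nets is made up for illustration, and it assumes each peer row begins with a NID of the form address@network; the exact column layout may differ between Lustre versions.

```shell
# peer_nets: list LNET peers as "network address", sorted by network.
# Hypothetical helper. Assumes the first column of each peer row is a NID
# such as 192.168.1.11@o2ib; the header row (no "@") is skipped.
# Reads /proc/sys/lnet/peers unless another file is given as $1.
peer_nets() {
    awk '$1 ~ /@/ { split($1, nid, "@"); print nid[2], nid[1] }' \
        "${1:-/proc/sys/lnet/peers}" | sort
}
```

Running peer_nets after both mounts should show the FS1 server addresses under the IB network if IB is really in use, and under tcp if the client has fallen back to TCP.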
So I can mount one or the other but not both, or at least not both in
the way that we need (i.e. IB for site1 and TCP for site2).
I'd begun looking into setting up an LNET router at site2 but I'm
suspicious that won't actually help or it will help but only if I set it
up in such a way that it disturbs existing IB0 and TCP0 clients there.
I tried briefly to set up an LNET router at site 1 that only knew about
tcp0. I put a routes line on the client pointing tcp0 at <lnetIP>@tcp0.
The LNET router can see and lctl ping the FS2 MDS but the client throws
an error on startup and doesn't seem to believe there's a route.
I'm beginning to sense that the only real option is to get rid of the IB
name collision and do a tunefs at site2 and change the servers and
clients to use o2ib1 rather than o2ib0, or other permutations of
renaming networks, but maybe (hopefully) I'm missing something with lnet
routing.
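For illustration, the renaming option would presumably come down to something like the following on the LNET side. This is only a sketch and is untested: the interface names are the ones described above, the comments are added for explanation, and the server-side re-registration (the tunefs step) is not shown.

```conf
# Site 2 servers and site 2 IB clients (hypothetical new setting):
options lnet networks=o2ib1(ib0),tcp0(eth0)

# Site 1 client that mounts both filesystems (unchanged):
options lnet networks=o2ib0(ib0),tcp0(eth0)
```

With that arrangement the site 1 client has no o2ib1 network, so it should only be able to reach FS2 over tcp0, while FS1 stays on o2ib0.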
On a side note, it's mildly confusing that the ordering of the lnet
options networks= line takes precedence over the mount command. If that
weren't the case, either modprobe.conf ordering above would work rather
than neither, but maybe there's a case I'm missing that requires the
lnet option ordering to take precedence over the mount syntax.
Of course there's the very real possibility I'm missing an obvious
simple solution.
James Robnett
NRAO/NM