[Lustre-discuss] Dual Homed Filesystem Issue

Dennis Nelson dnelson at sgi.com
Wed Apr 21 13:07:29 PDT 2010


Hi All,

I have an existing Lustre filesystem that I want to make available to
another cluster.  Both clusters are IB based and the servers have dual IB
ports which are connected to the two IB fabrics of the clusters.

Cluster A, the original cluster, works fine with the filesystem.  The
clients for cluster B cannot mount the filesystem, even though lctl ping
works.  Here are the details:

MGS/MDS: /etc/modprobe.conf.local

options lnet ip2nets="o2ib0(ib0) 10.149.0.*;o2ib1(ib1) 10.150.0.*"
ib0 IP: 10.149.0.69/16
ib1 IP: 10.150.0.69/16

lctl list_nids:
10.149.0.69@o2ib
10.150.0.69@o2ib1

OSS1: /etc/modprobe.conf.local:

options lnet ip2nets="o2ib0(ib0) 10.149.0.*;o2ib1(ib1) 10.150.0.*"
ib0 IP: 10.149.0.70/16
ib1 IP: 10.150.0.70/16

lctl list_nids:
10.149.0.70@o2ib
10.150.0.70@o2ib1

OSS2-3 similar to OSS1
OSS2 ib0 IP: 10.149.0.71/16
OSS2 ib1 IP: 10.150.0.71/16

OSS3 ib0 IP: 10.149.0.72/16
OSS3 ib1 IP: 10.150.0.72/16

Cluster A Client:  /etc/modprobe.conf.local

options lnet networks=o2ib(ib1)

ib1 IP: 10.149.0.2/16

# lctl list_nids
10.149.0.2@o2ib


Cluster B Client: /etc/modprobe.conf.local

options lnet ip2nets="o2ib1(ib0) 10.150.0.*"

ib0 IP 10.150.0.120/16

# lctl list_nids
10.150.0.120@o2ib1
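
As a rough illustration of how the ip2nets settings above resolve on each
node, here is a minimal Python sketch.  The helper names (parse_ip2nets,
octets_match, networks_for) are hypothetical and not part of Lustre, and the
matcher only handles literal octets and "*", which is all the rules quoted
above use; the real LNet pattern syntax is richer.

```python
import re

def parse_ip2nets(spec):
    """Split an ip2nets string into (network, interface, ip_pattern) rules."""
    rules = []
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # Simplified form only: "net(iface) a.b.c.d" with optional "*" octets.
        m = re.fullmatch(r"(\w+)\((\w+)\)\s+(\S+)", entry)
        if m is None:
            raise ValueError(f"unsupported rule: {entry!r}")
        rules.append(m.groups())
    return rules

def octets_match(pattern, ip):
    """Compare octet by octet; '*' in the pattern matches any octet."""
    p, a = pattern.split("."), ip.split(".")
    return len(p) == len(a) == 4 and all(x == "*" or x == y for x, y in zip(p, a))

def networks_for(spec, local_ips):
    """Return the (network, interface) pairs whose pattern matches a local IP."""
    return [(net, iface) for net, iface, pat in parse_ip2nets(spec)
            if any(octets_match(pat, ip) for ip in local_ips)]

# Values copied from the configuration listed above.
SERVER_SPEC = "o2ib0(ib0) 10.149.0.*;o2ib1(ib1) 10.150.0.*"
CLIENT_B_SPEC = "o2ib1(ib0) 10.150.0.*"

mds_nets = networks_for(SERVER_SPEC, ["10.149.0.69", "10.150.0.69"])
client_b_nets = networks_for(CLIENT_B_SPEC, ["10.150.0.120"])
print(mds_nets)       # [('o2ib0', 'ib0'), ('o2ib1', 'ib1')]
print(client_b_nets)  # [('o2ib1', 'ib0')]
```

This agrees with the lctl list_nids output: the servers sit on both networks,
while the cluster B client sits on o2ib1 only.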

---

I can use lctl ping to ping the MDS/MGS:

Client B# lctl ping 10.150.0.69@o2ib1
12345-0@lo
12345-10.149.0.69@o2ib
12345-10.150.0.69@o2ib1

And the other direction - MGS/MDS to client:

mds # lctl ping 10.150.0.120@o2ib1
12345-0@lo
12345-10.150.0.120@o2ib1

Yet, the mount continues to fail.  I see the following messages in
/var/log/messages on the client:

Apr 21 14:37:02 gpute-47 kernel: Lustre: MGC10.150.0.69@o2ib1: Reactivating import
Apr 21 14:37:02 gpute-47 kernel: LustreError: 21187:0:(events.c:460:ptlrpc_uuid_to_peer()) No NID found for 10.149.0.69@o2ib
Apr 21 14:37:02 gpute-47 kernel: LustreError: 21187:0:(client.c:69:ptlrpc_uuid_to_connection()) cannot find peer 10.149.0.69@o2ib!
Apr 21 14:37:02 gpute-47 kernel: LustreError: 21187:0:(ldlm_lib.c:334:client_obd_setup()) can't add initial connection
Apr 21 14:37:02 gpute-47 kernel: LustreError: 21187:0:(obd_config.c:363:class_setup()) setup lustre-MDT0000-mdc-ffff810322afe400 failed (-2)
Apr 21 14:37:02 gpute-47 kernel: LustreError: 21187:0:(obd_config.c:1102:class_config_llog_handler()) Err -2 on cfg command:
Apr 21 14:37:02 gpute-47 kernel: Lustre:    cmd=cf003 0:lustre-MDT0000-mdc 1:lustre-MDT0000_UUID  2:10.149.0.69@o2ib
Apr 21 14:37:02 gpute-47 kernel: LustreError: 15c-8: MGC10.150.0.69@o2ib1: The configuration from log 'lustre-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
Apr 21 14:37:02 gpute-47 kernel: LustreError: 20920:0:(llite_lib.c:1064:ll_fill_super()) Unable to process log: -2
Apr 21 14:37:02 gpute-47 kernel: LustreError: 20920:0:(obd_config.c:430:class_cleanup()) Device 2 not setup
Apr 21 14:37:03 gpute-47 kernel: LustreError: 20920:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Apr 21 14:37:03 gpute-47 kernel: LustreError: 20920:0:(ldlm_request.c:1622:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Apr 21 14:37:03 gpute-47 kernel: Lustre: client ffff810322afe400 umount complete
Apr 21 14:37:03 gpute-47 kernel: LustreError: 20920:0:(obd_mount.c:1991:lustre_fill_super()) Unable to mount  (-2)
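
To pull the offending NID out of a log like the one above, something along
these lines may help.  This is only a sketch: missing_nids is a hypothetical
helper, and the regular expression is an assumption based solely on the
"No NID found for" message quoted here.  It tolerates lines wrapped by a mail
client and NIDs written with either "@" or " at ".

```python
import re

# Matches the "No NID found for <ip>@<net>" message quoted in the log above.
NID_MISSING = re.compile(
    r"No NID found for\s+(\d+(?:\.\d+){3})\s*(?:@|\bat\b)\s*(\w+)"
)

def missing_nids(log_text):
    """Return the distinct NIDs the client reported it could not find."""
    flat = " ".join(log_text.split())  # undo hard line wrapping
    return sorted({f"{ip}@{net}" for ip, net in NID_MISSING.findall(flat)})

# Excerpt taken from the log above.
sample = """\
Apr 21 14:37:02 gpute-47 kernel: LustreError:
21187:0:(events.c:460:ptlrpc_uuid_to_peer()) No NID found for
10.149.0.69 at o2ib
"""
print(missing_nids(sample))  # ['10.149.0.69@o2ib']
```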

It appears that the client is still looking for the MDS's 10.149.0.69@o2ib
NID, which is on cluster A's fabric, and cannot find it.  What also bothers
me is that no messages at all appear on the MGS/MDS server when I attempt
the mount from the client.  I would have expected some sort of failure
message, unless the request is not reaching the server at all.
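
The mismatch I suspect can be stated as a small check.  This is a simplified
sketch, not how LNet itself decides: nid_network and reachable are
hypothetical helpers, NIDs are written in lctl's address@network form, and
LNet routers are ignored, so a NID counts as reachable only if the client has
a local NID on the same network.  It does rely on "o2ib" and "o2ib0" naming
the same network, as the list_nids output above suggests.

```python
def nid_network(nid):
    """Return the LNet network part of a NID, e.g. 'o2ib1'."""
    _addr, _, net = nid.partition("@")
    # A network name without a trailing number means instance 0.
    if net and not net[-1].isdigit():
        net += "0"
    return net

def reachable(target_nid, local_nids):
    """True if some local NID is on the same network as target_nid."""
    return nid_network(target_nid) in {nid_network(n) for n in local_nids}

client_b_nids = ["10.150.0.120@o2ib1"]  # from lctl list_nids on the client
config_log_nid = "10.149.0.69@o2ib"     # NID named in the cfg command
mgs_nid = "10.150.0.69@o2ib1"           # NID used for the mount and lctl ping

print(reachable(mgs_nid, client_b_nids))         # True
print(reachable(config_log_nid, client_b_nids))  # False
```

That would be consistent with lctl ping succeeding against the o2ib1 NID
while the configuration log hands the client a NID on a network it does not
have.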

Any ideas?






