[Lustre-discuss] lustre setup with several subnets

Robert Pinnow, rise | fx robert at risefx.com
Fri Nov 13 06:54:23 PST 2009


hey lustre group,

i am currently working on my first lustre setup (centos 5.2, lustre
1.8.1) and as it is slightly more complicated on the network side than
the one in the quickstart guide, i am running into a few problems here.

*** 1. some of the OSSs cannot mount their OSTs; mount fails with
***    "Input/output error ... Is the MGS running?"
*** 2. the client hangs and produces errors when trying to mount ..:/lustre

the setup looks like this:

1 MDS/CLIENT with 10 (TEN) network interfaces (six of the eths are used
for lustre, the other four are bonded as a cifs gateway). as this is only a
test setup, the MDS is also the client.
3 OSSs with 3 OSTs each, each OSS connected directly to the mds via 2 eths

every interface used for lustre is set up to use a different subnet. all
the eths used for lustre are connected directly to the mds/client
without a switch.

OST1 on OSS1 via eth0 (192.168.16.101)
OST4 on OSS1 via eth1 (192.168.17.101)
OST7 on OSS1 via eth0 (192.168.16.101)

OST2 on OSS2 via eth0 (192.168.18.101)
OST5 on OSS2 via eth1 (192.168.19.101)
OST8 on OSS2 via eth0 (192.168.18.101)

OST3 on OSS3 via eth0 (192.168.10.101)
OST6 on OSS3 via eth1 (192.168.11.101)
OST9 on OSS3 via eth0 (192.168.10.101)

on the MDS/CLIENT side eth0,1,6,7,8,9 are configured accordingly. a
linux ping works for all of them.
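
the pairing of addresses on each direct link can be checked with a short sketch like the one below. the OSS addresses are taken from the OST list above and the MDS addresses from the "lctl list_nids" output further down; the /24 prefix length and the names LINKS, same_subnet and check_links are assumptions made for illustration, since the post does not state the netmask.

```python
import ipaddress

# (oss, interface) -> (OSS address, MDS address on the same direct link).
# OSS addresses come from the OST list, MDS addresses from "lctl list_nids".
LINKS = {
    ("oss1", "eth0"): ("192.168.16.101", "192.168.16.100"),
    ("oss1", "eth1"): ("192.168.17.101", "192.168.17.100"),
    ("oss2", "eth0"): ("192.168.18.101", "192.168.18.100"),
    ("oss2", "eth1"): ("192.168.19.101", "192.168.19.100"),
    ("oss3", "eth0"): ("192.168.10.101", "192.168.10.100"),
    ("oss3", "eth1"): ("192.168.11.101", "192.168.11.100"),
}


def same_subnet(addr_a, addr_b, prefix=24):
    """Return True if both addresses fall into the same network.

    prefix=24 is an assumption; the post does not state the netmask.
    """
    network = ipaddress.ip_network(f"{addr_a}/{prefix}", strict=False)
    return ipaddress.ip_address(addr_b) in network


def check_links(links, prefix=24):
    """Map every link to True/False depending on whether both ends match."""
    return {
        link: same_subnet(oss, mds, prefix)
        for link, (oss, mds) in links.items()
    }


if __name__ == "__main__":
    for (host, iface), ok in check_links(LINKS).items():
        state = "same subnet" if ok else "DIFFERENT subnets"
        print(f"{host} {iface}: {state}")
```

this only confirms that the IP layer is laid out consistently (which matches the working ping); it says nothing about how lnet names the networks.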

interfaces in the modprobe.conf are configured with

options lnet networks=tcp0(eth0),tcp1(eth1)

(and of course all six on the MDS/CLIENT, plus, just in case, the lo and
the bond0)
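
to compare the networks setting between nodes, the value can be split into a network-to-interface mapping. this is a minimal sketch that assumes the simple form shown above, with exactly one interface per network; parse_networks and format_networks are hypothetical helper names, not part of lustre.

```python
import re

# One entry looks like "tcp0(eth0)": network name, then the interface.
ENTRY = re.compile(r"([a-z][a-z0-9]*)\((\w+)\)")


def parse_networks(value):
    """Parse 'tcp0(eth0),tcp1(eth1)' into {'tcp0': 'eth0', 'tcp1': 'eth1'}.

    Assumes one interface per network, as in the post.
    """
    mapping = {}
    for item in value.split(","):
        match = ENTRY.fullmatch(item.strip())
        if match is None:
            raise ValueError(f"cannot parse network entry: {item!r}")
        net, iface = match.groups()
        mapping[net] = iface
    return mapping


def format_networks(mapping):
    """Render a mapping back into the 'net(iface),net(iface)' form."""
    return ",".join(f"{net}({iface})" for net, iface in mapping.items())


if __name__ == "__main__":
    oss = parse_networks("tcp0(eth0),tcp1(eth1)")
    print(oss)
    print("options lnet networks=" + format_networks(oss))
```

with the mappings of two nodes side by side it is easy to see which network name each end assigns to a shared link.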

the filesystems were created with the option --mgsnode=IPADDRESS@tcpN
according to the list above so the OSTs know which interface to use (do
they know?)

the output from the mds looks like it knows about its interfaces:
-----------------------------
[root@mds ~]# lctl list_nids
192.168.10.100@tcp
192.168.11.100@tcp1
127.0.0.1@tcp2
192.168.1.205@tcp5
192.168.16.100@tcp6
192.168.17.100@tcp7
192.168.18.100@tcp8
192.168.19.100@tcp9
-----------------------------
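
the oss1 log further down asks whether 192.168.16.100@tcp is one of the MDS's NIDs. a small sketch can answer that from the list above. it assumes NIDs in the usual address@network form (the archive prints "@" as " at ", so both spellings are accepted) and that a network name without a number means network 0, which fits tcp0(eth0) showing up as "tcp" in the logs. parse_nid, parse_nids and networks_for are hypothetical helper names.

```python
# MDS NIDs as printed by "lctl list_nids" in the post.
LIST_NIDS = """\
192.168.10.100@tcp
192.168.11.100@tcp1
127.0.0.1@tcp2
192.168.1.205@tcp5
192.168.16.100@tcp6
192.168.17.100@tcp7
192.168.18.100@tcp8
192.168.19.100@tcp9
"""


def parse_nid(text):
    """Split a NID into (address, network), e.g. ('192.168.16.100', 'tcp6')."""
    # The list archive prints "@" as " at "; accept both spellings.
    addr, sep, net = text.strip().replace(" at ", "@").partition("@")
    if not sep or not addr or not net:
        raise ValueError(f"not a NID: {text!r}")
    # A network name without a number is treated as network 0.
    if not net[-1].isdigit():
        net += "0"
    return addr, net


def parse_nids(output):
    """Parse multi-line list_nids output into a set of (address, network)."""
    return {parse_nid(line) for line in output.splitlines() if line.strip()}


def networks_for(address, nids):
    """Return the network names under which an address is configured."""
    return sorted(net for addr, net in nids if addr == address)


if __name__ == "__main__":
    mds = parse_nids(LIST_NIDS)
    wanted = parse_nid("192.168.16.100@tcp")
    print("192.168.16.100@tcp present:", wanted in mds)
    print("192.168.16.100 is on:", networks_for("192.168.16.100", mds))
```

on this data the address 192.168.16.100 is only listed under tcp6, not under tcp. whether that mismatch is the actual cause of the mount failure is not established here; it is just what the two outputs show when compared.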

the three OSTs listed here are the ones that i was able to mount on oss3
-----------------------------
[root@mds ~]# lctl device_list
  0 UP mgs MGS MGS 5
  1 UP mgc MGC192.168.10.100@tcp 95c7ba1b-a923-76f8-63fc-1a44201cf75e 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4
  4 UP mds lustre-MDT0000 lustre-MDT0000_UUID 3
  5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5
  6 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
  7 UP osc lustre-OST0002-osc lustre-mdtlov_UUID 5
-----------------------------

on all the other OSSs, trying to start the OSTs produces the
following output:
-----------------------------
[root@oss1 ~]# mount -t lustre /dev/vg01/ost1 /mnt/ost1
mount.lustre: mount /dev/vg01/ost1 at /mnt/ost1 failed: Input/output error
Is the MGS running?
-----------------------------

and these dmesg notes:
-----------------------------
Lustre: Added LNI 192.168.16.101@tcp [8/256/0/0]
Lustre: Added LNI 192.168.17.101@tcp1 [8/256/0/0]
Lustre: Accept secure, port 988
usb 6-2: USB disconnect, address 2
Lustre: OBD class driver, http://www.lustre.org/
Lustre:     Lustre Version: 1.8.1.1
Lustre:     Build Version:
1.8.1.1-20091009075116-PRISTINE-2.6.18-128.7.1.el5_lustre.1.8.1.1
Lustre: Lustre Client File System; http://www.lustre.org/
kjournald starting.  Commit interval 5 seconds
LDISKFS FS on dm-2, internal journal
LDISKFS-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
LDISKFS FS on dm-2, internal journal
LDISKFS-fs: mounted filesystem with ordered data mode.
LDISKFS-fs: file extents enabled
LDISKFS-fs: mballoc enabled
LustreError: 11140:0:(socklnd_cb.c:1706:ksocknal_recv_hello()) Error
-104 reading HELLO from 192.168.16.100
LustreError: 11b-b: Connection to 192.168.16.100@tcp at host
192.168.16.100 on port 988 was reset: is it running a compatible version
of Lustre and is 192.168.16.100@tcp one of its NIDs?
Lustre: 11140:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.16.101@tcp->192.168.16.100@tcp
Lustre: 12047:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request
x1319237993889793 sent from MGC192.168.16.100@tcp to NID
192.168.16.100@tcp 5s ago has timed out (limit 5s).
  req@ffff8100590c5400 x1319237993889793/t0
o250->MGS@MGC192.168.16.100@tcp_0:26/25 lens 368/584 e 0 to 1 dl
1258123398 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: 12024:0:(obd_mount.c:1085:server_start_targets()) Required
registration failed for lustre-OSTffff: -5
LustreError: 15f-b: Communication error with the MGS.  Is the MGS running?
LustreError: 12024:0:(obd_mount.c:1629:server_fill_super()) Unable to
start targets: -5
LustreError: 12024:0:(obd_mount.c:1412:server_put_super()) no obd
lustre-OSTffff
LustreError: 12024:0:(obd_mount.c:136:server_deregister_mount())
lustre-OSTffff not registered
LDISKFS-fs: mballoc: 0 blocks 0 reqs (0 success)
LDISKFS-fs: mballoc: 0 extents scanned, 0 goal hits, 0 2^N hits, 0
breaks, 0 lost
LDISKFS-fs: mballoc: 0 generated and it took 0
LDISKFS-fs: mballoc: 0 preallocated, 0 discarded
Lustre: server umount lustre-OSTffff complete
LustreError: 12024:0:(obd_mount.c:1997:lustre_fill_super()) Unable to
mount  (-5)
-----------------------------
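
the negative numbers in these messages are linux errno values as the kernel reports them. the sketch below pulls them out of a few log lines and names them. the table is hand-written and limited to the codes that appear in the logs of this post (-4, -5, -104, -108, -111, -113); codes_in and decode are hypothetical helper names, and the sample lines are the wrapped log lines from above joined back together.

```python
import re

# Linux errno values seen in the logs of this post (hand-written table).
LINUX_ERRNO = {
    4: ("EINTR", "Interrupted system call"),
    5: ("EIO", "Input/output error"),
    104: ("ECONNRESET", "Connection reset by peer"),
    108: ("ESHUTDOWN", "Cannot send after transport endpoint shutdown"),
    111: ("ECONNREFUSED", "Connection refused"),
    113: ("EHOSTUNREACH", "No route to host"),
}

# A minus sign followed by digits, not glued to a preceding word or number,
# so "lustre-OSTffff", "120-3" or "11b-b" are not picked up.
CODE = re.compile(r"(?<![\w.])-(\d+)\b")

SAMPLE = """\
LustreError: 11140:0:(socklnd_cb.c:1706:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.16.100
LustreError: 12024:0:(obd_mount.c:1085:server_start_targets()) Required registration failed for lustre-OSTffff: -5
LustreError: 12024:0:(obd_mount.c:1997:lustre_fill_super()) Unable to mount  (-5)
"""


def codes_in(text):
    """Return the distinct error numbers found in the text, sorted."""
    return sorted({int(number) for number in CODE.findall(text)})


def decode(code):
    """Translate an errno value (sign ignored) into 'NAME (description)'."""
    name, text = LINUX_ERRNO.get(abs(code), ("?", "not in this table"))
    return f"{name} ({text})"


if __name__ == "__main__":
    for code in codes_in(SAMPLE):
        print(f"-{code}: {decode(code)}")
```

read this way, the oss1 log shows a connection reset while talking to 192.168.16.100, followed by a generic I/O error passed up to mount.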

trying to mount this setup on the CLIENT (with the three available OSTs,
though 6 osts are still missing) with "mount -t lustre
[anyofmyinterfaces]:/lustre /mnt/lustre" produces a hang until i
press ctrl+c.

a dmesg of the MDS/CLIENT looks like this:
-----------------------------
Lustre: OBD class driver, http://www.lustre.org/
Lustre:     Lustre Version: 1.8.1.1
Lustre:     Build Version:
1.8.1.1-20091009075116-PRISTINE-2.6.18-128.7.1.el5_lustre.1.8.1.1
Lustre: Added LNI 192.168.10.100@tcp [8/256/0/0]
Lustre: Added LNI 192.168.11.100@tcp1 [8/256/0/0]
Lustre: Added LNI 127.0.0.1@tcp2 [8/256/0/0]
Lustre: Added LNI 192.168.1.205@tcp5 [8/256/0/0]
Lustre: Added LNI 192.168.16.100@tcp6 [8/256/0/0]
Lustre: Added LNI 192.168.17.100@tcp7 [8/256/0/0]
Lustre: Added LNI 192.168.18.100@tcp8 [8/256/0/0]
Lustre: Added LNI 192.168.19.100@tcp9 [8/256/0/0]
Lustre: Accept secure, port 988
Lustre: Lustre Client File System; http://www.lustre.org/
kjournald starting.  Commit interval 5 seconds
LDISKFS FS on dm-0, internal journal
LDISKFS-fs: mounted filesystem with ordered data mode.
kjournald starting.  Commit interval 5 seconds
LDISKFS FS on dm-0, internal journal
LDISKFS-fs: mounted filesystem with ordered data mode.
Lustre: MGS MGS started
Lustre: MGC192.168.10.100@tcp: Reactivating import
Lustre: Enabling user_xattr
Lustre: MDT lustre-MDT0000 now serving lustre-MDT0000_UUID
(lustre-MDT0000/a387fd73-011c-aaa2-4d56-c85cf2b2228c) with recovery enabled
Lustre: 12081:0:(lproc_mds.c:271:lprocfs_wr_group_upcall())
lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
Lustre: lustre-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups
Lustre: Server lustre-MDT0000 on device /dev/vg00/mds1 has started
Lustre: 11898:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -111
connecting 0.0.0.0/1023 -> 192.168.10.101/988
Lustre: 11898:0:(acceptor.c:88:lnet_connect_console_error()) Connection
to 192.168.10.101 at tcp at host 192.168.10.101 on port 988 was refused:
check that Lustre is running on that node.
Lustre: 11898:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.10.100 at tcp->192.168.10.101 at tcp
Lustre: 11898:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.10.100 at tcp->192.168.10.101 at tcp
Lustre: 11898:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.10.100 at tcp->192.168.10.101 at tcp
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request
x1319226586431495 sent from lustre-OST0000-osc to NID 192.168.10.101 at tcp
5s ago has timed out (limit 5s).
  req at ffff8100d7264800 x1319226586431495/t0
o8->lustre-OST0000_UUID at 192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl
1258112518 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 2
previous similar messages
Lustre: 11906:0:(import.c:508:import_select_connection())
lustre-OST0000-osc: tried all connections, increasing latency to 35s
Lustre: 11906:0:(import.c:508:import_select_connection()) Skipped 2
previous similar messages
Lustre: 11905:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable
routes to 12345-192.168.10.101 at tcp
LustreError: 11905:0:(events.c:66:request_out_callback()) @@@ type 4,
status -5  req at ffff81011d067000 x1319226586431526/t0
o8->lustre-OST0000_UUID at 192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl
1258112803 ref 2 fl Rpc:N/0/0 rc 0/0
Lustre: 11905:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable
routes to 12345-192.168.10.101 at tcp
Lustre: 11905:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable
routes to 12345-192.168.10.101 at tcp
Lustre: 11906:0:(import.c:508:import_select_connection())
lustre-OST0000-osc: tried all connections, increasing latency to 40s
Lustre: 11906:0:(import.c:508:import_select_connection()) Skipped 2
previous similar messages
Lustre: 11901:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -111
connecting 0.0.0.0/1023 -> 192.168.10.101/988
Lustre: 11901:0:(acceptor.c:88:lnet_connect_console_error()) Connection
to 192.168.10.101 at tcp at host 192.168.10.101 on port 988 was refused:
check that Lustre is running on that node.
Lustre: 11901:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.10.100 at tcp->192.168.10.101 at tcp
Lustre: 11901:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.10.100 at tcp->192.168.10.101 at tcp
Lustre: 11901:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.10.100 at tcp->192.168.10.101 at tcp
eth1: Link is Down
e1000: eth0: e1000_watchdog_task: NIC Link is Down
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request
x1319226586431530 sent from lustre-OST0000-osc to NID 192.168.10.101 at tcp
45s ago has timed out (limit 45s).
  req at ffff810104d4c800 x1319226586431530/t0
o8->lustre-OST0000_UUID at 192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl
1258112833 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 5
previous similar messages
eth1: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex,
Flow Control: RX/TX
Lustre: 11906:0:(import.c:508:import_select_connection())
lustre-OST0000-osc: tried all connections, increasing latency to 45s
Lustre: 11906:0:(import.c:508:import_select_connection()) Skipped 2
previous similar messages
Lustre: 11905:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable
routes to 12345-192.168.10.101 at tcp
LustreError: 11905:0:(events.c:66:request_out_callback()) @@@ type 4,
status -5  req at ffff8100de21d000 x1319226586431535/t0
o8->lustre-OST0000_UUID at 192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl
1258112888 ref 2 fl Rpc:N/0/0 rc 0/0
LustreError: 11905:0:(events.c:66:request_out_callback()) Skipped 2
previous similar messages
Lustre: 11905:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable
routes to 12345-192.168.10.101 at tcp
Lustre: 11905:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable
routes to 12345-192.168.10.101 at tcp
eth1: Link is Down
e1000: eth0: e1000_watchdog_task: NIC Link is Down
eth1: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth0: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex,
Flow Control: RX/TX
e1000: eth0: e1000_watchdog_task: NIC Link is Down
e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex,
Flow Control: RX/TX
Lustre: 11898:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113
connecting 0.0.0.0/1023 -> 192.168.10.101/988
Lustre: 11898:0:(acceptor.c:95:lnet_connect_console_error()) Connection
to 192.168.10.101 at tcp at host 192.168.10.101 was unreachable: the
network or that node may be down, or Lustre may be misconfigured.
Lustre: 11898:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.10.100 at tcp->192.168.10.101 at tcp
Lustre: 11898:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.10.100 at tcp->192.168.10.101 at tcp
Lustre: 11898:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.10.100 at tcp->192.168.10.101 at tcp
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request
x1319226586431539 sent from lustre-OST0000-osc to NID 192.168.10.101 at tcp
55s ago has timed out (limit 55s).
  req at ffff8100de21d400 x1319226586431539/t0
o8->lustre-OST0000_UUID at 192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl
1258112918 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 5
previous similar messages
eth1: Link is Down
e1000: eth0: e1000_watchdog_task: NIC Link is Down
Lustre: 11906:0:(import.c:508:import_select_connection())
lustre-OST0000-osc: tried all connections, increasing latency to 50s
Lustre: 11906:0:(import.c:508:import_select_connection()) Skipped 5
previous similar messages
eth1: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex,
Flow Control: RX/TX
Lustre: 11899:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113
connecting 0.0.0.0/1023 -> 192.168.10.101/988
Lustre: 11899:0:(acceptor.c:95:lnet_connect_console_error()) Connection
to 192.168.10.101 at tcp at host 192.168.10.101 was unreachable: the
network or that node may be down, or Lustre may be misconfigured.
Lustre: 11899:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.10.100 at tcp->192.168.10.101 at tcp
Lustre: 11899:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.10.100 at tcp->192.168.10.101 at tcp
Lustre: 11899:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.10.100 at tcp->192.168.10.101 at tcp
eth1: Link is Down
eth1: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request
x1319226586431545 sent from lustre-OST0000-osc to NID 192.168.10.101 at tcp
55s ago has timed out (limit 55s).
  req at ffff8100dc6ca800 x1319226586431545/t0
o8->lustre-OST0000_UUID at 192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl
1258112993 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 2
previous similar messages
Lustre: 11906:0:(import.c:508:import_select_connection())
lustre-OST0000-osc: tried all connections, increasing latency to 50s
Lustre: 11906:0:(import.c:508:import_select_connection()) Skipped 5
previous similar messages
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request
x1319226586431557 sent from lustre-OST0000-osc to NID 192.168.10.101 at tcp
55s ago has timed out (limit 55s).
  req at ffff81011ebda000 x1319226586431557/t0
o8->lustre-OST0000_UUID at 192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl
1258113143 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 5
previous similar messages
Lustre: 11906:0:(import.c:508:import_select_connection())
lustre-OST0000-osc: tried all connections, increasing latency to 50s
Lustre: 11906:0:(import.c:508:import_select_connection()) Skipped 11
previous similar messages
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request
x1319226586431581 sent from lustre-OST0000-osc to NID 192.168.10.101 at tcp
55s ago has timed out (limit 55s).
  req at ffff8100da7abc00 x1319226586431581/t0
o8->lustre-OST0000_UUID at 192.168.10.101@tcp:28/4 lens 368/584 e 0 to 1 dl
1258113443 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 11905:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 11
previous similar messages
Lustre: MGC127.0.0.1@tcp2: Reactivating import
LustreError: 120-3: Refusing connection from 192.168.16.100 for
192.168.16.100@tcp: No matching NI
LustreError: 11901:0:(socklnd_cb.c:1706:ksocknal_recv_hello()) Error
-104 reading HELLO from 192.168.16.100
LustreError: 11b-b: Connection to 192.168.16.100@tcp at host
192.168.16.100 on port 988 was reset: is it running a compatible version
of Lustre and is 192.168.16.100@tcp one of its NIDs?
Lustre: 11901:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.10.100@tcp->192.168.16.100@tcp
LustreError: 120-3: Refusing connection from 192.168.16.100 for
192.168.16.100@tcp: No matching NI
LustreError: 11898:0:(socklnd_cb.c:1706:ksocknal_recv_hello()) Error
-104 reading HELLO from 192.168.16.100
LustreError: 11b-b: Connection to 192.168.16.100@tcp at host
192.168.16.100 on port 988 was reset: is it running a compatible version
of Lustre and is 192.168.16.100@tcp one of its NIDs?
Lustre: 11898:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting
packet type 1 len 368 192.168.10.100@tcp->192.168.16.100@tcp
LustreError: 12157:0:(lov_obd.c:988:lov_cleanup()) lov tgt 0 not
cleaned! deathrow=0, lovrc=1
LustreError: 12157:0:(ldlm_request.c:1030:ldlm_cli_cancel_req()) Got rc
-108 from cancel RPC: canceling anyway
LustreError: 12157:0:(ldlm_request.c:1533:ldlm_cli_cancel_list())
ldlm_cli_cancel_list: -108
Lustre: client ffff8101172ef800 umount complete
LustreError: 12157:0:(obd_mount.c:1997:lustre_fill_super()) Unable to
mount  (-4)
------------------------

any ideas?

THANK YOU!

robert


