[Lustre-discuss] Odd connectivity problem...

Lundgren, Andrew Andrew.Lundgren at Level3.com
Fri Mar 7 11:22:19 PST 2008


Taking note from another thread, I ran some ping tests between my machines in rack b to their peers in both rack b and rack a.

The machines in rack b can ping those in a, but not those in b.  I get the following:

[root@dintnyc1304 ~]#  lctl ping 4.23.36.37@tcp0
failed to ping 4.23.36.37@tcp: Input/output error
[root@dintnyc1304 ~]#  lctl ping 4.23.36.38@tcp0
failed to ping 4.23.36.38@tcp: Input/output error
[root@dintnyc1304 ~]#  lctl ping 4.23.36.39@tcp0
failed to ping 4.23.36.39@tcp: Input/output error
[root@dintnyc1304 ~]#  lctl ping 4.23.36.40@tcp0
12345-0@lo
12345-4.23.36.40@tcp
[root@dintnyc1304 ~]#  lctl ping 4.23.36.41@tcp0
failed to ping 4.23.36.41@tcp: Input/output error

The one ping that works above targets the machine I ran the test from (it can ping itself).
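
For anyone wanting to repeat this test across many peers, here is a minimal sketch in Python. It only wraps the same `lctl ping` command shown above. The function names (`lctl_ping`, `sweep`) are made up for illustration, and treating a non-zero exit status as "unreachable" is an assumption about how lctl reports failure.

```python
import subprocess


def lctl_ping(nid, run=subprocess.run):
    """Return True if `lctl ping <nid>` exits with status 0.

    Assumption: lctl exits non-zero when the ping fails.
    """
    result = run(["lctl", "ping", nid], capture_output=True, text=True)
    return result.returncode == 0


def sweep(nids, ping=lctl_ping):
    """Ping each NID in turn and return a dict mapping NID -> reachable."""
    return {nid: ping(nid) for nid in nids}
```

Calling `sweep(["4.23.36.37@tcp0", "4.23.36.40@tcp0"])` as root on a Lustre node returns one True/False entry per NID, which makes it easy to compare results between a rack A node and a rack B node.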

--
Andrew

________________________________
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Lundgren, Andrew
Sent: Friday, March 07, 2008 12:03 PM
To: 'Lustre-discuss at clusterfs.com'
Subject: [Lustre-discuss] Odd connectivity problem...

I am trying to bring up our first "real" Lustre cluster and am running into an unexpected problem.

Let me dump out some background on the configuration first:

I am running CentOS 5.0 (x86_64) with Lustre 1.6.4.2.

I have 2 racks of machines located next to each other.  Each rack has its own router.

Each router has its own subnet.

Each machine has 2 NICs: one connected to the local rack and one connected to the adjacent rack.

Each machine has local disks in it that serve as the OST storage.

Each machine has a lustre client running on it with visibility into the lustre cluster.

We are not using bonding; each machine has visibility into both subnets through the cross-connection.

We have the following line in the modprobe.conf file:

options lnet networks=tcp0(eth1,eth0)

We are not using a routing protocol yet.  The default route for each machine is its local router, reached through eth0.

ASCII Pictures:

Machines in Rack A are all wired like this:

RACK A                        RACK B
--------------------       ----------------------
|Router (4.23.37.1)|       |Router A (4.23.36.1)|
--------------------       ---------------------
    | eth0 (4.23.37.10)         |
----------                      |
| Machine |--eth1--(4.23.36.42)-|
---------- def gw (4.23.37.1)



Machines in Rack B are all wired like this:


RACK A                        RACK B
--------------------       ----------------------
|Router (4.23.37.1)|       |Router A (4.23.36.1)|
--------------------       ---------------------
        |                           | eth0 (4.23.36.10)
        |                      ----------
        --eth1--(4.23.37.42)--| Machine |
          def gw (4.23.36.1)  ----------
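
To make the addressing in the two diagrams easier to reason about, here is a small Python sketch that reports which local interface shares a subnet with a given peer address. The /24 prefix length is an assumption (no netmask is stated above), and `interface_for_peer` is a hypothetical helper, not part of Lustre or LNET.

```python
import ipaddress


def interface_for_peer(local_ifaces, peer_ip):
    """Return the name of the first local interface whose subnet contains
    peer_ip, or None if the peer is not on any directly attached subnet."""
    peer = ipaddress.ip_address(peer_ip)
    for name, cidr in local_ifaces.items():
        if peer in ipaddress.ip_interface(cidr).network:
            return name
    return None


# Rack B machine from the diagram; the /24 prefix is an assumption.
rack_b_machine = {"eth0": "4.23.36.10/24", "eth1": "4.23.37.42/24"}

print(interface_for_peer(rack_b_machine, "4.23.36.40"))  # peer in rack B
print(interface_for_peer(rack_b_machine, "4.23.37.10"))  # eth0 of a rack A machine
```

Under the /24 assumption, the first call prints eth0 and the second prints eth1. This only describes IP-level adjacency; it says nothing about which interface LNET itself chooses.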


The primary MGS is at the top of one of the racks  (RACK A for this email.)

The machines are all up and running with Lustre on them, and they can all ping and ssh to each other.

Now for the problem: all of the clients in rack A, the rack with the MGS, can see all of the OSTs being served from all of the OSSes in both racks.
The clients in rack B can see only the MGS and the OSSes in rack A, and none in their own rack, rack B.  They all show timeouts like this:

Mar  7 18:43:31 dintnyc1303 kernel: LustreError: 2914:0:(events.c:55:request_out_callback()) Skipped 275 previous similar messages
Mar  7 18:45:41 dintnyc1303 kernel: LustreError: 2914:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1204915536, 5s ago)  req@ffff81020327f800 x123942/t0 o8->content-OST001c_UUID@4.23.36.41@tcp:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/-22
Mar  7 18:45:41 dintnyc1303 kernel: LustreError: 2914:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 403 previous similar messages
Mar  7 18:53:31 dintnyc1303 kernel: LustreError: 2914:0:(events.c:55:request_out_callback()) @@@ type 4, status -5  req@ffff810203276000 x124871/t0 o8->content-OST002d_UUID@4.23.36.46@tcp:6 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/-22
Mar  7 18:53:31 dintnyc1303 kernel: LustreError: 2914:0:(events.c:55:request_out_callback()) Skipped 278 previous similar messages
Mar  7 18:56:01 dintnyc1303 kernel: LustreError: 2914:0:(client.c:975:ptlrpc_expire_one_request()) @@@ network error (sent at 1204916161, 0s ago)  req@ffff8101f84dae00 x125140/t0 o8->content-OST001a_UUID@4.23.36.40@tcp:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/-22
Mar  7 18:56:01 dintnyc1303 kernel: LustreError: 2914:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 445 previous similar messages
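
Since most of these messages are "Skipped N previous similar messages", it can help to pull out just the OST and server address from the lines that name a target. The Python sketch below does that for log lines in the format shown above. The regular expression is written against these three sample lines only and is an assumption about the general format; `failing_targets` is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures the OST name and the server IP from text such as
# "o8->content-OST001c_UUID@4.23.36.41@tcp:6".
TARGET_RE = re.compile(r"(OST[0-9a-f]{4})_UUID.*?(\d{1,3}(?:\.\d{1,3}){3})@tcp")


def failing_targets(lines):
    """Count LustreError lines per (OST name, server IP) pair."""
    counts = Counter()
    for line in lines:
        if "LustreError" not in line:
            continue
        match = TARGET_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# The three lines from the log above that name a target.
sample = [
    "Mar  7 18:45:41 dintnyc1303 kernel: LustreError: 2914:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1204915536, 5s ago)  req@ffff81020327f800 x123942/t0 o8->content-OST001c_UUID@4.23.36.41@tcp:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/-22",
    "Mar  7 18:53:31 dintnyc1303 kernel: LustreError: 2914:0:(events.c:55:request_out_callback()) @@@ type 4, status -5  req@ffff810203276000 x124871/t0 o8->content-OST002d_UUID@4.23.36.46@tcp:6 lens 240/272 ref 2 fl Rpc:/0/0 rc 0/-22",
    "Mar  7 18:56:01 dintnyc1303 kernel: LustreError: 2914:0:(client.c:975:ptlrpc_expire_one_request()) @@@ network error (sent at 1204916161, 0s ago)  req@ffff8101f84dae00 x125140/t0 o8->content-OST001a_UUID@4.23.36.40@tcp:6 lens 240/272 ref 1 fl Rpc:/0/0 rc 0/-22",
]

for (ost, ip), n in sorted(failing_targets(sample).items()):
    print(ost, ip, n)
```

Run against a full /var/log/messages, this gives a quick list of which server addresses the failing requests are aimed at; in the sample above all three are on the 4.23.36.x subnet.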

I am a bit stuck.