[Lustre-discuss] controlling which eth interface lustre uses

Thu Oct 21 09:38:37 PDT 2010

Charles Taylor wrote:
> On Oct 21, 2010, at 9:51 AM, Brock Palen wrote:
> 
>> On Oct 21, 2010, at 9:48 AM, Joe Landman wrote:
>>
>>> On 10/21/2010 09:37 AM, Brock Palen wrote:
>>>> We recently added a new OSS. It has one 1Gb interface and one 10Gb
>>>> interface.
>>>>
>>>> The 10Gb interface is eth4 (10.164.0.166). The 1Gb interface is eth0
>>>> (10.164.0.10).
>>> They look like they are on the same subnet if you are using /24 ...
>> You are correct
>>
>> Both interfaces are on the same subnet:
>>
>> [root@oss4-gb ~]# route
>> Kernel IP routing table
>> Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
>> 10.164.0.0      *               255.255.248.0   U     0      0        0 eth0
>> 10.164.0.0      *               255.255.248.0   U     0      0        0 eth4
>> 169.254.0.0     *               255.255.0.0     U     0      0        0 eth4
>> default         10.164.0.1      0.0.0.0         UG    0      0        0 eth0
>>
>> There is no way to mask the lustre service away from the 1Gb  
>> interface?
> 
> We struggle with this as well but have not found a way to enforce  
> it.   You would think that lustre would honor the NID for incoming  
> *and* outgoing traffic but apparently the standard linux routing table  
> determines the outbound path and lnet is out of the picture.     Thus,  
> you end up having to assign separate subnets, shut down your eth0 (in  
> this case) interface, or use static routes to fine tune the routing  
> decisions (where possible).
> 
> We wish that the outgoing decision could be made on the basis of the  
> *NID* but that might be too intrusive with regard to the linux  
> kernel's network stack so I can understand, somewhat, why it is not  
> that way.   Still, it is somewhat counter-intuitive to go through all  
> the trouble of having the LNET layer and assigning NIDs only to have  
> them disregarded for outbound traffic.
> 
> Perhaps there is a way around this that we don't know about.

Source-based routing. You need to make sure both that each interface 
ignores ARP requests for the other interface's IP, and that traffic 
sourced from the 10Gig IP is routed out of that card.

This is the way I solved the problem:

#!/bin/sh
# Script to use policy based routing to ensure lustre traffic goes in and out from eth2.
# First make sure that eth0 and eth2 only respond to arp requests for their own ip
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore

# Now add a source based route - if the traffic is from the ip address of eth2, then send it via eth2
ip route add 10.1.0.0/16 dev eth2 table 2
ip rule add from $(ifconfig eth2 | awk 'BEGIN {FS="[ :]+"};/inet addr/{print $4}') table 2 priority 600
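
One way to check that the rule and table took effect is to list them and then ask the kernel which device it would choose for a given source address. The following is only a sketch: the addresses 10.1.0.5 (standing in for the eth2 address) and 10.1.0.20 (standing in for a Lustre peer) are made up for illustration, and the helper name show_check_cmds is hypothetical. It prints the commands instead of running them, so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Print (do not run) the commands that inspect a source-based routing setup.
# Arguments: SRC_IP (address bound to the fast interface), PEER_IP, TABLE.
# All three values used below are illustrative assumptions.
show_check_cmds() {
    src=$1
    peer=$2
    table=$3
    echo "ip rule show"                      # should list a "from SRC_IP" rule
    echo "ip route show table $table"        # should list the subnet via the fast NIC
    echo "ip route get $peer from $src"      # reports the device the kernel would pick
}

show_check_cmds 10.1.0.5 10.1.0.20 2
```

If the last command, run by hand, reports the 1Gb device rather than the 10Gb one, the rule is not matching. Note that `ip route get ... from` without an `iif` argument generally expects the source address to be one configured on the local machine.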

Having said this, I don't think it's what I'd set up now.
I'd use IPMI to get a serial console on the machine as my back door 
and/or use LACP bonding (can't remember which mode). If you do this, and 
IPMI shares the same physical port as eth0, then it is probably best to 
use eth1 as the failover link[1].
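
If bonding is used, the mode actually in effect can be read back from the bonding driver's status file. A minimal sketch, assuming the bond is called bond0 (a hypothetical name, not taken from this thread) and that the driver exposes its status under /proc/net/bonding/; the helper name bond_mode is also made up:

```shell
#!/bin/sh
# bond_mode FILE: print the value of the "Bonding Mode:" line from a
# bonding status file such as /proc/net/bonding/bond0 (bond0 is assumed).
bond_mode() {
    sed -n 's/^Bonding Mode: //p' "$1"
}

# Example, to be run on a host that actually has a bond0 device:
#   bond_mode /proc/net/bonding/bond0
```

An LACP bond reports an 802.3ad mode here, while a plain failover pair reports an active-backup mode.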

Chris
[1] We briefly tried IPMI with eth0 and eth1 bonded - DHCP packets got 
out, but the replies didn't get back. Presumably the switch was sending 
the replies to eth1 rather than eth0 (swapping the physical cables 
around was suggested, but we didn't try this).