[Lustre-discuss] Lustre routers capabilities

D. Marc Stearman marc at llnl.gov
Thu Apr 10 09:57:55 PDT 2008


The routing configuration is in /etc/modprobe.conf, and routes can  
also be dynamically added with lctl add_route.  All of our lustre  
servers are tcp servers, and we have client clusters that are tcp
only, as well as IB and Elan clusters.  Let's say you want to
add a router to one of the IB clusters:

We'll assume that there is either a free port on the IB fabric, or we  
change a compute node into a router node by adding some 10GigE hardware.

The IB cluster is o2ib0
The Lustre cluster is using tcp0
The IB cluster routers have connections on o2ib0 and tcp0.
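
For the router node itself, the persistent piece is just a
modprobe.conf that brings up both interfaces and enables forwarding.
Something like this (the eth0/ib0 interface names here are only an
example, use whatever your hardware presents):

options lnet networks="tcp0(eth0),o2ib0(ib0)" forwarding="enabled"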

Assuming you have an existing setup in place using either ip2nets or  
networks in your modprobe.conf, and that you have existing routes  
listed in the modprobe.conf, adding a router should be simple.  On
the client side, add the new routes to modprobe.conf, and do the same
on the lustre servers.  On the new router, make sure it has the same
modprobe.conf as the existing routers.  This will ensure that the
configuration works after a reboot.  Since these are production
clusters, we don't want to reboot any of them, so we need to add the
routes dynamically.  Let's say the new router has IP address
172.16.1.100 on tcp0 and 192.168.120.100 on o2ib0; you would then
need to run the following commands:

On lustre servers:
lctl --net o2ib0 add_route 172.16.1.100@tcp0

On IB clients:
lctl --net tcp0 add_route 192.168.120.100@o2ib0

The clients and servers will add the routes, but the routes will show
as down until you start LNET on the new router:
service lnet start
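
If you want to sanity-check the routes, lctl will list them along with
their state, e.g. on one of the clients or servers:

lctl show_route

The entries pointing at the new gateway should show as down until LNET
is started on the router, and flip to up afterwards.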

At LLNL, we have one large tcp0 that all the server clusters belong  
to, and LNET is smart enough to use all the routers equally, so once  
we add a new router, it just becomes part of the router pool for that  
cluster, thereby increasing the bandwidth of that cluster.
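
For completeness, the persistent version of that pool is just the
routes entry in modprobe.conf, with one gateway NID per router for the
remote network.  A sketch using the addresses above (the .101-.103
routers and the eth0/ib0 interface names are made up for the example;
double-check the range syntax against the LNET docs for your version):

On lustre servers:
options lnet networks="tcp0(eth0)" routes="o2ib0 172.16.1.[100-103]@tcp0"

On IB clients:
options lnet networks="o2ib0(ib0)" routes="tcp0 192.168.120.[100-103]@o2ib0"

All of the gateways listed for a network end up in the same pool, which
is what lets LNET spread the load across them.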

In reality, we rarely add new routers.  We typically spec out what we
call scalable units, so when we add onto a compute cluster, we add a  
known chunk of compute servers, with a known number of routers.  For  
example, if a scalable unit is 144 compute nodes, and 4 IB/10 GigE  
routers, then we may buy 4 scalable units, ending up with 576 compute  
nodes, and 16 lustre routers.

Hope that helps answer your question.

-Marc

----
D. Marc Stearman
LC Lustre Administration Lead
marc at llnl.gov
925.423.9670
Pager: 1.888.203.0641



On Apr 10, 2008, at 9:20 AM, Sébastien Buisson wrote:
> Hello Marc,
>
> Thank you for this feedback. This is a very thorough description of
> how you set up routers at LLNL.
> Just one question, however: according to you, a simple way to increase
> routing bandwidth is to add more Lustre routers, so that they are not
> the bottleneck in the cluster. But at LLNL, how do you deal with the
> Lustre routing configuration when you add new routers? I mean, how is
> the network load balanced between all the routers? Is it done in a
> dynamic way that supports adding or removing routers?
>
> Sebastien.
>
>
> D. Marc Stearman wrote:
>> Sebastien,
>>
>> For the most part we try to match the bandwidth of the disks, to the
>> network, to the number of routers needed.  I will be at the Lustre
>> User Group meeting in Sonoma, CA at the end of this  month giving a
>> talk about Lustre at LLNL, including our network design, and router
>> usage, but here is a quick description.
>>
>> We have a large federated ethernet core.  We then have edge switches
>> for each of our clusters that have links up to the core, and back
>> down to the routers or tcp-only clients.  In a typical situation, if
>> we think one file system can achieve 20 GB/s based on disk bandwidth,
>> we try to make sure that the filesystem cluster has 20 GB/s network
>> bandwidth (1GigE, 10GigE, etc.), and that the routers for the compute
>> cluster total up to 20 GB/s as well.  So we may have a server cluster
>> with servers having dual GigE links, and routers with 10 GigE links,
>> and we just try to match them up so the numbers are even.  Typically,
>> the routers in a cluster are the same node type as the compute
>> cluster, just populated with additional network hardware.
>>
>> In the future, we will likely be building a router cluster that will
>> bridge our existing federated ethernet core to a large InfiniBand
>> network, but that is at least one year away.
>>
>> Most of our routers are rather simple: they have one high speed
>> interconnect HCA (Quadrics, Mellanox IB), and one network card (dual
>> GigE, or single 10 GigE).  I don't think we've hit any bus bandwidth
>> limitation, and I haven't seen any of them really pressed for CPU or
>> memory.  We do make sure to turn off irq_affinity when we have a
>> single network interface (the 10 GigE routers), and we've had to tune
>> the buffers and credits on the routers to get better throughput.  We
>> have noticed a problem with serialization of checksum processing on a
>> single core (bz #14690).
>>
>> The beauty of routers, though, is that if you find that they are all
>> running at capacity, you can always add a couple more, and move the
>> bottleneck to the network or disks.  We find we are mostly slowed
>> down by the disks.
>>
>> -Marc
>>
>> ----
>> D. Marc Stearman
>> LC Lustre Administration Lead
>> marc at llnl.gov
>> 925.423.9670
>> Pager: 1.888.203.0641
>>
>>
>>
>> On Apr 10, 2008, at 1:06 AM, Sébastien Buisson wrote:
>>> Let's consider that the internal bus of the machine is big enough
>>> so that it will not be saturated. In that case, what will be the
>>> limiting factor? Memory? CPU?
>>> I know that it depends on how many IB cards are plugged into the
>>> machine, but generally speaking, is the routing activity CPU or
>>> memory hungry?
>>>
>>> By the way, are there people on this list who have feedback about
>>> Lustre router sizing? For instance, I know that Lustre routers have
>>> been set up at LLNL. What is the throughput obtained via the
>>> routers, compared to the raw bandwidth of the interconnect?
>>>
>>> Thanks,
>>> Sebastien.
>>>
>>>
>>> Brian J. Murrell wrote:
>>>> On Wed, 2008-04-09 at 19:07 +0200, Sébastien Buisson wrote:
>>>>> I mean, if I have an available bandwidth of 100 on each side of a
>>>>> router, what will be the max reachable bandwidth from clients on
>>>>> one side of the router to servers on the other side of the
>>>>> router? Is it 50? 80? 99? Is the routing process CPU or memory
>>>>> hungry?
>>>> While I can't answer these things specifically, another important
>>>> consideration is the bus architecture involved.  How many IB cards
>>>> can you put on a bus before you saturate the bus?
>>>>
>>>> b.
>>>>
>>>>
>>>>



