[Lustre-discuss] Lustre routers capabilities

Sébastien Buisson sebastien.buisson at bull.net
Fri Apr 11 05:14:44 PDT 2008


Yes, thank you Marc, that was really the piece of information I was 
looking for: among all the available routes, LNET is smart enough to use 
them equally. So I can define all the routes on all my clients and all 
my servers, and LNET will take care of using all the routers equally.
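
As an illustration of what that would look like (hypothetical NIDs, and the 
routes syntax should be double-checked against the manual for the Lustre 
version in use), the tcp0 servers could each carry one routes string listing 
all the gateways, with the clients holding the mirror-image entries pointing 
at the routers' o2ib0 NIDs:

# two routers towards o2ib0; LNET balances traffic across both gateways
options lnet routes="o2ib0 172.16.1.101@tcp0; o2ib0 172.16.1.102@tcp0"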

Sebastien.


D. Marc Stearman wrote:
> The routing configuration is in /etc/modprobe.conf, and routes can  
> also be dynamically added with lctl add_route.  All of our lustre  
> servers are tcp servers, and we have client clusters that are tcp  
> only, and we also have IB and Elan clusters.  Let's say you want to  
> add a router to one of the IB clusters:
> 
> We'll assume that there is either a free port on the IB fabric, or we  
> change a compute node into a router node by adding some 10GigE hardware.
> 
> The IB cluster is o2ib0
> The Lustre cluster is using tcp0
> The IB cluster routers have connections on o2ib0 and tcp0.
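> 
> As a rough sketch (the interface names here are only placeholders), a 
> router's modprobe.conf could declare both networks and enable forwarding:
> 
> # router node sits on both nets and forwards LNET traffic between them
> options lnet networks="tcp0(eth0),o2ib0(ib0)" forwarding="enabled"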
> 
> Assuming you have an existing setup in place using either ip2nets or  
> networks in your modprobe.conf, and that you have existing routes  
> listed in the modprobe.conf, adding a router should be simple.  On  
> the client side, add the routes in the modprobe.conf, and on the  
> lustre servers add the routes to the modprobe.conf.  On the new  
> router, make sure it has the same modprobe.conf as the existing  
> routers.  This will ensure that the configuration works after a  
> reboot.  Since these are production clusters, we don't want to reboot  
> any of them, so we need to add the routes dynamically.  Let's say that  
> the new router has IP address 172.16.1.100 on tcp0 and  
> 192.168.120.100 on o2ib0, you would need to run the following commands:
> 
> On lustre servers:
> lctl --net o2ib0 add_route 172.16.1.100@tcp0
> 
> On IB clients:
> lctl --net tcp0 add_route 192.168.120.100@o2ib0
> 
> The clients/servers will add the routes, and they will be down until  
> you start LNET on the new router:
> service lnet start
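> 
> Once LNET is running on the new router, you can check from a client or 
> server that the routes have come up (the exact output varies by version):
> 
> lctl route_list
> 
> To make the same routes stick across a reboot, the equivalent 
> modprobe.conf entries (same hypothetical NIDs as above, appended to 
> whatever routes string is already there) would be roughly:
> 
> On lustre servers: options lnet routes="o2ib0 172.16.1.100@tcp0"
> On IB clients:     options lnet routes="tcp0 192.168.120.100@o2ib0"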
> 
> At LLNL, we have one large tcp0 that all the server clusters belong  
> to, and LNET is smart enough to use all the routers equally, so once  
> we add a new router, it just becomes part of the router pool for that  
> cluster, thereby increasing the bandwidth of that cluster.
> 
> In reality, we rarely add new routers.  We typically spec out what we  
> call scalable units, so when we add onto a compute cluster, we add a  
> known chunk of compute servers, with a known number of routers.  For  
> example, if a scalable unit is 144 compute nodes, and 4 IB/10 GigE  
> routers, then we may buy 4 scalable units, ending up with 576 compute  
> nodes, and 16 lustre routers.
> 
> Hope that helps answer your question.
> 
> -Marc
> 
> ----
> D. Marc Stearman
> LC Lustre Administration Lead
> marc at llnl.gov
> 925.423.9670
> Pager: 1.888.203.0641
> 
> 
> 
> On Apr 10, 2008, at 9:20 AM, Sébastien Buisson wrote:
>> Hello Marc,
>>
>> Thank you for this feedback. This is a very thorough description of
>> how you set up routers at LLNL.
>> Just one question, however: according to you, a simple way to increase
>> routing bandwidth is to add more Lustre routers, so that they are not
>> the bottleneck in the cluster. But at LLNL, how do you deal with the
>> Lustre routing configuration when you add new routers? I mean, how is
>> the network load balanced between all the routers? Is it done in a
>> dynamic way that supports adding or removing routers?
>>
>> Sebastien.
>>
>>
>> D. Marc Stearman wrote:
>>> Sebastien,
>>>
>>> For the most part we try to match the bandwidth of the disks to the
>>> network and to the number of routers needed.  I will be at the Lustre
>>> User Group meeting in Sonoma, CA at the end of this  month giving a
>>> talk about Lustre at LLNL, including our network design, and router
>>> usage, but here is a quick description.
>>>
>>> We have a large federated ethernet core.  We then have edge switches
>>> for each of our clusters that have links up to the core, and back
>>> down to the routers or tcp-only clients.  In a typical situation, if
>>> we think one file system can achieve 20 GB/s based on disk bandwidth,
>>> we try to make sure that the filesystem cluster has 20 GB/s network
>>> bandwidth (1GigE, 10GigE, etc.), and that the routers for the compute
>>> cluster total up to 20 GB/s as well.  So we may have a server cluster
>>> with servers having dual GigE links, and routers with 10 GigE links,
>>> and we just try to match them up so the numbers are even.  Typically,
>>> the routers in a cluster are the same node type as the compute
>>> cluster, just populated with additional network hardware.
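>>> 
>>> As a purely illustrative back-of-the-envelope (not our actual counts, 
>>> and ignoring protocol overhead): 20 GB/s is about 160 Gb/s, which would 
>>> mean on the order of 16 routers at 10 GigE each, or roughly 80 servers 
>>> with dual GigE links (about 2 Gb/s, i.e. ~0.25 GB/s, per server).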
>>>
>>> In the future, we will likely be building a router cluster that will
>>> bridge our existing federated ethernet core to a large InfiniBand
>>> network, but that is at least one year away.
>>>
>>> Most of our routers are rather simple: they have one high speed
>>> interconnect HCA (Quadrics, Mellanox IB) and one network card (dual
>>> GigE, or single 10 GigE).  I don't think we've hit any bus bandwidth
>>> limitation, and I haven't seen any of them really pressed for CPU or
>>> memory.  We do make sure to turn off irq_affinity when we have a
>>> single network interface (the 10 GigE routers), and we've had to tune
>>> the buffers and credits on the routers to get better throughput.  We
>>> have noticed a problem with serialization of checksum processing on a
>>> single core (bz #14690).
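>>> 
>>> That tuning is all done through module parameters.  As a sketch only 
>>> (the values below are placeholders, and the exact parameter names and 
>>> defaults depend on the Lustre version), a router's modprobe.conf might 
>>> grow lines like:
>>> 
>>> # LNET router buffer pools and LND credits -- placeholder values
>>> options lnet tiny_router_buffers=1024 small_router_buffers=8192 large_router_buffers=512
>>> options ksocklnd credits=256 peer_credits=8 enable_irq_affinity=0
>>> options ko2iblnd credits=256 peer_credits=8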
>>>
>>> The beauty of routers though, is that if you find that they are all
>>> running at capacity, you can always add a couple more, and move the
>>> bottleneck to the network or disks.  We find we are mostly slowed
>>> down by the disks.
>>>
>>> -Marc
>>>
>>> ----
>>> D. Marc Stearman
>>> LC Lustre Administration Lead
>>> marc at llnl.gov
>>> 925.423.9670
>>> Pager: 1.888.203.0641
>>>
>>>
>>>
>>> On Apr 10, 2008, at 1:06 AM, Sébastien Buisson wrote:
>>>> Let's consider that the internal bus of the machine is big enough
>>>> that it will not be saturated. In that case, what will be the
>>>> limiting factor? Memory? CPU?
>>>> I know that it depends on how many IB cards are plugged into the
>>>> machine, but generally speaking, is the routing activity CPU or
>>>> memory hungry?
>>>>
>>>> By the way, are there people on this list who have feedback about
>>>> Lustre router sizing? For instance, I know that Lustre routers have
>>>> been set up at LLNL. What is the throughput obtained via the
>>>> routers, compared to the raw bandwidth of the interconnect?
>>>>
>>>> Thanks,
>>>> Sebastien.
>>>>
>>>>
>>>> Brian J. Murrell wrote:
>>>>> On Wed, 2008-04-09 at 19:07 +0200, Sébastien Buisson wrote:
>>>>>> I mean, if I have an available bandwidth of 100 on each side of a
>>>>>> router, what will be the max reachable bandwidth from clients on
>>>>>> one side of the router to servers on the other side of the
>>>>>> router? Is it 50? 80? 99? Is the routing process CPU or memory
>>>>>> hungry?
>>>>> While I can't answer these things specifically, another important
>>>>> consideration is the bus architecture involved.  How many IB cards
>>>>> can you put on a bus before you saturate the bus?
>>>>>
>>>>> b.