[Lustre-discuss] 1GB throughput limit on OST (1.8.5)?

Kevin Van Maren kevin.van.maren at oracle.com
Thu Jan 27 09:16:38 PST 2011


Normally, if you are having a problem with write bandwidth, you need to futz
with the switch's configuration.  If you are having problems with read
bandwidth, you need to futz with the server's bonding config (the xmit
hash policy is the usual culprit).

Are you testing multiple clients to the same server?

Are you using mode 6 because you don't have bonding support in your 
switch?  I normally use 802.3ad mode,
assuming your switch supports link aggregation.
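
For reference, a minimal sketch of an 802.3ad bond on RHEL 5 (the interface
names and addresses are only placeholders, the switch ports must be set up
for LACP, and older initscripts may need the bonding options in
/etc/modprobe.conf instead of BONDING_OPTS):

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    IPADDR=192.168.1.10
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none
    BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none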


I was bonding 2x1Gb links for Lustre back in 2004.  That was before
BOND_XMIT_POLICY_LAYER34 was in the kernel, so I had to hack the bonding
xmit hash myself (with multiple NICs, the standard layer2 hashing does not
produce a uniform distribution, and it cannot work at all if traffic goes
through a router, since every off-subnet packet then carries the router's
MAC address).
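
You can verify which mode and hash policy a bond is actually using on the
server; for example (bond name assumed):

    grep -iE 'bonding mode|hash policy' /proc/net/bonding/bond0

should report lines like "Bonding Mode: IEEE 802.3ad Dynamic link
aggregation" and "Transmit Hash Policy: layer3+4 (1)" on reasonably recent
kernels.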

Any one connection (socket, or node/node connection) will use only one
gigabit link.  While it is possible to use two links with round-robin
(bonding mode 0), that normally only helps client reads (the server cannot
choose which link it receives data on; the switch picks that), and it has
the serious downside of out-of-order packets on the TCP stream.

[If you want a client to get better bandwidth to a single file, change your
default stripe count to 2, so the file will hit two different servers.]
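
For example, with the file system mounted at /mnt/lustre (path assumed),
the default striping on a directory can be raised with:

    lfs setstripe -c 2 /mnt/lustre/testdir

(the long form --count 2 should also work on 1.8).  New files created under
that directory are then striped across two OSTs; lfs getstripe <file> shows
the layout actually used.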

Kevin


David Merhar wrote:
> Sorry - little b all the way around.
>
> We're limited to 1Gb per OST.
>
> djm
>
>
>
> On Jan 27, 2011, at 7:48 AM, Balagopal Pillai wrote:
>
>   
>> I guess you have two gigabit NICs bonded in mode 6, and not two 1GB
>> NICs? (B = bytes, b = bits.)  The max aggregate throughput could be
>> about 200 MB/s out of the two bonded NICs.  I think mode 0 bonding works
>> only with Cisco EtherChannel or something similar on the switch side.
>> Same with the FC connection: it is 4Gbps (not 4GBps), or about
>> 400-500 MB/s max throughput.
>>
>> Maybe you could also check the maximum read and write capability of the
>> RAID controller, not just the network.  When testing with dd, some of
>> the data remains as dirty data until it is flushed to disk.  I think the
>> default dirty_background_ratio is 10% on RHEL 5, which would be sizable
>> if your OSSs have lots of RAM.  There is a chance of the OSS locking up
>> once it hits the dirty_ratio limit, which is 40% by default.  So a more
>> aggressive flush to disk (by lowering dirty_background_ratio) and a bit
>> more headroom before it hits dirty_ratio are generally desirable, if
>> your RAID controller can keep up with it.
>>
>> So with your current setup, I guess you could get a max of 400 MB/s out
>> of both OSSs if they both have two 1Gb NICs in them.  Maybe if you have
>> one of the switches from Dell with four 10Gb ports in it (their
>> PowerConnect 6248), 10Gb NICs for your OSSs might be a cheaper way to
>> increase the aggregate performance.  I think over 1GB/s from a client is
>> possible in cases where you use InfiniBand and RDMA to deliver data.
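
(As a concrete sketch of the flush tuning described above: the sysctl names
are the standard RHEL 5 knobs, but the values here are only illustrative
and depend on the RAID controller keeping up.

    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=20

Add the same settings to /etc/sysctl.conf to make them persistent.)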
>>
>>
>> David Merhar wrote:
>>     
>>> Our OSS's with 2x1GB NICs (bonded) appear limited to 1GB worth of
>>> write throughput each.
>>>
>>> Our setup:
>>> 2 OSS serving 1 OST each
>>> Lustre 1.8.5
>>> RHEL 5.4
>>> New Dell M610 blade servers with plenty of CPU and RAM
>>> All SAN fibre connections are at least 4GB
>>>
>>> Some notes:
>>> - A direct write (dd) from a single OSS to the OST gets 4GB, the  
>>> OSS's
>>> fibre wire speed.
>>> - A single client will get 2GB of lustre write speed, the client's
>>> ethernet wire speed.
>>> - We've tried bond mode 6 and 0 on all systems.  With mode 6 we will
>>> see both NICs on both OSSs receiving data.
>>> - We've tried multiple OSTs per OSS.
>>>
>>> But 2 clients writing a file will get 2GB of total bandwidth to the
>>> filesystems.  We have been unable to isolate any particular resource
>>> bottleneck.  None of the systems (MDS, OSS, or client) seem to be
>>> working very hard.
>>>
>>> The 1GB per OSS threshold is so consistent that it almost appears to be
>>> by design - and hopefully we're missing something obvious.
>>>
>>> Any advice?
>>>
>>> Thanks.
>>>
>>> djm
>>>
>>>
>>>