[Lustre-discuss] 1GB throughput limit on OST (1.8.5)?

David Merhar merhar at arlut.utexas.edu
Thu Jan 27 10:09:22 PST 2011


Appreciate the input.

We've been using mode 6 since I expect it provides the fewest
configuration pitfalls.  If the single stream becomes our bottleneck
we'll mess with aggregation.
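
For context, a minimal sketch of a mode 6 bond on RHEL 5 (device names and
addresses are placeholders, not our actual config):

   # /etc/sysconfig/network-scripts/ifcfg-bond0
   DEVICE=bond0
   BOOTPROTO=none
   ONBOOT=yes
   IPADDR=192.168.1.10          # placeholder address
   NETMASK=255.255.255.0
   # mode=6 (balance-alb) needs nothing on the switch side;
   # mode=4 (802.3ad/LACP) would need link aggregation configured on the switch
   BONDING_OPTS="mode=6 miimon=100"

   # each slave ifcfg-ethX just carries: MASTER=bond0, SLAVE=yes,
   # BOOTPROTO=none, ONBOOT=yes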

What I can't find is the bottleneck in our current setup.  With four
servers - two clients and two OSSs - I'd expect 4Gb/s of aggregate
throughput, with each client holding a single connection to each OST.
Instead we're limited to about 2Gb/s, and each OSS appears capped at
1Gb/s of I/O.  The strange part is that iptraf on the OSSs shows traffic
on the expected connections (2 x 2), but each at only 35-65% of link
bandwidth.

And a third client writing to the filesystem will briefly increase
aggregate throughput, but it quickly settles back to ~2Gb/s.

djm




On Jan 27, 2011, at 11:16 AM, Kevin Van Maren wrote:

> Normally, if you're having a problem with write BW you need to futz
> with the switch.  If you're having
> problems with read BW, you need to futz with the server's config
> (the xmit hash policy is the usual culprit).
>
> Are you testing multiple clients to the same server?
>
> Are you using mode 6 because you don't have bonding support in your  
> switch?  I normally use 802.3ad mode,
> assuming your switch supports link aggregation.
>
>
> I was bonding 2x1Gb links for Lustre back in 2004.  That was before
> BOND_XMIT_POLICY_LAYER34
> was in the kernel, so I had to hack the bond xmit hash (with multiple
> NICs being standard, layer2 hashing does not
> produce a uniform distribution, and it can't work at all when traffic
> goes through a router).
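>
> These days the hash is just a bonding module option; a rough sketch of
> the relevant settings (values illustrative):
>
>    # /etc/modprobe.conf (or BONDING_OPTS in ifcfg-bond0 on newer RHEL 5)
>    alias bond0 bonding
>    # 802.3ad plus a layer3+4 hash spreads different TCP connections across
>    # the slaves; layer2 hashes on MAC addresses and collapses to one link
>    # when traffic goes through a router
>    options bond0 mode=802.3ad miimon=100 xmit_hash_policy=layer3+4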
>
> Any one connection (socket or node/node connection) will use only  
> one gigabit link.  While it is possible
> to use two links using round-robin, that normally only helps for  
> client reads (server can't choose which link to
> receive data, the switch picks that), and has the serious downside  
> of out-of-order packets on the TCP stream.
>
> [If you want clients to get better bandwidth on a single
> file, change your default stripe count to 2, so it
> will hit two different servers.]
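>
> For example, assuming the filesystem is mounted at /mnt/lustre (adjust
> the path):
>
>    # set a default stripe count of 2 on a directory; new files created
>    # under it will be spread across two OSTs
>    lfs setstripe -c 2 /mnt/lustre
>    lfs getstripe /mnt/lustre/somefile    # check the resulting layout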
>
> Kevin
>
>
> David Merhar wrote:
>> Sorry - little b all the way around.
>>
>> We're limited to 1Gb per OST.
>>
>> djm
>>
>>
>>
>> On Jan 27, 2011, at 7:48 AM, Balagopal Pillai wrote:
>>
>>
>>> I guess you have two gigabit NICs bonded in mode 6 and not two 1GB NICs?
>>> (B = bytes, b = bits.)  The max aggregate throughput could be about
>>> 200MB/s out of the two bonded NICs.  I think mode 0 bonding works only
>>> with Cisco EtherChannel or something similar on the switch side.  Same
>>> with the FC connection: it's 4Gbps (not 4GBps), or about 400-500 MB/s
>>> max throughput.
>>>
>>> Maybe you could also check the max read and write capability of the
>>> RAID controller, not just the network.  When testing with dd, some of
>>> the data remains as dirty data until it's flushed to disk.  I think the
>>> default background ratio is 10% for RHEL 5, which would be sizable if
>>> your OSSs have lots of RAM.  There is a chance of the OSS locking up
>>> once it hits the dirty_ratio limit, which is 40% by default.  So a
>>> somewhat more aggressive flush to disk (by lowering the background
>>> ratio) and a bit more headroom before it hits dirty_ratio are generally
>>> desirable, if your RAID controller can keep up with it - see the sysctl
>>> sketch below.
>>>
>>> So with your current setup, I guess you could get a max of 400MB/s out
>>> of both OSSs combined, if they each have two 1Gb NICs in them.  Maybe
>>> with one of the Dell switches that has four 10Gb ports (their
>>> PowerConnect 6248), 10Gb NICs for your OSSs might be a cheaper way to
>>> increase aggregate performance.  I think over 1GB/s from a client is
>>> possible in cases where you use InfiniBand and RDMA to deliver data.
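>>>
>>> For illustration, those flushing knobs are plain sysctls (the numbers
>>> here are only examples; tune them for your controller):
>>>
>>>    # flush dirty pages to disk sooner, and keep headroom below the hard limit
>>>    sysctl -w vm.dirty_background_ratio=5
>>>    sysctl -w vm.dirty_ratio=20
>>>    # add the same settings to /etc/sysctl.conf to make them persistent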
>>>
>>>
>>> David Merhar wrote:
>>>
>>>> Our OSS's with 2x1GB NICs (bonded) appear limited to 1GB worth of
>>>> write throughput each.
>>>>
>>>> Our setup:
>>>> 2 OSS serving 1 OST each
>>>> Lustre 1.8.5
>>>> RHEL 5.4
>>>> New Dell M610 blade servers with plenty of CPU and RAM
>>>> All SAN fibre connections are at least 4GB
>>>>
>>>> Some notes:
>>>> - A direct write (dd) from a single OSS to the OST gets 4GB, the OSS's
>>>> fibre wire speed (a sample dd invocation follows these notes).
>>>> - A single client will get 2GB of lustre write speed, the client's
>>>> ethernet wire speed.
>>>> - We've tried bond mode 6 and 0 on all systems.  With mode 6 we  
>>>> will
>>>> see both NICs on both OSSs receiving data.
>>>> - We've tried multiple OSTs per OSS.
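>>>>
>>>> For reference, that kind of direct write test might look like the
>>>> following (mount point and sizes are illustrative):
>>>>
>>>>    # write 16GB with O_DIRECT so the page cache is bypassed and the
>>>>    # RAID/FC path is what actually gets measured
>>>>    dd if=/dev/zero of=/mnt/ost0/ddtest bs=1M count=16384 oflag=direct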
>>>>
>>>> But 2 clients writing a file will get 2GB of total bandwidth to the
>>>> filesystems.  We have been unable to isolate any particular  
>>>> resource
>>>> bottleneck.  None of the systems (MDS, OSS, or client) seem to be
>>>> working very hard.
>>>>
>>>> The 1GB per OSS threshold is so consistent that it almost appears by
>>>> design - and hopefully we're missing something obvious.
>>>>
>>>> Any advice?
>>>>
>>>> Thanks.
>>>>
>>>> djm
>>>>
>>>>
>>>>
