[Lustre-discuss] 1GB throughput limit on OST (1.8.5)?
David Merhar
merhar at arlut.utexas.edu
Thu Jan 27 10:09:22 PST 2011
Appreciate the input.
We've been using mode 6 as I expect it provides the fewest
configuration pitfalls. If the single stream becomes our bottleneck
we'll mess with aggregation.
What I can't find is the bottleneck in our current setup. With 4
nodes - 2 clients, 2 OSSs - I'd expect 4Gb of aggregate throughput,
since each client has a single connection to each OST. Instead we're
limited to 2Gb, and each OSS appears limited to 1Gb of I/O. The
strangeness is that iptraf on the OSSs shows traffic through the
expected connections (2 x 2), but each at only 35-65% of link
bandwidth.
And a third client writing to the filesystem will briefly increase
aggregate throughput, but it quickly settles back to ~2Gb.
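
One way to rule the raw network path in or out (assuming iperf is
available on the clients and OSSs; "oss1" below is a placeholder
hostname) is a plain TCP test outside of Lustre:

   # on the OSS
   iperf -s

   # on a client: 4 parallel streams to the same OSS, 30 seconds
   iperf -c oss1 -P 4 -t 30

If the bonded pair also tops out near 1Gb here, the limit is in the
bonding/switch configuration rather than in Lustre.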
djm
On Jan 27, 2011, at 11:16 AM, Kevin Van Maren wrote:
> Normally, if you are having a problem with write BW, you need to futz
> with the switch. If you are having
> problems with read BW, you need to futz with the server's config
> (xmit hash policy is the usual culprit).
>
> Are you testing multiple clients to the same server?
>
> Are you using mode 6 because you don't have bonding support in your
> switch? I normally use 802.3ad mode,
> assuming your switch supports link aggregation.
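>
> A minimal sketch of that setup on RHEL 5 (bond0 and the exact options
> are assumptions; the switch ports also have to be configured as an
> LACP aggregate):
>
>    # /etc/modprobe.conf
>    alias bond0 bonding
>    options bond0 mode=802.3ad miimon=100 xmit_hash_policy=layer3+4
>
> layer3+4 hashes on IP addresses and TCP/UDP ports, so different
> client/OSS connections can be spread across both slaves.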
>
>
> I was bonding 2x1Gb links for Lustre back in 2004. That was before
> BOND_XMIT_POLICY_LAYER34
> was in the kernel, so I had to hack the bond xmit hash (with
> multiple NICs, the standard layer2 hashing does not
> produce a uniform distribution, and can't work if going through a
> router).
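>
> Roughly, the stock layer2 hash is just an XOR of the last octet of
> the source and destination MACs, modulo the number of slaves; behind
> a router the destination MAC is always the gateway's, so every flow
> from a given host lands on the same slave. A toy illustration (the
> octet values are made up):
>
>    src=0x1a; gw=0x0c; slaves=2
>    echo $(( (src ^ gw) % slaves ))   # same slave for every connection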
>
> Any one connection (socket or node/node connection) will use only
> one gigabit link. While it is possible
> to use two links using round-robin, that normally only helps for
> client reads (server can't choose which link to
> receive data, the switch picks that), and has the serious downside
> of out-of-order packets on the TCP stream.
>
> [If you want better per-client bandwidth for a single
> file, change your default stripe count to 2, so it
> will hit two different servers.]
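>
> A minimal example (the mount point is a placeholder):
>
>    lfs setstripe -c 2 /mnt/lustre
>
> New files created under that directory are then striped across two
> OSTs, so a single-file write can hit both OSSs.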
>
> Kevin
>
>
> David Merhar wrote:
>> Sorry - little b all the way around.
>>
>> We're limited to 1Gb per OST.
>>
>> djm
>>
>>
>>
>> On Jan 27, 2011, at 7:48 AM, Balagopal Pillai wrote:
>>
>>
>>> I guess you have two gigabit NICs bonded in mode 6 and not two
>>> 1GB NICs? (B = bytes, b = bits.) The max aggregate throughput could
>>> be about 200MBps out of the 2 bonded NICs. I think mode 0 bonding
>>> works only with Cisco EtherChannel or something similar on the
>>> switch side. Same with the FC connection: it's 4Gbps (not 4GBps),
>>> or about 400-500MBps max throughput.
>>>
>>> Maybe you could also check the max read and write capabilities of
>>> the RAID controller, not just the network. When testing with dd,
>>> some of the data remains as dirty data until it's flushed to disk.
>>> I think the default background ratio is 10% for RHEL 5, which would
>>> be sizable if your OSSs have lots of RAM. There is a chance of the
>>> OSS locking up once it hits the dirty_ratio limit, which is 40% by
>>> default. So a bit more aggressive flushing to disk by lowering the
>>> background ratio, and a bit more headroom before it hits the
>>> dirty_ratio, is generally desirable if your RAID controller can
>>> keep up with it.
>>>
>>> So with your current setup, I guess you could get a max of 400MBps
>>> out of both OSSs if they both have two 1Gb NICs in them. Maybe if
>>> you have one of the switches from Dell that has 4 10Gb ports in it
>>> (their PowerConnect 6248), 10Gb NICs for your OSSs might be a
>>> cheaper way to increase the aggregate performance. I think over
>>> 1GBps from a client is possible in cases where you use InfiniBand
>>> and RDMA to deliver data.
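>>>
>>> A sketch of that tuning (the 5% value is illustrative, not a
>>> recommendation):
>>>
>>>    # flush dirty pages earlier, leaving headroom before writers
>>>    # are throttled at vm.dirty_ratio (40% by default on RHEL 5)
>>>    sysctl -w vm.dirty_background_ratio=5
>>>    sysctl -w vm.dirty_ratio=40
>>>
>>> (Persist the values in /etc/sysctl.conf if they help.)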
>>>
>>>
>>> David Merhar wrote:
>>>
>>>> Our OSSs with 2x1Gb NICs (bonded) appear limited to 1Gb worth of
>>>> write throughput each.
>>>>
>>>> Our setup:
>>>> 2 OSSs serving 1 OST each
>>>> Lustre 1.8.5
>>>> RHEL 5.4
>>>> New Dell M610 blade servers with plenty of CPU and RAM
>>>> All SAN fibre connections are at least 4Gb
>>>>
>>>> Some notes:
>>>> - A direct write (dd) from a single OSS to the OST gets 4Gb, the
>>>> OSS's fibre wire speed (a sketch of that test follows these notes).
>>>> - A single client will get 2Gb of Lustre write speed, the client's
>>>> ethernet wire speed.
>>>> - We've tried bond modes 6 and 0 on all systems. With mode 6 we
>>>> will see both NICs on both OSSs receiving data.
>>>> - We've tried multiple OSTs per OSS.
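>>>>
>>>> (For reference, the direct-write test was presumably along these
>>>> lines; the target path is a placeholder, and oflag=direct bypasses
>>>> the page cache:
>>>>
>>>>    dd if=/dev/zero of=/path/to/ost-backing-fs/testfile bs=1M \
>>>>       count=16384 oflag=direct
>>>> )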
>>>>
>>>> But 2 clients writing a file will get 2Gb of total bandwidth to the
>>>> filesystem. We have been unable to isolate any particular
>>>> resource
>>>> bottleneck. None of the systems (MDS, OSS, or client) seems to be
>>>> working very hard.
>>>>
>>>> The 1Gb-per-OSS threshold is so consistent that it almost
>>>> appears to be by
>>>> design - and hopefully we're missing something obvious.
>>>>
>>>> Any advice?
>>>>
>>>> Thanks.
>>>>
>>>> djm
>>>>
>>>>
>>>>