[Lustre-discuss] 1GB throughput limit on OST (1.8.5)?
merhar at arlut.utexas.edu
Thu Jan 27 10:09:22 PST 2011
Appreciate the input.
We've been using mode 6 as I expect it provides the fewest
configuration pitfalls. If the single stream becomes our bottleneck
we'll mess with aggregation.
What I can't find is the bottleneck in our current setup. With 4
servers - 2 clients, two OSSs - I'd expect 4Gb of aggregate throughput,
where each client has a single connection to each OST. Instead we're
limited to 2Gb, where each OSS appears limited to 1Gb of I/O. The
strangeness is that iptraf on the OSSs shows traffic through the
expected connections (2 x 2), but at only 35% - 65% of bandwidth.
And a third client writing to the filesystem will briefly increase
aggregate throughput, but it quickly settles back to ~2Gb.
On Jan 27, 2011, at 11:16 AM, Kevin Van Maren wrote:
> Normally if you are having a problem with write BW, you need to futz
> with the switch. If you were having
> problems with read BW, you need to futz with the server's config
> (xmit hash policy is the usual culprit).
> Are you testing multiple clients to the same server?
> Are you using mode 6 because you don't have bonding support in your
> switch? I normally use 802.3ad mode,
> assuming your switch supports link aggregation.
> I was bonding 2x1Gb links for Lustre back in 2004. That was before
> the xmit hash policy option was in the kernel, so I had to hack the
> bond xmit hash myself (with multiple NICs now standard, layer2
> hashing does not produce a uniform distribution, and can't work if
> going through a router).
> Any one connection (socket or node/node connection) will use only
> one gigabit link. While it is possible
> to use two links using round-robin, that normally only helps for
> client reads (server can't choose which link to
> receive data, the switch picks that), and has the serious downside
> of out-of-order packets on the TCP stream.
> [If you want clients to have better client bandwidth for a single
> file, change your default stripe count to 2, so it
> will hit two different servers.]
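As a concrete sketch of Kevin's two suggestions (802.3ad bonding with a layer3+4 transmit hash, and a default stripe count of 2), something like the following on RHEL 5. The interface name, bond options, and mount point are placeholders for your own setup, and the switch ports must have LACP configured for 802.3ad to come up:

```shell
# /etc/modprobe.conf -- 802.3ad (LACP) bonding with a layer3+4 xmit
# hash, so different TCP connections can be spread across both links.
# Requires matching LACP configuration on the switch ports.
#   alias bond0 bonding
#   options bonding mode=802.3ad miimon=100 xmit_hash_policy=layer3+4

# On a client: raise the default stripe count to 2 so a single file is
# striped across both OSTs (and therefore both OSSs). /mnt/lustre is a
# placeholder for the actual client mount point.
lfs setstripe -c 2 /mnt/lustre
```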
> David Merhar wrote:
>> Sorry - little b all the way around.
>> We're limited to 1Gb per OST.
>> On Jan 27, 2011, at 7:48 AM, Balagopal Pillai wrote:
>>> I guess you have two gigabit nics bonded in mode 6, and not two
>>> 1GB nics? (B - Bytes, b - bits) The max aggregate throughput could
>>> be about 200-250 MBps out of the 2 bonded nics. I think mode 0
>>> bonding works only with cisco etherchannel or something similar on
>>> the switch side. Same with the FC connection: it's 4Gbps (not
>>> 4GBps), or about 400-500 MBps max throughput. Maybe you could also
>>> measure the max read and write speeds of the raid controller, not
>>> just the network. When testing with dd, some of the data remains as
>>> dirty data until it's flushed to disk. I think the default
>>> background ratio is 10% for rhel5, which can be sizable if your oss
>>> have lots of ram. There is a chance of the oss locking up once it
>>> hits the dirty_ratio limit, which is 40% by default. So a bit more
>>> aggressive flushing to disk by lowering the background_ratio, and a
>>> bit more headroom before it hits the dirty_ratio, is generally a
>>> good idea if your raid controller can keep up with it. So with your
>>> current setup, I guess you could get a max of 400MBps out of both
>>> OSS's if they both have two 1Gb nics in them. Maybe if you have one
>>> of the switches from Dell that has 4 10Gb ports in it (their
>>> powerconnect 6248), 10Gb nics for your OSS's might be a cheaper way
>>> to increase the aggregate performance. I think over 1GBps from a
>>> client is possible in cases where you use infiniband and rdma to
>>> deliver data.
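The dirty-page tuning described above could be sketched as follows. The specific values are illustrative assumptions, not measurements from this hardware; the right numbers depend on RAM size and raid controller speed:

```shell
# Start background writeback earlier (RHEL 5 default is 10% of RAM)
# and keep more headroom below dirty_ratio (default 40%), where
# writers begin to block.
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=20

# To persist across reboots, add to /etc/sysctl.conf:
#   vm.dirty_background_ratio = 5
#   vm.dirty_ratio = 20
```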
>>> David Merhar wrote:
>>>> Our OSS's with 2x1Gb NICs (bonded) appear limited to 1Gb worth of
>>>> write throughput each.
>>>> Our setup:
>>>> 2 OSS serving 1 OST each
>>>> Lustre 1.8.5
>>>> RHEL 5.4
>>>> New Dell M610's blade servers with plenty of CPU and RAM
>>>> All SAN fibre connections are at least 4Gb
>>>> Some notes:
>>>> - A direct write (dd) from a single OSS to the OST gets 4Gb, the
>>>> fibre wire speed.
>>>> - A single client will get 2Gb of lustre write speed, the client's
>>>> ethernet wire speed.
>>>> - We've tried bond mode 6 and 0 on all systems. With mode 6 we
>>>> see both NICs on both OSSs receiving data.
>>>> - We've tried multiple OSTs per OSS.
>>>> But 2 clients writing a file will get 2Gb of total bandwidth to the
>>>> filesystems. We have been unable to isolate any particular
>>>> bottleneck. None of the systems (MDS, OSS, or client) seem to be
>>>> working very hard.
>>>> The 1Gb per OSS threshold is so consistent that it almost appears
>>>> by design - and hopefully we're missing something obvious.
>>>> Any advice?
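The direct-write test mentioned above might look like the sketch below. The device path and sizes are placeholders; oflag=direct bypasses the page cache so the dirty-ratio effects discussed elsewhere in the thread don't inflate the number:

```shell
# Raw sequential write to the OST block device, bypassing the page
# cache. WARNING: destructive -- only run against a device whose
# contents you can afford to lose. /dev/sdX is a placeholder.
dd if=/dev/zero of=/dev/sdX bs=1M count=4096 oflag=direct
```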
>>>> Lustre-discuss mailing list
>>>> Lustre-discuss at lists.lustre.org