[Lustre-discuss] Slow read performance across OSSes

James Robnett jrobnett at aoc.nrao.edu
Thu Oct 22 12:48:09 PDT 2009


   The problem appears to be network congestion control on the
OSSes, triggered by these Cisco 2960 switches' inability to handle
over-subscription well.

   The problem even occurs if the client has a channel-bonded
2 x 1gbit interface pair and only two OSSes are involved.
Sadly, it was that result that led me to believe the problem was
on the client and not the switch or the OSSes.
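
   (For reference, a channel-bonded pair like that is set up roughly
as below on a RHEL-era box; the bonding mode, miimon value, interface
names and addresses here are just placeholders, not our actual
config.)

# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=balance-rr miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0  (same idea for eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none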

    I connected the OSSes and client to an el cheapo Allied
Telesyn 8-port 1gbit switch. A client with a single 1gbit interface
and a test with two OSSes resulted in 116MB/s writes and reads.

   A second test involving 4 OSSes (each with two OSTs) reverted
to the 116MB/s writes and 40-ish MB/s reads, which implies the
AT switch is better but there's still a problem.

   Looking at the OSSes I discovered some sub-optimal IP stack
settings.  In particular:

net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0

   Setting both of those to *1* improved the AT switch case to about
78MB/s reads across 4 OSSes, but that switch doesn't support a 9000 MTU.
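
   For anyone wanting to try the same fix, both settings can be
flipped at runtime on each OSS (run as root):

sysctl -w net.ipv4.tcp_sack=1
sysctl -w net.ipv4.tcp_timestamps=1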

   Fixing up the OSSes with those IP settings and returning to the
original switch (which does support a 9000 MTU) seems to be the best case:

Across 4 OSSes w/8 OSTs, 4MB stripe size, 9000 MTU:
115MB/s writes, 106MB/s reads

Across 4 OSSes w/8 OSTs, 1MB stripe size, 9000 MTU:
115MB/s writes, 111MB/s reads
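
   For completeness, a rough sketch of how a test directory gets its
stripe size and how the MTU gets bumped (the mount point and interface
name are placeholders, and the lfs option letters differ between
Lustre versions: 1.8 uses -s, newer releases use -S):

# 4MB stripe size, striping over all available OSTs
lfs setstripe -s 4M -c -1 /mnt/lustre/testdir

# jumbo frames (persist with MTU=9000 in the ifcfg file)
ifconfig eth0 mtu 9000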

   So for now I'd say it's all better, though I'll be suspicious of our
settings until I see a scaled-up version running on a newer switch
with full throughput.

   I did some "site:lists.lustre.org <string>" type searches for
congestion and tcp sysctl and came up with very little.  Are there
best-practice TCP settings for Lustre in 1gbit, channel-bonded
environments (as opposed to IB or 10G)?  We have our own set here
that we've empirically settled on.

James Robnett

   Here are all the changes we make beyond stock settings.

net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 3000
# Added for Lustre
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 1
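
   To make settings like these stick across reboots they can be
appended to /etc/sysctl.conf on each node and reloaded, e.g.:

cat >> /etc/sysctl.conf <<'EOF'
# Added for Lustre
net.ipv4.tcp_sack = 1
net.ipv4.tcp_timestamps = 1
EOF
sysctl -p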


