[Lustre-discuss] bursty instead of even write performance

Peter Grandi pg_lus at lus.for.sabi.co.UK
Mon Jul 25 06:25:59 PDT 2011


> I have a system to play with here, consisting of 10 HP DL320
> with 12x750GB drives each, attached to SmartArray controllers.

That's good because it is a lot of disk arms (Quite a few people
seem to underestimate IOPS and go for fewer larger drives).

Those are reasonable systems for IO especially if they are
configured for a non-demented RAID layout (low chances of that of
course :->).
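
If you want to double-check what the SmartArray controllers are
actually doing, and assuming HP's 'hpacucli' tool is installed on
the OSSes, something like this should show the logical drive
layout and stripe sizes:

  hpacucli ctrl all show config detail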

> Machines are few years old and possibly underpowered, equipped
> with xeon 3060 and 2GB of RAM each. Each OSS is capable of
> about 120MB/s write and 240MB/s read.

If that's 12x750GB drives doing 120MB/s aggregate writes, ah well,
the "non-demented RAID layout" is indeed a forlorn hope.

> MDS is equally weak by today's standards, a DL140 with xeon
> 5110 and only a gig of RAM.

Not necessarily: that's a newer-style machine (there seems to be
a really large difference in IO performance between pre-PCIe
chipsets and PCIe ones), as the Xeon 5110 implies it is a G4, and
that is a PCIe-class machine.

The 1GiB is too small, but what really matters is a high-IOPS
storage system for the MDS, and you don't say what it is like.
Perhaps you could use one of the DL320s or a slice of one (or
two, so you get a backup MDS).

> OSSs are connected via bonded GigE to a switch

Bonding is often a bad idea, depending on which bonding mode is
used; sometimes two independent interfaces are better.
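
For example (assuming the bond device is called 'bond0'), the
kernel reports which mode is in use with:

  grep 'Bonding Mode' /proc/net/bonding/bond0

'balance-rr' tends to reorder packets and hurt TCP throughput,
while '802.3ad' hashes each flow onto a single slave, so a single
client stream never exceeds 1Gb/s either way.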

> that has 10GigE connection to other switches, through which
> clients connect.

This seems to mean that the server switch has multiple 10Gb/s
connections, else the aggregate read rate of 1.5GB/s is hard to
explain.

In the write case you need to carefully look at switch setup and
Linux network setup on the servers, as you have a situation where
incoming traffic on those 10Gb/s links gets distributed to a set
of 20x 1Gb/s ports with lower aggregate capacity, and lower
individual speeds.
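
A quick way to see whether the 10G-to-1G fan-in is actually
hurting is to look for drops and pause frames on the server NICs
(the interface name 'eth0' here is just a placeholder) and to
check whether Ethernet flow control is enabled:

  ethtool -S eth0 | grep -i -E 'drop|discard|pause'
  ethtool -a eth0

and then compare that with the drop/discard counters on the
switch ports.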

> My clients are two racks of IBM machines (84 of them). I'm
> getting about 700MB/s write and 1.5GB/s read combined, [ ... ]

That's curious, because Lustre usually does better at writing than
reading in simple benchmarks, especially for concurrent reads and
writes. Perhaps the OST RAID setup could be revised :-), and the
switch and Linux network setups reviewed.

Your numbers come from 120 drives and 20 Ethernet interfaces
across servers, for a total of around 70-80TB of capacity and
2-2.2GB/s of network transfer rate, and while the measured
per-drive and per-interface numbers do not seem high, the key
here is "combined": that means there is potentially a significant
rate of seeking, so overall they seem reasonable.

BTW I assume that "combined" here means "concurrent", as in able
to do 700MB/s writes and 1500MB/s reads *at the same time*, and
not "aggregate", as in across the 84 clients, but only reading or
only writing.

> graph.gif shows combined write speed when each node was simply
> writing a large file using dd. Performance slowly drops as the
> disks get full, something I'm used to.

It is remarkable that you have profiled this, as I have noticed
that many people (e.g. GSI some years ago :->) seem surprised
that outer-inner track speed differences (and fragmentation)
mean that near-full disks are way slower than near-empty disks.

In your case the speed perhaps should drop a lot more: typical
disks are about 2x slower on inner tracks than on outer tracks,
which probably means that something is limiting how much
advantage is taken of the speed of the outer tracks.

But the 700MB/s write speed here seems "aggregate" rather than
"concurrent", as there are negligible reads driving up IOPS and
the writes are purely sequential, and this rate is somewhat
disappointing, as it implies around 35MB/s per 1Gb/s interface
(and barely 6MB/s per disk).

> But nodes.gif shows write speed of each node, which shows
> things I'm not used to - long periods of no activity, then
> sudden bursts, then again nothing.

Why? Writing is heavily subject to buffering, both at the Linux
level and the Lustre level, and flushes happen rarely, often with
pretty bad consequences. With the default Linux etc. setup it
happens pretty often that some GB of "dirty" pages accumulate in
memory and then get "burped" out in one go congesting the storage
system.
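
You can see (and shrink) how much the Linux flusher is allowed to
accumulate before writing back; for example (these are the stock
sysctls, and the values below are just an illustration of forcing
earlier, smaller flushes):

  sysctl vm.dirty_ratio vm.dirty_background_ratio
  sysctl -w vm.dirty_background_bytes=268435456   # start writeback at ~256MB dirty
  sysctl -w vm.dirty_bytes=1073741824             # hard limit at ~1GB dirty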

> I would assume each client to have a steady and even write
> activity, if only at 8-9 MB/s, but that's not what I see.

Each OSS with 12x 750GB disks (each of which is capable of
average transfer rates of around 50MB/s) should be doing a few
hundred MB/s.

Ahh, I now realize that 'nodes.gif' is the transfer rate of each
*client* node, not each *server* (OSS) node. In that case peak
transfer rates of 20MB/s, with much lower averages, when writing
to 120 server disks (roughly 1-1.5 server disks per client node)
are not that awesome.

> So, my question: Is what I'm observing an expected situation?

Depends on what you are observing, as you are not clear about
what you are measuring. For example, you don't state where the
'dd' is running and what its parameters (e.g. 'bs', 'iflag',
'oflag') are. Presumably it is running on multiple client nodes,
otherwise you would not be getting 500-700MB/s aggregate (unless
the client node had a 10Gb/s interface), and the "nodes" you
refer to seem to be the client nodes (more than 10 graphs).

Maybe it would be interesting to measure with something like this:

  dd bs=1M count=10000 if=/dev/zero conv=fdatasync of=/lus/TEST10G

the speed of one OST mounted as 'ldiskfs', locally on the OSS,
for both the optimal-case write (as above) and the corresponding
read, to check the upper bounds.
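
and for the read side, something like (dropping the page cache
first so you measure the disks rather than RAM):

  echo 3 > /proc/sys/vm/drop_caches
  dd bs=1M count=10000 if=/lus/TEST10G of=/dev/null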

Then try the same on a single client with a 1Gb/s interface,
ideally also on a single client with a 10Gb/s interface, then on
10 clients with a 1Gb/s interface each (same number of clients as
OSSes), then on 20 clients (same number of clients as total
server interfaces), and then on 40 clients (more clients than
server interfaces or servers, and 3 server disks per client,
which between them should be able to deliver close to 1Gb/s).
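
A rough way to run the scaling steps above is to start the same
'dd' on N clients at once and sum the rates; a minimal sketch,
assuming password-less 'ssh' and hypothetical client names
'client01'..'client20' (each writing its own file so the streams
don't collide):

  for n in $(seq -w 1 20); do
    ssh client$n "dd bs=1M count=10000 if=/dev/zero conv=fdatasync \
      of=/lus/TEST10G.client$n" &
  done
  wait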

> Or am I right that I should be seeing more balanced write
> activity from each node?

Well, that depends on how much write buffering you have
configured, explicitly or implicitly, in the Linux flusher and in
Lustre.
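
On the Lustre side, the per-OSC dirty limit on each client can be
inspected (and lowered) with 'lctl'; for example:

  lctl get_param osc.*.max_dirty_mb
  lctl set_param osc.*.max_dirty_mb=16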

But the big deal is not that you are getting bursty IO, but that
the numbers involved are not that awesome for 120 disks and 20
network interfaces across the servers.

The 20 network interfaces mean that each server can't do more
than 200-220MB/s in or out, but since the 12 disks per server
should get you a lot more than that, you should be getting close
to maximum utilization of those network interfaces.

Perhaps, given that each OSS is anyhow limited to 200-220MB/s by
its 2 network interfaces, you could reconfigure your storage
system to take advantage of that: go for a lower local peak
transfer rate :-) and aim at lower latency and higher IOPS, as
you have many clients (but that's perhaps a different
discussion).

> Since all of the lustre settings are at their defaults, what
> should I look into to see if I can tune anything?

There are a few tuning guides with various settings, and
discussions in this mailing list, in particular as to RPCs.
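
For example, the per-OSC RPC parameters on the clients can be
inspected and changed with 'lctl' (whether raising them actually
helps is site-dependent):

  lctl get_param osc.*.max_rpcs_in_flight osc.*.max_pages_per_rpc
  lctl set_param osc.*.max_rpcs_in_flight=16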

I just did a web search with:

  +lustre +rpc write buffering rates OR speed OR performance

and got several relevant hits, e.g.:

  http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/lustre-hpc-technical%20bulletin-dell-cambridge-03022011.pdf
  http://wiki.lustre.org/images/4/40/Wednesday_shpc-2009-benchmarking.pdf

In particular the Dell UK HPC people are doing valuable work with
Lustre, and their findings are more generally applicable than
their kit (which BTW I quite like).


