[Lustre-discuss] Reads Starved

Peter Grandi pg_mh at mh.to.sabi.co.UK
Thu Jun 2 13:09:40 PDT 2011

> If I use about 10 clients, then the write performance of my
> system is 2 GB/s, and the read performance of my system is
> also 2 GB/s.  These are the results when I run either reads or
> writes, but not both at the same time.

Unfortunately these numbers are meaningless without an idea of
the storage system and the access patterns.

> But, when I have 10 clients doing reads, and a different 10
> clients doing writes, the write performance barely drops, but
> the read performance drops to about 150 MB/s.

This may be entirely the right thing to happen.

> I am using the deadline scheduler.

That's a good choice in many cases to avoid starvation problems
with CFQ.
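To double-check, the active elevator for each block device can be listed like this (a small sketch; device paths vary per system, and on the OSSes it is the OST backend devices that matter):

```shell
#!/bin/sh
# List the active I/O scheduler for every block device.
# The scheduler shown in [brackets] is the active one.
for q in /sys/block/*/queue/scheduler; do
    [ -r "$q" ] || continue
    printf '%s: ' "$q"
    cat "$q"
done
```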

> Can anyone suggest why this is occurring, and how to address
> it?

Well, depending on storage system and access patterns, the right
figure might be the 150MB/s and the wrong one the 2GB/s. You are
not even saying whether the 10 and 10 are reading/writing the
same file(s) or different ones (which can matter a great deal
for MDS/OSS/client inode sync), or how many client systems those
threads run on.

Anyhow, it could be because Lustre caches writes but (usually)
not reads, which often makes writes look faster than reads
(though usually by less than that); or your storage system does
not have the IOPS to do better; or flusher issues; or disk host
adapter issues (many have buggy elevators or caching policies);
or too much multithreading. It could also be that the prefetcher
of the (awful) Linux page cache needs huge read-aheads to work
well for sequential loads.
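If read-ahead is the suspect, it can be checked and raised both at the block-device level and on the Lustre clients. A rough sketch only: /dev/sdX is a placeholder, and the 16MiB and the llite parameter name are examples to check against your Lustre version's docs, not recommendations:

```shell
#!/bin/sh
# Placeholder device name; substitute your real OST backend device.
DEV=/dev/sdX
# Guard so this sketch is harmless on machines without that device.
if [ -b "$DEV" ]; then
    blockdev --getra "$DEV"        # current read-ahead, in 512-byte sectors
    blockdev --setra 32768 "$DEV"  # e.g. 32768 sectors = 16MiB
fi
# On the Lustre clients, read-ahead is a tunable too (check
# 'lctl list_param llite.*' for the exact names on your version):
command -v lctl >/dev/null 2>&1 &&
    lctl get_param llite.*.max_read_ahead_mb 2>/dev/null
true
```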

It could be the clients caching instead, or limitations in the
client networking or in the switches you are using.

Run 'iostat -xd 1' on the OSSes to get a better idea of what is
going on, and also ideally on the MDS. Also run on the OSSes

  watch -n1 cat /proc/meminfo

and observe how 'Dirty' and 'Writeback' behave. Try also to run
concurrent send/receive (using 2 'nuttcp' instances) between
clients and servers to check that the network can actually run
in both directions simultaneously.
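The concurrent test might look like this ('oss1' is a placeholder hostname; 'nuttcp -S' must already be running there as the server):

```shell
#!/bin/sh
# Guard so the sketch exits quietly where nuttcp is not installed.
command -v nuttcp >/dev/null 2>&1 || { echo "nuttcp not installed"; exit 0; }
# Transmit and receive at the same time, 10 seconds each; compare
# the two rates against the one-direction-at-a-time numbers.
nuttcp -T10 oss1 &       # this host -> oss1
nuttcp -T10 -r oss1 &    # oss1 -> this host (reverse test)
wait
```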

Some (cheaper) network chipsets can only process N packets per
second, where N is less than what is necessary to run both the
transmitter and the receiver at full speed simultaneously (as a
rule the transmitter wins, so if you have those in your OSSes,
bad news for reads).
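Interface error/drop counters give a hint of whether this is happening ('eth0' is a placeholder; the driver-specific counter names printed by 'ethtool -S' vary per NIC):

```shell
#!/bin/sh
IF=eth0   # placeholder interface name
# Standard per-interface statistics, including RX/TX errors and drops:
ip -s link show "$IF" 2>/dev/null || true
# Driver-specific counters, where ethtool is available:
command -v ethtool >/dev/null 2>&1 &&
    ethtool -S "$IF" 2>/dev/null | grep -Ei 'drop|err'
true
```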

> In my idealized world, the read and write performance should
> be proportional to the initial read and write performance, and
> the ratio of the number of read and write clients.

That's movingly innocent and optimistic.

Storage and system performance is extremely anisotropic,
especially with Lustre and large storage systems in general.
