[Lustre-discuss] tuning for small I/O

Peter Grandi pg_lus at lus.for.sabi.co.UK
Sun Jan 10 14:19:05 PST 2010


The subject of this email "[Lustre-discuss] tuning for small I/O" is
a bit in the category of "tuning jackhammers to cut diamonds".

Lustre has been designed for massive streaming parallel IO, and does
OK-ish for traditional ("home dir") situations. Not necessarily
for shared message databases.

>>> I'm attempting to run a pair of ActiveMQ java instances,

Life will improve I hope :-).

>>> using a shared Lustre filesystem mounted with flock for failover
>>> purposes.

The 'flock' is the key issue here, probably even more than the
"small I/O" issue.

Consider this thread on a very similar topic:

  http://lists.lustre.org/pipermail/lustre-discuss/2008-October/009001.html
   "The other alternative is "-o flock", which is coherent locking
    across all clients, but has a noticable performance impact"

  http://lists.lustre.org/pipermail/lustre-discuss/2009-February/009690.html
   "It is both not very optimized and slower than local system since
    it needs to send network rpcs for locking (Except for the
    localflock which is same speed as for local fs)."

  http://lists.lustre.org/pipermail/lustre-discuss/2004-August/000425.html
   "We faced similar issues when we tried to access/modify a single
    file concurrently from multiple processes (across multiple
    clients) using the MPI-IO interfaces. We faced similar issues
    with other file systems as well, so we resorted to implementing
    our own file/record-locking in the MPI-IO middleware (on top of
    file-systems)."

  http://lists.lustre.org/pipermail/lustre-discuss/2009-February/009679.html

etc.

>>> There's lots of ways to do ActiveMQ failover and shared
>>> filesystem just happens to be the easiest.

Easiest is very tempting, but perhaps not the most effective.  If
you really cared about getting meaningful replies you would have
provided these links BTW:

  http://activemq.apache.org/shared-file-system-master-slave.html
  http://activemq.apache.org/replicated-message-store.html

Doing a bit more searching, it turns out that there are several
ways to tune ActiveMQ and this may reduce the number of barrier
operations/committed transactions. Maybe. There seems to be
something vaguely interesting here:

  http://fusesource.com/docs/broker/5.0/persistence/persistence.pdf

Otherwise I'd use the master/slave replication feature, but this
is just an impression.

>>> ActiveMQ, at least the way we are using it, does a lot of small
>>> I/O's, like 600 - 800 IOPS worth of 6K I/O's.

That seems pretty reasonable. I guess that is a few hundred/s worth
of journal updates. Problem is, they will be mostly hitting the same
files, thus the need for 'flock' and synchronous updates.

So it matters *very much* how many of those 6K IOs are
transactional, that is, involve locking and flushing to disk.
I suspect from your problems and the later statement "async is
not an option" that each of them is a transaction.

>>> When I attempt to use Lustre as the shared filesystem, I see
>>> major IO wait time on the cpu's, like 40 - 50%.

Why do many people fixate on IO wait? Just because it is easy to
see? Bah!

If there is one, what is the performance problem *on the client*
in terms of *client application issues*? That's what matters.
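
To see what the client is actually doing, I would look at the
client-side counters rather than at iowait, something like this on a
client node (parameter names from memory, check your version):

  # VFS-level operation counts as seen by the Lustre client
  lctl get_param llite.*.stats

  # RPC size and in-flight distributions, per OST
  lctl get_param osc.*.rpc_stats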

>>> My OSS's and MDS don't seem to be particularly busy

Unsurprisingly. How many OSSes, how many OSTs per OSS, and how many
disks? Just curiosity, it is not that important.

>>> being 90% idle or more while this is running.

Ideally they would be 100% idle :-).

>>> If I remove Lustre from the equation and simply write to local
>>> disk OR to an iSCSI mounted SAN disk, my ActiveMQ instances
>>> don't seem to have any problems.

And which problems do you have when running with Lustre? You haven't
said. "major IO wait" and "90% idle" are not problems, they are
statistics, and they could mean something else.

>>> The disk that is backing the OSS's are all SAS 15K disks in a
>>> RAID1 config.

RAID1 is nice, but how many pairs? That would be a very important detail.

>>> 1. What should I be looking at to tune my Lustre FS for this
>>> type of IO?

Not much, really. It is both a storage system problem and a network
protocol problem.

The "small I/O" problem is the least of the two, the real problem is
that you have "small I/O on a shared filesystem with distributed
interlocked updates to the same files", that is network protocol
problem.

The network protocol problem is very, very difficult, because the
server needs to synchronize the two clients and present a current
image of the files to both; that is, when one client does an update,
the other client must be able to see it "immediately", which is not
easy. For example, I have heard reports that when writing from a
client to a Lustre server, another client sometimes (in a small
percentage of cases) only sees the update dozens of seconds later
(but your use of locking may help with that). I wonder if locking is
actually enabled and used on that system, BTW.
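
That at least is easy to check, since the mount options show up in
/proc/mounts on each client:

  grep lustre /proc/mounts
  # look for 'flock' (or 'localflock', or neither) in the options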

>>> [ ... ] I have also ensured that striping is disabled (lfs
>>> setstripe -d) on the shared directory.

Unless your files are really big that does not matter. Uhm, the
message store seems to actually use a few biggish (32MB?) journal
files plus (perhaps smaller) indices:

  http://activemq.apache.org/persistence.html
  http://activemq.apache.org/kahadb.html
  http://activemq.apache.org/amq-message-store.html
  http://activemq.apache.org/should-i-use-transactions.html
  http://activemq.apache.org/how-lightweight-is-sending-a-message.html

So perhaps the striping does have an effect.
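
Either way it is worth verifying what the directory default actually
is, something like this (the path is made up):

  # show the default striping on the shared directory itself
  lfs getstripe -d /mnt/lustre/activemq-store

  # e.g. keep each journal on a single OST
  lfs setstripe -c 1 /mnt/lustre/activemq-store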

>>> I guess I am just not experienced enough yet with Lustre to know
>>> how to track down and resolve this issue. I would think Lustre
>>> should be able to handle this load, but I must be missing
>>> something.

Sure it is able to handle that load -- it does, at great effort and
going against the grain of what Lustre has been designed for.

The basic problem is that instead of using a low latency distributed
communication system for interlocked updating of the message store,
you are attempting to use the implicit one in a filesystem because
"shared filesystem just happens to be the easiest", even though
shared filesystems are not meant to give you a high transaction rate
with low latency for a shared database.

In ordinary shared filesystems locking is provided for coarse
protection and IO is expected to be on fairly coarse granularity
too, and more so for Lustre and other cluster systems.

>>> For the record, NFS was not able to handle this load either, at
>>> least with default export settings

It is very, very difficult to handle that workload across multiple
clients in a distributed filesystem. And NFSv3 or older versions have
their own additional issues.

>>> (async was improved, but async is not an option).

If 'async' is not an option you have a big problem in general as
hinted above.

Also, but not very related here, the NFS client for Linux has some
significant performance problems with writing. To the point that
sometimes I think that Lustre can be used to replace NFS even when
no clustering is desired (single OSS), simply because its protocol
is better (and there is a point also to LNET).

>> First of all, I would suggest benchmarking your Lustre setup for
>> small file workload.

I may have misunderstood, but the original poster nowhere wrote
"small file workload", but "small I/O", which is quite different.

The shared message store he has set up is to be updated concurrently
with small I/O transactions, but it is contained in journals of
probably a few dozen MB each.

>> For example, use Bonnie++ in IOPS mode to create small sized
>> files on Lustre. That will tell you limit of Lustre setup. I got
>> about 6000 creates/sec on my 12 disk (Seagate SAS 15K RPM 300 GB)
>> RAID10 setup.

Small files and creates/sec do not seem to be what the original
poster is worried about, even if 1000 metadata operations/s per disk
pair seems nice indeed.
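
For completeness, the sort of test being suggested would be something
like this (directory and sizes made up; the '-n' figure is in
multiples of 1024 files):

  # small-file create/stat/delete phases only, skip the streaming IO
  bonnie++ -d /mnt/lustre/bench -s 0 -n 16:6144:6144:16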

>> Try disabling Lustre debug messages on all clients: sysctl -w
>> lnet.debug=0

That may help, I hadn't thought of that.
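
The same can be done (and checked) with lctl on each client, if I
remember the parameter name correctly:

  lctl get_param debug      # what is currently being logged
  lctl set_param debug=0    # switch debug logging off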

>> Try increasing dirty cache on client nodes: lctl set_param
>> osc.*.max_dirty_mb=256 Also, you can bump up max rpcs in flight
>> from 8 to 32 but given that you have gigabit ethernet network, I
>> don't think it will improve performance.

That can be counterproductive, as the problem seems to be concurrent
interlocked updates from multiple clients to the persistent database
of a shared queue system, as is clear from the statement that "async
is not an option".

> Also, the Lustre manual includes a section on improving performance when 
> working with small files:
> http://manual.lustre.org/manual/LustreManual18_HTML/LustreTroubleshootingTips.html#50532481_pgfId-1291398

The real problem, as hinted above, is that interlocking is of the
essence in this application, for a message store used by many
distributed clients, and that message store is not made up of small
files.

However, there is an interesting point there about precisely the
type of issue I was alluding to above regarding interlocking:

  "By default, Lustre enforces POSIX coherency semantics, so it
  results in lock ping-pong between client nodes if they are all
  writing to the same file at one time."

Perhaps the other advice may also be relevant:

 "Add more disks or use SSD disks for the OSTs. This dramatically
  improves the IOPS rate."

but I think it is mostly a locking latency issue. Even if it were a
small-transaction issue, a barrier every 6K may not work well with
SSDs, which usually have an erase page size around 32KiB.

Good luck ;-).


