[Lustre-discuss] tuning for small I/O

Peter Grandi pg_lus at lus.for.sabi.co.UK
Sat Jan 16 09:58:51 PST 2010


I have received some offline updates about this story:

>>> I'm attempting to run a pair of ActiveMQ java instances,
>>> using a shared Lustre filesystem mounted with flock for
>>> failover purposes.

> The 'flock' is the key issue here, probably even more than the
> "small I/O" issue. [ ... ]

>>> [ ... ] ActiveMQ, at least the way we are using it, does a
>>> lot of small I/O's, like 600 - 800 IOPS worth of 6K I/O's.

> That seems pretty reasonable. I guess that is a few hundred/s
> worth of journal updates. Problem is, they will be mostly
> hitting the same files, thus the need for 'flock' and
> synchronous updates. So it matters *very much* how many of
> those 6K IOs are transactional, that is, involve locking and
> flushing to disk. I suspect from your problems and the later
> statement "async is not an option" that each of them is a
> transaction.

>>> The disks that are backing the OSSs are all SAS 15K disks in
>>> a RAID1 config.

> RAID1 is nice, but how many? That would be a very important
> detail.

This is apparently a 14-drive RAID10 (hopefully a true RAID10
of 7x(1+1) mirror pairs rather than the RAID0+1 of 7+7
mentioned offline).

That means, at best, a total rate of perhaps 100-120 6K
transactions per disk (depending on the number of log files and
how they are spread across the stripes).
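
To make the arithmetic explicit, a back-of-the-envelope sketch
(taking the 600-800 IOPS figure above at face value and assuming
a 7x(1+1) layout, so every write lands on one mirror pair and
thus on both of its disks):

  # Rough per-spindle estimate from the figures quoted above.
  iops_total = (600, 800)   # application-level small writes per second
  mirror_pairs = 7          # 14 drives as 7 RAID1 pairs, striped (RAID10)
  for iops in iops_total:
      per_pair = iops / mirror_pairs
      print(f"{iops} IOPS total -> ~{per_pair:.0f} writes/s per pair (and per disk)")
  # Roughly 85-115 per disk, i.e. the 100-120 above, uncomfortably
  # close to what a single 15K RPM spindle can sustain for small
  # synchronous writes.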

The total data rate over Lustre is around 5MB/s, and even at
just 6K per operation Lustre should manage that, though I
suspect that the achievable 'flock' rate depends more on the
MDS storage system than on the OSS one.
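
The 5MB/s figure is just the same numbers multiplied out
(a sketch, nothing measured here):

  # Aggregate data rate implied by the quoted IOPS and request size.
  io_size_kb = 6
  for iops in (600, 800):
      print(f"{iops} IOPS x {io_size_kb}KB = ~{iops * io_size_kb / 1024:.1f} MB/s")
  # About 3.5-4.7 MB/s: trivial as a streaming rate for Lustre; the
  # cost is per-operation locking and flushing, not bandwidth.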

If every write is a transaction, and (as one would hope)
ActiveMQ requests that every transaction be committed to stable
storage, then it is both a 'flock' and an 'fsync' problem.
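
For concreteness, a minimal sketch (in Python, not ActiveMQ's
actual journal code; the file name and record format are made
up) of the per-transaction pattern being discussed, an exclusive
'flock' around a small append followed by an 'fsync':

  import fcntl, os

  def append_record(path, record: bytes):
      # Requires the filesystem to be mounted with '-o flock' on Lustre.
      fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
      try:
          fcntl.flock(fd, fcntl.LOCK_EX)   # interlock with the failover peer
          os.write(fd, record)
          os.fsync(fd)                     # commit the record to stable storage
      finally:
          fcntl.flock(fd, fcntl.LOCK_UN)
          os.close(fd)

Each such call pays a lock round trip plus a synchronous flush,
so at 600-800 of them per second those two latencies, not the
data volume, set the ceiling.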

Then, depending on the size of the queue, I'd also look (if not
already done) at using host adapters with a fairly large
battery-backed buffer/cache for both the MDS and the OSSes, as
the latency may be due to waiting on uncached writes. True, the
setup already seems to work fast enough when the disks are
local, which may mean that the over-the-wire latencies add too
much; but reducing the storage system latency may still help,
even if it is not needed in the local case.
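
As a crude latency budget (typical figures, not measured on this
setup), compare a synchronous write that must reach a 15K RPM
platter with one acknowledged by a battery-backed controller
cache:

  # Illustrative numbers only: ~half a rotation plus a short seek
  # vs. a cache acknowledgement.
  uncached_ms = 0.5 * (60_000 / 15_000) + 3.5   # ~5.5 ms per write
  bbwc_ms = 0.3
  for name, ms in (("uncached", uncached_ms), ("BBWC", bbwc_ms)):
      print(f"{name}: ~{ms:.1f} ms/write -> ~{1000/ms:.0f} serialized writes/s")
  # Serialized fsync'ed writes to a bare spindle top out far below
  # 600-800/s; a battery-backed cache absorbs the flush and removes
  # most of that wait.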

That is purely a storage layer issue (for both MDTs and OSTs)
and has nothing to do with Lustre itself, while the 'flock'
issue (and the flushing from the *clients*) does involve Lustre
(even if it too *may* be alleviated by very low latency
battery-backed buffers/caches).

Again, interlocked stable ('flock'/'fsync') storage operations
between two clients via a third server are difficult to make
fast in the context of remote file access, because of latency
and flushing issues, whether the filesystem is general purpose
like NFS or oriented to parallel bulk streaming like Lustre.


