[Lustre-discuss] performance tuning w/ dbench, bonnie++

Jay Christopherson jc.listmail at gmail.com
Mon Dec 21 09:57:04 PST 2009

I have a relatively small Lustre environment.  I'm running 4 OSS's with 4
OST's, running version (I recently upgraded from 1.6).  The MDS is a
separate host with separate disk.  I've been using Lustre to host shared
logs, files, and application queues, most recently a JMS shared queue.  It
worked well until we had tuned our Java application and databases
to the point where throughput on Lustre became the bottleneck.  When we
moved our application queues to local disk, instead of shared disk, our
application throughput really shot through the roof.  Up to that point, we had
been seeing a lot of "pauses" in IO: throughput (as measured by our
application) would be running right along and then, periodically, IO would
take nearly 2000ms to complete, as opposed to the more normal sub-200ms
times.  This is IO as we define it in terms of our application, not simply
disk IO.
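
A minimal sketch of how such pauses could be looked for outside the
application, by timing small synchronous writes on the shared mount.  This
is an illustration only, not our application's IO path: the function names,
the 4096-byte payload, the 200-write count, and the 200ms stall threshold
are all assumptions chosen for the example.

```python
import os
import statistics
import tempfile
import time


def measure_write_latencies(directory, count=200, size=4096):
    """Write `count` small files with fsync; return per-write latencies in ms.

    Hypothetical helper: each probe file is created, synced, and removed, so
    the directory is left as it was found.
    """
    payload = b"x" * size
    latencies = []
    for i in range(count):
        path = os.path.join(directory, f"latency_probe_{i}.tmp")
        start = time.perf_counter()
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to storage
        latencies.append((time.perf_counter() - start) * 1000.0)
        os.remove(path)  # cleanup is not included in the timing
    return latencies


def summarize(latencies, stall_ms=200.0):
    """Return the median, the maximum, and the count of writes over stall_ms."""
    return {
        "median_ms": statistics.median(latencies),
        "max_ms": max(latencies),
        "stalls": sum(1 for ms in latencies if ms > stall_ms),
    }
```

Calling summarize(measure_write_latencies("/logstore/test")) and then the
same on a local directory would give a rough side-by-side of median and
worst-case write latency.  A high "stalls" count on the shared mount only
would point at the file system rather than the application.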

After eliminating everything else, we moved our application queues and
logging to local disk and throughput stabilized throughout the entire run of
a test at peak loads.  Trying to explain this behavior, we started
benchmarking Lustre vs. local disk.  At first, the times were really bad,
showing something like 15MB/s on Lustre vs. 400MB/s local, as reported by
dbench.  After much tuning and re-architecting of our Lustre setup (which,
admittedly, was not ideal), we were able to see big improvements, as
evidenced by repeated Bonnie++ tests.  However, no matter what our
configuration, dbench never shows any improvement.  It never shows more than
25MB/sec, with 5 clients.  I *KNOW* we are getting better throughput, but I
need to be able to prove it.  Bonnie++ is showing improvement (like
100+MB/sec), but I'd like to see two or three sources verify it for me
before I go back and start re-testing our application again.

I'm using Bonnie++ and dbench like so:

# bonnie++ -d /logstore/test

# dbench -t 60 -D /logstore/test 5

I'm hoping it's simply a matter of me not using the test correctly or
something else that makes me the culprit.  If there are other tests that I
should be doing, that would be helpful too.  I looked at IOR, which has been
a pain to get running since LAM really, really, really doesn't want to
compile or install correctly on my system (CentOS 5.2, x86_64).
