[Lustre-discuss] Lustre v1.8.0.1 slower than expected large-file, sequential-buffered-file-read speed

Rick Rothstein rickrsr at gmail.com
Wed Aug 5 10:30:22 PDT 2009


Hi Andreas -

Thanks for the advice.
I will gather additional CPU stats and see what shows up.

However, CPU does not seem to be a factor
in the slower-than-expected large-file buffered reads.

My machines have dual quad-core 2.66 GHz processors,
and overall CPU usage hovers around 50%
when I'm running 16 "dd" read jobs.

But a suspected client-side caching problem crops up
when I just run a simple "dd" read job twice.

The first time I run the single "dd" read job,
I get the expected throughput of 60 MB/s or so.
However, the second time I run the job,
I get a throughput of about 2 GB/s,
well above the top speed of my 10 Gb NIC
(roughly 1.25 GB/s),
which is only possible, I think,
if the entire file was cached on the client
during the first "dd" run.
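
For reference, the repeated test is just this
(the path and block size are placeholders, not my exact parameters):

    # first run: ~60 MB/s, data comes over the wire from the OSSes
    dd if=/mnt/lustre/bigfile of=/dev/null bs=1M

    # second run: ~2 GB/s, data is served from the client page cache
    dd if=/mnt/lustre/bigfile of=/dev/null bs=1M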

So, if I run 16 "dd" jobs,
each trying to cache an entire large file on the client,
that could explain the unexpectedly slow aggregate throughput.
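
Roughly, the aggregate test looks like this
(the file names are placeholders):

    # start 16 sequential readers in parallel, then wait for them all
    for i in $(seq 1 16); do
        dd if=/mnt/lustre/bigfile$i of=/dev/null bs=1M &
    done
    wait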

I would have thought that setting a low value for "max_cached_mb"
would have solved this problem,
but it made no difference.
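
For the record, I set it with lctl
(the 128 MB value here is just one example; I tried several):

    # cap the client-side read cache on this client at 128 MB
    lctl set_param llite.*.max_cached_mb=128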

A further indication that client-side caching is at the root
of the slowdown is that when I run my single "dd" job twice,
but drop the client-side cache after the first run
(via "/proc/sys/vm/drop_caches"),
I get the expected 60 MB/s or so throughput on both runs.
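
Concretely, the cache drop between the two runs is just:

    sync
    echo 3 > /proc/sys/vm/drop_caches   # 3 = free page cache plus dentries/inodes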

Until I learn how to overcome this slowdown,
I'll see if I can obtain my required
concurrent, multiple-large-file read speed
by carefully striping the files over a few boxes.
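
By "carefully striping" I mean something along these lines
(the directory is a placeholder, and the stripe count and size
are guesses I still need to tune against my OST layout):

    # stripe new files in this directory across 4 OSTs in 1 MB chunks
    lfs setstripe -c 4 -s 1m /mnt/lustre/bigfiles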

Again,
thanks for your help,
and
I'd appreciate any other suggestions you might have,
or
any ideas for other diagnostics we might run.

Rick

On 8/4/09, Andreas Dilger <adilger at sun.com> wrote:
>
> On Aug 04, 2009  10:30 -0400, Rick Rothstein wrote:
> > I'm new to Lustre (v1.8.0.1), and I've verified that
> > I can get about 1000 MB/s aggregate throughput
> > for large-file sequential reads using direct I/O
> > (limited only by the speed of my 10 Gb NIC with a TCP offload engine).
> >
> > The above direct-I/O "dd" tests achieve about 1000 MB/s
> > aggregate throughput, but when I try the same tests with normal
> > buffered I/O (by just running "dd" without "iflag=direct"),
> > the runs only get about 550 MB/s aggregate throughput.
> >
> > I suspect that this slowdown may have something to do with
> > client-side caching, but normal buffered reads have not sped up,
> > even after I've tried such adjustments as:
>
> Note that there is a significant CPU overhead on the client when using
> buffered IO, simply due to CPU usage from copying the data between
> userspace and the kernel.  Having multiple cores on the client (one
> per dd process) allows distributing this copy overhead between cores.
>
> You could also run "oprofile" to see if there is anything else of
> interest that is consuming a lot of CPU.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>

