[Lustre-discuss] Lustre v1.8.0.1 slower than expected large-file, sequential-buffered-file-read speed

Andreas Dilger adilger at sun.com
Wed Aug 5 17:25:07 PDT 2009


On Aug 05, 2009  13:30 -0400, Rick Rothstein wrote:
> My machines have dual quad 2.66ghz processors,
> and gross CPU usage hovers around 50%
> when I'm running 16 "dd" read jobs.

Be cautious of nice round numbers for CPU usage.  Sometimes this
means that 1 CPU is 100% busy, and another is 0% busy.  With 16
tasks on an 8-core system you are going to get some kind of CPU
contention, but whether it is too much is hard to say without
digging much more deeply into the code.
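
A quick way to check is to watch per-core utilization while the dd
jobs run, e.g. with mpstat from the sysstat package:

  # per-CPU utilization, refreshed once per second
  mpstat -P ALL 1

If one or two cores sit near 100% while the rest are mostly idle, that
points at a single hot code path rather than a general CPU shortage.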

> But a suspected client-side caching problem
> crops up when I just run a simple "dd" read job twice.
> 
> The first time I run the single "dd" read job
> I get an expected throughput of 60-megabytes-per-second or so.
> However, the second time I run the job, I get a throughput of
> about 2-gigabytes-per-second, which is twice the top speed of
> my 10gb NIC, and only possible, I think, if the entire file was
> cached on the client when the first "dd" job was run.

That is a feature.  Lustre is cache coherent, so the fact that
the whole file can be read from cache on the client with no network
IO is totally safe.  The fact that the second read is much faster
does not, in itself, indicate any problem.
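
If you want to confirm that the file really is being served from the
client page cache, watching "Cached:" in /proc/meminfo between the two
runs is a simple check (the file name below is only a placeholder):

  grep ^Cached: /proc/meminfo     # before the first read
  dd if=/mnt/lustre/testfile of=/dev/null bs=1M
  grep ^Cached: /proc/meminfo     # should grow by roughly the file size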

> So, if I run 16 "dd" jobs, each trying to cache entire large files
> on the client, that could explain the unexpectedly slow aggregate throughput.
> 
> A further indication that client-side caching is at the root of the
> slowdown is that when I run my single "dd" job twice, but drop the
> client-side cache after the first run (via "/proc/sys/vm/drop_caches"),
> I get the expected 60-megabytes-per-second or so throughput for both runs.

Well, that isn't surprising, but it doesn't necessarily indicate why
the reads are going _slower_ than without any cache.  It is of course
very possible that there is some kind of lock contention on the client,
but I thought this had been fixed in the 1.8 release (bug 11817).

Note that using O_DIRECT will of course bypass caching, which is still
desirable if you know you are not re-using the data.
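
Assuming GNU dd, something along these lines reads the file with
O_DIRECT and leaves the client cache alone (the path is a placeholder):

  dd if=/mnt/lustre/testfile of=/dev/null bs=1M iflag=direct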

> Until I learn how to overcome this slowdown problem,
> I'll see if I can obtain my required
> concurrent, multi large file read speed
> by carefully striping the files over a few boxes.

I would run tests with 1, 2, 4, 6, 8, 12, and 16 processes, and see what the
per-task performance is.  Examining oprofile data per run will tell
you what functions become more heavily used when there are more
tasks involved.
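
A rough sketch of such a sweep, assuming one test file per reader under
/mnt/lustre (the names and sizes are only placeholders), could look like:

  # time N concurrent buffered readers, dropping the cache between runs
  # (needs root for drop_caches)
  for n in 1 2 4 6 8 12 16; do
      echo 3 > /proc/sys/vm/drop_caches
      echo "== $n readers =="
      time (
          for i in $(seq 1 $n); do
              dd if=/mnt/lustre/testfile.$i of=/dev/null bs=1M 2>/dev/null &
          done
          wait
      )
  done

Running "opcontrol --start" before each pass and "opcontrol --dump"
afterwards gives you a per-run oprofile sample to compare.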

You might also consider looking at the client rpc_stats, or the
corresponding server brw_stats, to see if the read RPCs become
badly formed with many threads.  The client read_ahead_stats
would also help tell you whether the readahead is going badly with
many threads.
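
On 1.8 those counters live under /proc/fs/lustre and can be read either
directly or through lctl; roughly:

  # on the client
  lctl get_param osc.*.rpc_stats
  lctl get_param llite.*.read_ahead_stats

  # on the OSS
  lctl get_param obdfilter.*.brw_stats

The "pages per rpc" histogram in rpc_stats should stay heavily weighted
toward full-size (256-page, i.e. 1MB) RPCs; if it fragments as the
thread count goes up, that is a good hint about where the time is going.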

> 
> Again,
> thanks for your help,
> and
> I'll appreciate any other suggestions you might have,
> or
> any ideas for other diagnostics we might run.
> 
> Rick
> 
> On 8/4/09, Andreas Dilger <adilger at sun.com> wrote:
> >
> > On Aug 04, 2009  10:30 -0400, Rick Rothstein wrote:
> > > I'm new to Lustre (v1.8.0.1), and I've verified that
> > > I can get about 1000-megabytes-per-second aggregate throughput
> > > for large file sequential reads using direct-I/O.
> > > (only limited by the speed of my 10gb NIC with TCP offload engine).
> > >
> > > the above direct-I/O "dd" tests achieve about a 1000-megabyte-per-second
> > > aggregate throughput, but when I try the same tests with normal buffered
> > > I/O, (by just running "dd" without "iflag=direct"), the runs
> > > only get about a 550-megabyte-per-second aggregate throughput.
> > >
> > > I suspect that this slowdown may have something to do with
> > > client-side-caching, but normal buffered reads have not speeded up,
> > > even after I've tried such adjustments as:
> >
> > Note that there is a significant CPU overhead on the client when using
> > buffered IO, simply due to CPU usage from copying the data between
> > userspace and the kernel.  Having multiple cores on the client (one
> > per dd process) allows distributing this copy overhead between cores.
> >
> > You could also run "oprofile" to see if there is anything else of
> > interest that is consuming a lot of CPU.
> >
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Sr. Staff Engineer, Lustre Group
> > Sun Microsystems of Canada, Inc.
> >
> >

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



