[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15

Fri Apr 3 11:27:16 PDT 2009

On Apr 2, 2009, at 6:43 PM, Andreas Dilger wrote:

> On Mar 31, 2009  23:55 -0400, Michael Booth wrote:
>> On Mar 31, 2009, at 11:35 PM, di wang wrote:
>>> Andreas Dilger wrote:
>>>> If each compute timestep takes 0.1s during IO vs 0.01s without IO and
>>>> you would get 990 timesteps during the write flush in the second case
>>>> until the cache was cleared, vs. none in the first case.  I suspect
>>>> that the overhead of the MPI communication on the Lustre IO is small,
>>>> since the IO will be limited by the OST network and disk bandwidth,
>>>> which is generally a small fraction of the cross-sectional bandwidth.
>>>>
>>>> This could be tested fairly easily with a real application that is
>>>> doing computation between IO, instead of a benchmark that is only doing
>>>> IO or only sleeping between IO, simply by increasing the per-OSC write
>>>> cache limit from 32MB to e.g. 1GB in the above case (or 2GB to avoid the
>>>> case where 2 processes on the same node are writing to the same OST).
>>>> Then, measure the time taken for the application to do, say, 1M timesteps
>>>> and 100 checkpoints with the 32MB and the 2GB write cache sizes.
>>>
>>> Can we implement AIO here? For example, the AIO buffer could be treated
>>> differently from other dirty buffers and not be pushed aggressively to
>>> the server. It seems that with buffer_write, users have to deal with the
>>> fs buffer cache issue in their application; I am not sure that is good
>>> for them, and we may not even output these features to the application.
>
> I'm not sure what you mean.  Implementing AIO is _more_ complex for the
> application, and in essence the current IO is mostly async except when
> the client hits the max dirty limit.  The client will still flush the
> dirty data in the background (despite Michael's experiment), it just
> takes the VM some time to catch up.

Even after an fsync?

The experiment was to see whether the dirty cache is in fact being
flushed as designed.

for (writesize = small; writesize < hundreds of megabytes; writesize += increment) {
  for (sixty iterations) {
    write <---- writesize
    timer1
    fsync
    timer2
  }
}

Run again, but sleep for one second between the write and the fsync:

for (writesize = small; writesize < hundreds of megabytes; writesize += increment) {
  for (sixty iterations) {
    write <---- writesize
    sleep(1 second)
    timer1
    fsync
    timer2
  }
}
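
The two loops above can be sketched as a runnable program.  This is only
a minimal illustration of the method, not the code that produced the
plots in this message: the function names (best_fsync_time,
fsync_speedup), the target path and the use of Python are all
assumptions made for the example.

```python
import os
import time


def best_fsync_time(path, write_size, iterations=60, sleep_s=0.0):
    """Return the smallest fsync duration (seconds) seen over `iterations`.

    Each iteration appends `write_size` bytes, optionally sleeps, and then
    times only the fsync call (timer2 - timer1 in the pseudocode).
    """
    buf = b"\0" * write_size
    best = float("inf")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(iterations):
            # os.write may write fewer bytes than asked for, so loop.
            view = memoryview(buf)
            while view:
                written = os.write(fd, view)
                view = view[written:]
            if sleep_s:
                time.sleep(sleep_s)  # window for background writeback
            t1 = time.perf_counter()
            os.fsync(fd)
            t2 = time.perf_counter()
            best = min(best, t2 - t1)
    finally:
        os.close(fd)
    return best


def fsync_speedup(path, write_size, iterations=60, sleep_s=1.0):
    """Best non-slept fsync time divided by best slept fsync time."""
    no_sleep = best_fsync_time(path, write_size, iterations, 0.0)
    slept = best_fsync_time(path, write_size, iterations, sleep_s)
    return no_sleep / slept if slept > 0 else float("inf")
```

Calling fsync_speedup() once per write size, with the path on the file
system under test, gives one point of the speedup curve.  A ratio near
1.0 would mean the sleep gave the client no head start on flushing.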

For each write size, take the best fsync time over the sixty iterations,
and plot the speedup that the second routine has for its fsync over the
non-slept fsync.  The results:

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pastedGraphic.pdf
Type: application/pdf
Size: 26637 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20090403/39d8a9fc/attachment.pdf>
-------------- next part --------------

To me this behavior is not consistent with the design.  Are there tests
for writes that verify the cache is behaving as designed?

Another way to view this is to calculate the amount of data that is
moved during the 1 second of sleep.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pastedGraphic.pdf
Type: application/pdf
Size: 27283 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20090403/39d8a9fc/attachment-0001.pdf>
-------------- next part --------------

This seems to show a "bug" in how much dirty data is shipped off to
disk.  Again, I don't think the cache is clearing the way it was
designed to.
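
One way the amount of data moved during the sleep could be estimated
from the two fsync timings is sketched below.  The message does not say
how the plotted figure was derived, so this rests on an assumption that
is not in the original: that fsync time is proportional to the dirty
bytes still cached when fsync is called.  The function name is
hypothetical.

```python
def bytes_flushed_during_sleep(write_size, t_fsync_no_sleep, t_fsync_after_sleep):
    """Estimate bytes written back in the background during the sleep.

    ASSUMPTION (not from the original message): fsync duration scales
    linearly with the dirty data remaining, so the ratio of the two
    timings is the fraction of the write that was still dirty.
    """
    if t_fsync_no_sleep <= 0:
        raise ValueError("non-slept fsync time must be positive")
    # Clamp at 1.0: a slower slept fsync means nothing was flushed early.
    remaining_fraction = min(t_fsync_after_sleep / t_fsync_no_sleep, 1.0)
    return write_size * (1.0 - remaining_fraction)
```

Under that assumption, a 100-byte write whose fsync drops from 2.0 s to
1.0 s after sleeping would have had half of its data flushed during the
sleep.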

>
> Linux VM /proc tunables can be tweaked on the client to have it be more
> aggressive about pushing out dirty data.  I suspect they are currently
> tuned for desktop workloads more than IO-intensive workloads.

I would not characterize these as I/O intensive; they are large bulk
sequential writes.  The cache design gives good performance assuming
most of what is written is also read right away.  Yes, this is good for
desktops, but it really gets in the way of checkpoint/restart and
data-dumping scientific applications.
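
For reference, the Linux VM writeback tunables mentioned above live
under /proc/sys/vm.  The sketch below only reads the current values so
they can be recorded next to a measurement.  The helper name and the
choice of these four files are assumptions for illustration; whether
changing them alters Lustre client flushing is the open question in this
thread, not something the sketch demonstrates.

```python
import os

# Standard Linux writeback knobs; lower values start background
# writeback sooner.
VM_DIRTY_TUNABLES = (
    "dirty_background_ratio",
    "dirty_ratio",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)


def read_vm_dirty_tunables(root="/proc/sys/vm"):
    """Return {name: int value} for each tunable that can be read."""
    values = {}
    for name in VM_DIRTY_TUNABLES:
        try:
            with open(os.path.join(root, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Missing or unreadable on this system: skip it.
            continue
    return values
```

Logging the returned dictionary with each benchmark run makes it easier
to compare results taken on differently tuned clients.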

>
> On Mar 31, 2009  23:55 -0400, Michael Booth wrote:
>> The large size of the I/O request put onto the SeaStar by
>> the Lustre client is giving it an artificially high priority.  Barriers
>> are just a few bytes, the I/Os from the client are in megabytes.
>> SeaStar has no priority in its queue, but the amount of time it takes to
>> clear a megabyte request results in a priority that has thousands of
>> times more impact on the hardware than the small synchronization
>> requests of many collectives.  I am wondering if the interference from
>> I/O to computation is more an artifact of message size and bursts than
>> of congestion or routing inefficiencies in SeaStar.
>>
>> If there are hundreds of megabytes of requests queued up on the network,
>> and there is no priority way to push a barrier or other small MPI
>> request up the queue, it is bound to create a disruption.
>
> Note that the Lustre IO REQUESTS are not very large in themselves (under
> 512 bytes for a 1MB write), but it is the bulk xfer that is large.  The
> LND code could definitely cooperate with the network hardware to ensure
> that small requests get a decent share of the network bandwidth, and
> Lustre itself would also benefit from this (allowing e.g. lock requests
> to bypass the bulk IO traffic) but whether the network hardware can do
> this in any manner is a separate question.

Some codes see an 80% slowdown after writes due to collectives that are
choked out by background I/O.
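
The message-size argument can be put into rough numbers with a strictly
serialized FIFO model.  Everything concrete here is an assumption for
illustration only: the 1 GB/s link bandwidth is hypothetical and is not
a SeaStar figure, the 8-byte barrier and 1 MiB bulk sizes are stand-ins
for "a few bytes" and "megabytes", and a real network may packetize and
interleave traffic rather than drain whole transfers in order.

```python
def fifo_wait_seconds(queued_bytes, link_bytes_per_second):
    """Time a new message waits behind `queued_bytes` in a FIFO with
    no priority, assuming the link drains the queue strictly in order."""
    if link_bytes_per_second <= 0:
        raise ValueError("link bandwidth must be positive")
    return queued_bytes / link_bytes_per_second


ASSUMED_LINK_BW = 1e9      # HYPOTHETICAL 1 GB/s, not a measured value
BARRIER_BYTES = 8          # stand-in for "a few bytes"
BULK_BYTES = 1 << 20       # one 1 MiB bulk transfer

# How much longer one bulk transfer occupies the link than one barrier.
size_ratio = BULK_BYTES / BARRIER_BYTES

# Delay seen by a barrier that arrives behind 100 queued bulk transfers.
wait_behind_100_bulks = fifo_wait_seconds(100 * BULK_BYTES, ASSUMED_LINK_BW)
```

In this model the barrier's delay is set entirely by the bytes queued
ahead of it, which is why a priority lane for small messages, or
interleaving them with bulk data, would remove most of the wait.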

>
>> To borrow the elevator metaphor from Eric,  if all the elevators are
>> queued up from 8:00 to 9:00 delivering office supplies on carts that
>> occupy the entire elevator, maybe the carts should be smaller, and
>> limited to a few per elevator trip.
>
> A better analogy would be the elevators occasionally have a mail cart
> with tens or hundreds of requests for office supplies, and this can
> easily share the elevator with other workers.  Having a separate
> freight elevator to handle the supplies themselves is one way to do
> it, having the elevators alternate people and supplies is another,
> but cutting desks into small pieces so they can share space with
> people is not an option.

I don't want to cut the desks up; I just want to be sure that small
collective messages are not slowed down by orders of magnitude by what
should be a background process.

>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

Mike