[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15

Mike Booth Michael.Booth at Sun.COM
Tue Mar 31 22:08:10 PDT 2009


Yes. But I do need to finish the assignment that Scott Klasky has for me first.

Mike

Mike Booth
512 289-3805 mobile
512 692-9602

On Apr 1, 2009, at 1:01 AM, Eric Barton <eeb at sun.com> wrote:

> I'd really like to see measurements that confirm that allowing the
> checkpoint I/O to overlap the application compute phase really
> delivers the estimated benefits.  My concern is that Lustre and
> application communications will interfere to the detriment of both and
> end up being less efficient overall.
>
> Mike, can you find 2 apps, one which is communications-intensive and
> another that is only CPU-bound immediately after a checkpoint, and
> measure them?
>
>    Cheers,
>              Eric
>
>> -----Original Message-----
>> From: Andreas.Dilger at Sun.COM [mailto:Andreas.Dilger at Sun.COM] On Behalf Of Andreas Dilger
>> Sent: 31 March 2009 7:51 PM
>> To: Oleg Drokin
>> Cc: Eric Barton; lustre-devel at lists.lustre.org; Michael.Booth at Sun.COM
>> Subject: Re: [Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15
>>
>> On Mar 18, 2009  16:31 -0400, Oleg Drokin wrote:
>>> On Mar 16, 2009, at 8:56 AM, Eric Barton wrote:
>>>> I _do_ agree that for some apps, having sufficient memory on the
>>>> app node to buffer the local component of a checkpoint and let it
>>>> "dribble" out to disk would achieve better utilization of the
>>>> compute resource.  However, parallel apps can be very sensitive to
>>>> "noise" on the network they're using for inter-process
>>>> communication - i.e. the checkpoint data has either to be written
>>>> all the way to disk, or at least buffered somewhere so that moving
>>>> it to disk will not interfere with the app's own communications.
>>>> This latter point is the basis for the "flash cache" concept.
>>>> Actually, I think it's worth exploring the economics of it in more
>>>> detail.
>>>
>>> This turns out to be a very true assertion.  We (I) do see a huge
>>> delay in e.g. MPI barriers done immediately after a write.
>>
>> While this is true, I still believe that the amount of delay seen by
>> the application cannot possibly be worse than waiting for all of the
>> IO to complete.  Also, the question is whether you are measuring the
>> FIRST MPI barrier after the write, or e.g. the SECOND MPI barrier
>> after the write.  Since Lustre currently flushes the write cache
>> aggressively, the first MPI barrier is essentially waiting for all of
>> the IO to complete, which is of course very slow.  The real item of
>> interest is how long the SECOND MPI barrier takes, which reflects the
>> overhead of Lustre IO on the network performance.
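
A minimal sketch of that first-vs-second-barrier measurement, assuming each
rank writes a rank-private file under a hypothetical /mnt/lustre mount and
that buffered stdio is representative enough of the checkpoint write:

    /* Time the FIRST and SECOND MPI barrier immediately after a write.
     * The path and write size below are assumptions for illustration. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define WRITE_SIZE (64UL << 20)          /* 64 MiB per rank */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(WRITE_SIZE);
        memset(buf, 0xab, WRITE_SIZE);

        char path[256];
        snprintf(path, sizeof(path), "/mnt/lustre/ckpt.%d", rank);
        FILE *fp = fopen(path, "w");
        if (fp) {
            fwrite(buf, 1, WRITE_SIZE, fp);
            /* fclose() hands the data to the Lustre client cache; when it
             * reaches the OSTs is up to the client's flushing policy. */
            fclose(fp);
        }

        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);         /* FIRST barrier after the write */
        double t1 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);         /* SECOND barrier: residual
                                                interference from the flush */
        double t2 = MPI_Wtime();

        if (rank == 0)
            printf("first barrier %.6fs, second barrier %.6fs\n",
                   t1 - t0, t2 - t1);

        free(buf);
        MPI_Finalize();
        return 0;
    }
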
>>
>>
>> It is impossible for Lustre IO to completely saturate the entire
>> cross-sectional bandwidth of the system OR the client CPUs, so getting
>> some amount of computation for "free" during IO is still better than
>> waiting for the IO to complete.
>>
>> For example, if we have 1000 processes each doing a 1GB write to
>> their own file, and the aggregate IO bandwidth is 10GB/s, the IO will
>> take about 100s to write if (as we currently do) we limit the amount
>> of dirty data on each client to avoid interfering with the
>> application, and no computation can be done during this time.  If we
>> allowed clients to cache that 1GB of IO it might only take 1s to
>> complete the "write" and then 99s to flush the IO to the OSTs.
>>
>> If each compute timestep takes 0.1s during IO vs 0.01s without IO,
>> then you would get 990 timesteps done during the write flush in the
>> second case before the cache was cleared, vs. none in the first case.
>> I suspect that the overhead of the MPI communication on the Lustre IO
>> is small, since the IO will be limited by the OST network and disk
>> bandwidth, which is generally a small fraction of the cross-sectional
>> bandwidth.
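
The arithmetic above, spelled out with the same assumed numbers (1000 ranks
of 1GB each, 10GB/s aggregate bandwidth, 0.1s timesteps while the flush
drains):

    /* Back-of-envelope numbers from the example above; all values are the
     * assumptions stated in the mail, not measurements. */
    #include <stdio.h>

    int main(void)
    {
        double total_gb   = 1000 * 1.0;       /* 1000 ranks x 1GB each       */
        double agg_bw     = 10.0;             /* aggregate IO bandwidth GB/s */
        double flush_time = total_gb / agg_bw;            /* 100s            */

        double cache_time = 1.0;              /* "write" into client cache   */
        double overlap    = flush_time - cache_time;      /* 99s of overlap  */

        double step_io    = 0.1;              /* timestep cost during flush  */
        printf("flush %.0fs, overlap %.0fs -> %.0f timesteps for free\n",
               flush_time, overlap, overlap / step_io);   /* 990 timesteps   */
        return 0;
    }
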
>>
>> This could be tested fairly easily with a real application that is
>> doing computation between IO, instead of a benchmark that is only
>> doing IO or only sleeping between IO, simply by increasing the
>> per-OSC write cache limit from 32MB to e.g. 1GB in the above case
>> (or 2GB to avoid the case where 2 processes on the same node are
>> writing to the same OST).  Then, measure the time taken for the
>> application to do, say, 1M timesteps and 100 checkpoints with the
>> 32MB and the 2GB write cache sizes.
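
A rough skeleton of that comparison run.  The compute and checkpoint bodies
below are trivial stand-ins for the real application, the mount point is
assumed, and the per-OSC cache size would be changed on the clients between
runs (presumably via the osc max_dirty_mb tunable, given the 32MB default
mentioned above), not from within the program:

    /* Run the same timestep/checkpoint loop and report total wallclock;
     * repeat once per cache setting and compare. */
    #include <stdio.h>
    #include <time.h>

    static void compute_timestep(void)
    {
        /* stand-in for the application's compute phase */
        volatile double x = 0.0;
        for (int i = 0; i < 100000; i++)
            x += i * 1e-9;
    }

    static void write_checkpoint(int n)
    {
        /* stand-in for the application's checkpoint IO: 64 MiB per file */
        static char buf[64 << 20];
        char path[128];
        snprintf(path, sizeof(path), "/mnt/lustre/ckpt.%d", n);
        FILE *fp = fopen(path, "w");
        if (fp) {
            fwrite(buf, 1, sizeof(buf), fp);
            fclose(fp);
        }
    }

    int main(void)
    {
        const int  checkpoints    = 100;
        const long steps_per_ckpt = 10000;    /* 100 x 10000 = 1M timesteps */

        time_t t0 = time(NULL);
        for (int c = 0; c < checkpoints; c++) {
            for (long s = 0; s < steps_per_ckpt; s++)
                compute_timestep();
            write_checkpoint(c);
        }
        printf("total wallclock: %lds\n", (long)(time(NULL) - t0));
        return 0;
    }
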
>>
>>>> The variables are aggregate network bandwidth into the distributed
>>>> checkpoint cache, which determines the checkpoint time, and
>>>> aggregate path-minimum bandwidth (i.e. lesser of network and disk
>>>> bandwidth) from the cache to disk, which determines how soon the
>>>> cache can be ready for the next checkpoint.  The cache could be
>>>> dedicated nodes and storage (e.g. flash) or additional storage on
>>>> the OSSes, or any combination of either.  And the interesting
>>>> relationship is how compute cluster utilisation varies with the
>>>> cost of the server and cache subsystems.
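
A toy illustration of those two relationships (all numbers invented): the
checkpoint stall is set by the bandwidth into the cache, and the time until
the cache can take the next checkpoint by the path-minimum bandwidth out of
it:

    #include <stdio.h>

    static double min2(double a, double b) { return a < b ? a : b; }

    int main(void)
    {
        double ckpt_gb     = 1000.0;  /* total checkpoint size, GB           */
        double cache_in_bw = 100.0;   /* aggregate bandwidth into the cache  */
        double net_bw      = 20.0;    /* cache-to-disk network bandwidth     */
        double disk_bw     = 10.0;    /* disk bandwidth behind the cache     */

        /* Time the application is stalled writing the checkpoint. */
        double ckpt_time  = ckpt_gb / cache_in_bw;
        /* Time before the cache is ready for the next checkpoint. */
        double drain_time = ckpt_gb / min2(net_bw, disk_bw);

        printf("checkpoint stall %.0fs, cache drain %.0fs\n",
               ckpt_time, drain_time);
        return 0;
    }
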
>>>
>>> The thing is, if we can just flush out data from the cache at the
>>> moment when there is no network-latency-critical activity on the app
>>> side (somehow signaled by the app), why would we need the flash
>>> storage at all?  We can write nice sequential chunks to normal disks
>>> just as fast, I presume.
>>
>> Actually, an idea I had for clusters that are running many jobs at
>> once was essentially having the apps poll for IO capacity when doing
>> a checkpoint, so that they can avoid contending with other jobs that
>> are doing checkpoints at the same time.  That way, an app might skip
>> an occasional checkpoint if the filesystem is busy, and instead
>> compute until the filesystem is less busy.
>>
>> This would be equivalent to the client node being able to cache all
>> of the write data and flush it out in the background, so long as the
>> time to flush a single checkpoint never took longer than the time
>> between checkpoints.
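
A very rough sketch of that polling behaviour.  fs_under_pressure() is a
hypothetical hook - nothing in this thread names a real interface for it -
standing in for whatever load indication the filesystem would export; here
it simply always answers "no":

    #include <stdbool.h>
    #include <stdio.h>

    static bool fs_under_pressure(void) { return false; }  /* hypothetical */
    static void compute_timesteps(long n) { (void)n; }     /* compute stub */
    static void write_checkpoint(int c) { printf("checkpoint %d\n", c); }

    int main(void)
    {
        const int  checkpoints    = 10;
        const long steps_per_ckpt = 1000;
        int skipped = 0;

        for (int c = 0; c < checkpoints; c++) {
            compute_timesteps(steps_per_ckpt);
            /* If the filesystem is busy, skip at most one checkpoint in a
             * row and keep computing instead of contending for IO. */
            if (fs_under_pressure() && skipped == 0) {
                skipped = 1;
                continue;
            }
            write_checkpoint(c);
            skipped = 0;
        }
        return 0;
    }
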
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
>
>
