[Lustre-devel] Oleg/Mike Work on Apps Metrics - FW: Mike Booth week ending 2009.03.15

Wed Mar 18 13:31:50 PDT 2009

Hello!

On Mar 16, 2009, at 8:56 AM, Eric Barton wrote:
> I _do_ agree that for some apps, if there was sufficient memory on the
> app node to buffer the local component of a checkpoint and let it
> "dribble" out to disk would achieve better utilization of the compute
> resource.  However parallel apps can be very sensitive to "noise" on
> the network they're using for inter-process communication - i.e. the
> checkpoint data has either to be written all the way to disk, or at
> least buffered somewhere so that moving it to disk will not interfere
> with the app's own communications.
> This latter concept is the basis for the "flash cache" concept.
> Actually, I think it's worth exploring the economics of it in more
> detail.

This turns out to be very true. We (well, I) do see a huge delay in,
e.g., MPI barriers issued immediately after a write.

> The variables are aggregate network bandwidth into the distributed
> checkpoint cache, which determines the checkpoint time, and aggregate
> path-minimum bandwidth (i.e. lesser of network and disk bandwidth)
> from the cache to disk, which determines how soon the cache can be
> ready for the next checkpoint.  The cache could be dedicated nodes and
> storage (e.g. flash) or additional storage on the OSSes, or any
> combination of either.  And the interesting relationship is how
> compute cluster utilisation varies with the cost of the server and
> cache subsystems.
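
To make the two variables above concrete, here is a minimal sketch in
Python. The function names and the example figures are hypothetical,
chosen only for illustration; they are not measurements from any real
system. It assumes the whole checkpoint fits in the cache and that
bandwidths are constant.

```python
GB = 10**9  # bytes; decimal gigabytes, for illustration only


def checkpoint_time(data_bytes, net_bw_into_cache):
    """Seconds needed to move the checkpoint into the cache.

    Governed by the aggregate network bandwidth into the cache.
    """
    return data_bytes / net_bw_into_cache


def drain_time(data_bytes, net_bw_to_disk, disk_bw):
    """Seconds until the cache is ready for the next checkpoint.

    Governed by the path-minimum bandwidth, i.e. the lesser of the
    network and disk bandwidth between cache and disk.
    """
    return data_bytes / min(net_bw_to_disk, disk_bw)


# Hypothetical example: a 1000 GB checkpoint, 100 GB/s into the cache,
# then 50 GB/s of network but only 20 GB/s of disk on the way out.
t_ckpt = checkpoint_time(1000 * GB, 100 * GB)
t_drain = drain_time(1000 * GB, 50 * GB, 20 * GB)
```

Under these assumed numbers the drain is disk-bound, so the drain time,
not the checkpoint time, sets the minimum spacing between checkpoints.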

The thing is, if we can flush data out of the cache at moments when
there is no network-latency-critical activity on the app side (somehow
signaled by the app), why would we need flash storage at all? We can
write nice sequential chunks to normal disks just as fast, I presume.
It is random I/O patterns that make flash shine.
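
A rough sketch of the idea, in Python. Everything here is hypothetical:
the class name, the set_quiet() signal and the in-memory list standing
in for the disk are assumptions made for illustration, not an existing
Lustre interface.

```python
class DeferredFlushCache:
    """Buffers checkpoint data and flushes it only when the app is quiet."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size  # size of each sequential write
        self.pending = bytearray()    # data not yet written to disk
        self.quiet = False            # True when the app signals no
                                      # latency-critical communication
        self.flushed = []             # stand-in for sequential disk writes

    def write(self, data):
        """Accept checkpoint data into the buffer without touching disk."""
        self.pending += data

    def set_quiet(self, quiet):
        """Hypothetical signal from the app: is it safe to use the network?"""
        self.quiet = quiet

    def flush_some(self, max_chunks):
        """Flush up to max_chunks chunks in order, only while quiet.

        Returns the number of chunks actually flushed.
        """
        n = 0
        while self.quiet and self.pending and n < max_chunks:
            chunk = bytes(self.pending[:self.chunk_size])
            del self.pending[:self.chunk_size]
            self.flushed.append(chunk)
            n += 1
        return n


cache = DeferredFlushCache(chunk_size=4)
cache.write(b"abcdefghij")
blocked = cache.flush_some(10)   # app has not signaled quiet yet
cache.set_quiet(True)
done = cache.flush_some(2)       # flush two sequential chunks
```

Chunks leave the buffer strictly in order, which is what would keep the
disk-side writes sequential in this toy model.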

Bye,
     Oleg