[Lustre-devel] Fast checkpoints in Lustre today, at essentially zero cost.
Oleg Drokin
Oleg.Drokin at Sun.COM
Wed Mar 4 09:14:17 PST 2009
Hello!
I was having a discussion with Mike Booth just now about how
application programmers are willing to sacrifice the robustness of
their checkpoint files and fall back to the previous version if the
last one did not make it to disk, as long as the checkpointing itself
is quick (even if not persistent).
I also remembered how we discussed using ramdisks on the compute
nodes, with a background agent that copies the data to Lustre later,
while the application happily continues computing.
It suddenly occurred to me (duh!) that we can get all of that for
free with the buffer cache right now. We artificially limit our dirty
memory to 32M per OSC on every node (which is probably way too low
today anyway), but if we lift that limit significantly (or remove it
altogether), such checkpointing applications (and I verified that many
of them use 10-20% of RAM for checkpointing) would benefit
tremendously (as long as they do not fsync at the end of the
checkpoint, of course).
Currently there is a way to achieve the same goal, but it requires
selecting a very suboptimal stripe pattern: stripe the file across as
many OSTs as possible with a small stripe size to maximize the allowed
dirty memory cache, at the expense of a lot of seeking on the OSTs,
since every client would then be writing to every OST.
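As a sketch, the suboptimal workaround above would look something like
this with `lfs` (the path is illustrative, and note the stripe-size
flag spelling varies between Lustre releases, `-s` in older ones and
`-S` in newer ones):

```shell
# Stripe the checkpoint file across every available OST (-c -1) with
# a small 1 MB stripe size, so the client accumulates the per-OSC
# dirty limit against every OST and the total writeback cache grows.
lfs setstripe -c -1 -S 1m /mnt/lustre/ckpt/app.ckpt

# Verify the resulting layout:
lfs getstripe /mnt/lustre/ckpt/app.ckpt
```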
The old justification for the small dirty memory limit had to do
with lock timeouts and the like, but now that we have lock prolonging
and indefinite waiting for locks on the clients, I think there is no
reason to limit ourselves anymore.
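For reference, the per-OSC limit in question can be inspected and
raised through the `max_dirty_mb` tunable; the value below is only an
illustrative choice for such an experiment, not a recommendation:

```shell
# Show the current per-OSC dirty cache limit (default is ~32 MB):
lctl get_param osc.*.max_dirty_mb

# Experimentally raise it, e.g. to 256 MB per OSC:
lctl set_param osc.*.max_dirty_mb=256
```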
I plan to speak with ORNL later today about conducting an
experiment (if they agree): lifting the dirty memory limit on some of
the filesystems often used for scratch files to see what the effect
would be. I expect it to be very positive.
At the same time, Mike is running a test right now with some
real-world applications plus the above-mentioned suboptimal striping
pattern to see the effects as well.
I think this lets us take advantage of the huge amounts of
otherwise wasted memory on compute nodes for caching, and benefit many
applications with this checkpointing model (essentially superseding
the old "flash-cache" idea at a fraction of the cost and effort).
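To illustrate the application side of this model, here is a minimal
sketch (my own, not from any of the applications mentioned) of a
checkpoint writer that keeps the previous version and deliberately
skips fsync, so the write lands in the client cache and drains to the
OSTs in the background while the application resumes computing:

```python
import os
import tempfile

def write_checkpoint(path, data):
    """Write a checkpoint without fsync: the data stays in the page
    cache and is flushed by the kernel later.  If the node dies before
    writeback completes, the previous checkpoint (kept under its own
    ".prev" name) is still available."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        # Deliberately no f.flush() + os.fsync(f.fileno()): the
        # application returns to computing immediately while the
        # cache drains in the background.
    if os.path.exists(path):
        os.replace(path, path + ".prev")  # keep the previous version
    os.replace(tmp, path)                 # atomic rename into place

# Hypothetical usage: two successive checkpoints; the older one
# survives under the ".prev" name as the fallback copy.
d = tempfile.mkdtemp()
ckpt = os.path.join(d, "app.ckpt")
write_checkpoint(ckpt, b"state at step 1000")
write_checkpoint(ckpt, b"state at step 2000")
```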
Bye,
Oleg