[Lustre-discuss] Swap over lustre

Wed Aug 17 20:54:57 PDT 2011

On Wed, Aug 17, 2011 at 8:57 PM, Joe Landman
<landman at scalableinformatics.com> wrote:
> On 08/17/2011 10:43 PM, John Hanks wrote:
> As a rule of thumb, you should try to keep the path to swap as simple as
> possible.  No memory/buffer allocations on the way to a paging event if
> you can possibly do this.

I do have a long path there; I'll try simplifying it and see if that helps.

> The lustre client (and most NFS or even network block devices) all do
> memory allocation of buffers ... which is anathema to migrating pages
> out to disk.  You can easily wind up in a "death spiral" race condition
> (and it sounds like you are there).  You might be able to do something
> with iSCSI or SRP (though these also do block allocations and could
> trigger death spirals).  If you can limit the number of buffers they
> allocate, and then force them to allocate the buffers at startup (by
> forcing some activity to the block device, and then pin this memory so
> that they can't be ejected ...) you might have a chance to do it as a
> block device.  I think SRP can do this, not sure if iSCSI initiators can
> pin buffers in ram.
>
> You might look at the swapz patches (we haven't integrated them into our
> kernel yet, but have been looking at it) to compress swap pages and
> store them ... in ram.  This may not work for you, but it could be an
> option.

I wasn't aware of swapz, that sounds really interesting. The codes
that run the nodes out of memory tend to be sequencing applications,
which seem like good candidates for memory compression.

> Is there any particular reason you can't use a local drive for this
> (such as you don't have local drives, or they aren't big/fast enough)?

We're doing this on diskless nodes. I'm not looking to get a huge
amount of swap, just enough to provide a place for the root filesystem
to page out of the tmpfs so we can squeeze out all the RAM possible
for applications. Since I don't expect it to get heavily used, I'm
considering running vblade on a server and carving out small AoE LUNs.
It seems logical that if a host can boot off iSCSI or AoE, you could
put swap space there too, but I've never tried it with either
protocol.
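
A minimal sketch of the AoE swap idea above, untested and under stated
assumptions: the shelf/slot numbers, interface name, backing-file path,
and size are all placeholders, and the helper names (run,
export_swap_lun, enable_aoe_swap) are made up for illustration. It only
prints the commands unless DRY_RUN=0 is set.

```sh
#!/bin/sh
# Sketch only: prints commands by default (DRY_RUN=1); nothing is executed
# unless DRY_RUN=0. All names, paths, and sizes below are placeholders.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Server side: create a small backing file and export it over AoE.
# vblade usage: vblade <shelf> <slot> <netif> <file>
# Note that vblade stays in the foreground when actually run.
export_swap_lun() {
    shelf="$1"; slot="$2"; netif="$3"; backing="$4"; size_mb="$5"
    run dd if=/dev/zero of="$backing" bs=1M count="$size_mb"
    run vblade "$shelf" "$slot" "$netif" "$backing"
}

# Client side: load the aoe driver and enable swap on the exported LUN.
# The aoe driver names its devices /dev/etherd/e<shelf>.<slot>.
enable_aoe_swap() {
    shelf="$1"; slot="$2"
    dev="/dev/etherd/e${shelf}.${slot}"
    run modprobe aoe
    run mkswap "$dev"
    run swapon -p 0 "$dev"
}
```

For example, `export_swap_lun 0 1 eth0 /tmp/swap-node01.img 512` on the
server and `enable_aoe_swap 0 1` on the node would print the command
sequence for one 512 MB LUN. This does nothing about the buffer
allocation problem Joe describes; it only shows the plumbing.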

FWIW, mounting a file on lustre via loopback to provide a local
scratch filesystem works really well.
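
A rough sketch of the loopback scratch setup mentioned above, assuming
an ext4 filesystem inside the image; the Lustre path, mount point, and
size are placeholders, and the helper names (run, make_loop_scratch)
are hypothetical. As written it only prints the commands; set DRY_RUN=0
to execute them.

```sh
#!/bin/sh
# Sketch only: prints commands by default (DRY_RUN=1).
# The image path, mount point, and filesystem type are assumptions.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a sparse image file on Lustre, put a filesystem in it,
# and mount it on the node through a loop device.
make_loop_scratch() {
    image="$1"; mountpoint="$2"; size_mb="$3"
    run truncate -s "${size_mb}M" "$image"
    run mkfs.ext4 -q -F "$image"        # -F: target is a regular file
    run mkdir -p "$mountpoint"
    run mount -o loop "$image" "$mountpoint"
}
```

For example, `make_loop_scratch /lustre/scratch/node01.img /scratch 4096`
prints the four commands for a 4 GB per-node scratch image.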

jbh