[Lustre-discuss] Swap over lustre

Temple Jason jtemple at cscs.ch
Wed Aug 17 23:36:56 PDT 2011


I experimented with swap on Lustre in as many ways as I could without touching the code, keeping the path to swap as short as possible, to no avail.  The code cannot handle it at all, and the system always hung.

Without serious code rewrites, this isn't going to work for you.


-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of John Hanks
Sent: Thursday, 18 August 2011 05:55
To: landman at scalableinformatics.com
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] Swap over lustre

On Wed, Aug 17, 2011 at 8:57 PM, Joe Landman
<landman at scalableinformatics.com> wrote:
> On 08/17/2011 10:43 PM, John Hanks wrote:
> As a rule of thumb, you should try to keep the path to swap as simple as
> possible.  No memory/buffer allocations on the way to a paging event if
> you can possibly do this.

I do have a long path there, will try simplifying that and see if it helps.

> The lustre client (and most NFS or even network block devices) all do
> memory allocation of buffers ... which is anathema to migrating pages
> out to disk.  You can easily wind up in a "death spiral" race condition
> (and it sounds like you are there).  You might be able to do something
> with iSCSI or SRP (though these also do block allocations and could
> trigger death spirals).  If you can limit the number of buffers they
> allocate, and then force them to allocate the buffers at startup (by
> forcing some activity to the block device, and then pin this memory so
> that they can't be ejected ...) you might have a chance to do it as a
> block device.  I think SRP can do this, not sure if iSCSI initiators can
> pin buffers in RAM.
> You might look at the swapz patches (we haven't integrated them into our
> kernel yet, but have been looking at it) to compress swap pages and
> store them ... in RAM.  This may not work for you, but it could be an
> option.

I wasn't aware of swapz, that sounds really interesting. The codes
that run the nodes out of memory tend to be sequencing applications,
which seem like good candidates for memory compression.
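
As a rough illustration of why sequence data might compress well, here is
a minimal Python sketch that compresses one page of synthetic nucleotide
text with zlib. The 4 KiB page size, the random A/C/G/T contents, and the
use of zlib are all assumptions made for illustration; this says nothing
about how any in-kernel compressed swap patch works, and real application
memory may compress very differently.

```python
import random
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes


def make_synthetic_page(seed=0):
    """Build one page of random A/C/G/T text (hypothetical stand-in for sequence data)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(PAGE_SIZE)).encode("ascii")


def compression_ratio(page):
    """Return the original size divided by the zlib-compressed size."""
    return len(page) / len(zlib.compress(page))


if __name__ == "__main__":
    page = make_synthetic_page()
    print("compression ratio: %.2f" % compression_ratio(page))
```

A four-letter alphabet carries at most two bits per byte, so even random
sequence text leaves room for compression; structured data in a real
application could do better or worse.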

> Is there any particular reason you can't use a local drive for this
> (such as you don't have local drives, or they aren't big/fast enough)?

We're doing this on diskless nodes. I'm not looking to get a huge
amount of swap, just enough to provide a place for the root filesystem
to page out of the tmpfs so we can squeeze out all the RAM possible
for applications. Since I don't expect it to get heavily used, I'm
considering running vblade on a server and carving out small AoE LUNs.
It seems logical that if a host can boot off of iSCSI or AoE, you
could have a swap space there, but I've never tried it with either.
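
For concreteness, a minimal sketch of what such an AoE setup might look
like, written as a Python helper that only builds the command lines and
does not run them. The shelf and slot numbers, the interface name, and the
backing-file path are hypothetical, and given the buffer-allocation problem
described above, nothing here implies that swapping over AoE is safe under
memory pressure.

```python
def aoe_swap_commands(shelf, slot, netif, backing_file):
    """Return (server_cmds, client_cmds) as lists of argv lists; nothing is executed."""
    # On the server, vblade exports backing_file as AoE target e<shelf>.<slot>.
    server = [["vblade", str(shelf), str(slot), netif, backing_file]]
    # On the client, the aoe driver exposes the target under /dev/etherd/.
    device = "/dev/etherd/e%d.%d" % (shelf, slot)
    client = [
        ["modprobe", "aoe"],
        ["mkswap", device],
        ["swapon", device],
    ]
    return server, client


if __name__ == "__main__":
    # Hypothetical values: shelf 0, slot 1, eth0, and a small backing file.
    server, client = aoe_swap_commands(0, 1, "eth0", "/srv/aoe/node01.swap")
    for cmd in server + client:
        print(" ".join(cmd))
```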

FWIW, mounting a file on Lustre via loopback to provide a local
scratch filesystem works really well.
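
A minimal sketch of that loopback setup, again as a Python helper that
only builds the commands and does not execute them. The image path, size,
mount point, and the choice of ext4 are assumptions for illustration, not
a record of the configuration actually used.

```python
def loopback_scratch_commands(image_path, size_mib, mountpoint):
    """Return argv lists that create, format, and loop-mount a file-backed filesystem."""
    return [
        # Allocate the backing file (it would live on the Lustre mount).
        ["dd", "if=/dev/zero", "of=" + image_path, "bs=1M", "count=%d" % size_mib],
        # -F lets mkfs.ext4 proceed on a regular file instead of a block device.
        ["mkfs.ext4", "-F", image_path],
        # Attach the file through a loop device and mount it locally.
        ["mount", "-o", "loop", image_path, mountpoint],
    ]


if __name__ == "__main__":
    # Hypothetical paths and size.
    for cmd in loopback_scratch_commands("/lustre/scratch/node01.img", 1024, "/scratch"):
        print(" ".join(cmd))
```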

Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
