[Lustre-discuss] Fwd: Lustre and Large Pages

Andreas Dilger andreas.dilger at oracle.com
Fri Aug 20 11:31:21 PDT 2010


On 2010-08-20, at 07:21, John Hammond wrote:
> Indeed, thanks.  On Ranger, the compute nodes use compact flash drives for /, and so they depend on tmpfs's for /tmp, /var/run, /var/log, and of course /dev/shm.  So cleaning up these ram backed filesystems as much as practical before asking for any hugepages is also a win.
> 
> Also, in imitation of the systems that pre-allocate all needed hugepages at boot time, we are considering the idea of first pre-allocating a large chunk of memory (say 7/8) in hugepages, then mounting the Lustre filesystems, then releasing the hugepages.  The hope is that Lustre's persistent structures will be fit into a more compact region of memory thereby.

As discussed in https://bugzilla.lustre.org/show_bug.cgi?id=14323 that I previously referenced, the Lustre tunables are based on the total number of pages, and do not take huge pages into account.

Also, if the hugepages are released, there is no guarantee that you will be able to allocate them all again, because a small pinned memory structure _somewhere_ inside a huge-page-sized region is enough to make that region unusable as a huge page.
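
To see how much of this has happened on a node, one option is to count the free blocks in /proc/buddyinfo that are still large enough to back a huge page. The following is only a rough sketch: it assumes 4 KiB base pages and 2 MiB huge pages (order 9, as on x86_64), and the function name is made up for illustration.

```python
def hugepage_sized_free_blocks(buddyinfo_text, huge_order=9):
    """Count free contiguous chunks that could each back one huge page.

    buddyinfo_text is the content of /proc/buddyinfo. Each line looks like
    "Node 0, zone Normal c0 c1 ... c10", where cN is the number of free
    blocks of 2**N contiguous base pages.
    huge_order=9 is an assumption (2 MiB huge pages on 4 KiB base pages).
    """
    total = 0
    for line in buddyinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "Node":
            continue  # skip blank or unexpected lines
        counts = [int(x) for x in fields[4:]]
        for order, count in enumerate(counts):
            if order >= huge_order:
                # A free block of a higher order holds several huge pages.
                total += count << (order - huge_order)
    return total


# Typical use on a node (not run here):
#   with open("/proc/buddyinfo") as f:
#       print(hugepage_sized_free_blocks(f.read()))
```

Comparing this number before and after releasing the hugepages gives a rough upper bound on how many could be allocated again without the kernel first compacting or reclaiming memory.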

If you are running a prologue/epilogue script then you should tune the Lustre cache size based on the number of huge pages that will be used.  The last time this was investigated, there was no way for Lustre to know from within the kernel how many huge pages were allocated without patching the kernel.  If that has changed in newer kernels, it would be possible to dynamically adjust the cache size based on this.
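
From userspace the arithmetic for such a prologue script could look roughly like the sketch below. It reads the standard MemTotal, HugePages_Total and Hugepagesize fields from /proc/meminfo and sizes the cache from what is left once the huge pages are subtracted. The 0.75 fraction and the function names are assumptions for illustration, and which Lustre tunable the result is written to (for example a client cache limit such as max_cached_mb) is left to the script author.

```python
def parse_meminfo(text):
    """Return {field name: integer value} from /proc/meminfo-style text."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            info[name.strip()] = int(parts[0])  # drop a trailing "kB" unit
    return info


def suggested_cache_mb(meminfo_text, fraction=0.75):
    """Suggest a cache limit in MB based on memory not reserved as huge pages.

    fraction is an assumed policy value, not something Lustre prescribes.
    """
    info = parse_meminfo(meminfo_text)
    # MemTotal and Hugepagesize are reported in kB; HugePages_Total is a count.
    huge_kb = info.get("HugePages_Total", 0) * info.get("Hugepagesize", 0)
    usable_kb = max(info["MemTotal"] - huge_kb, 0)
    return int(usable_kb * fraction) // 1024


# Typical use in a prologue script (not run here):
#   with open("/proc/meminfo") as f:
#       print(suggested_cache_mb(f.read()))
```

The point is only that the limit should be computed from the non-hugepage memory and not from the total page count.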


> The main obstacle in testing all of this is that benchmarking the gains gotten by a particular approach is difficult, since we have not yet found an easy way of producing external fragmentation of physical memory in short order.  Suggestions are welcome.

Running something like "slocate" across multiple filesystems will fill all of RAM with inodes/dentries, and if you pin some of these in memory (e.g. start a shell with some deep directory as CWD), you should quickly be able to fragment your memory with unfreeable inode/dentry allocations.
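
If you would rather script this than rely on slocate and a stray shell, something along these lines should have a similar effect. It is an untested sketch of the idea above, not a benchmark: it stats everything under a directory tree to populate the inode/dentry caches and keeps a file descriptor open on every Nth directory so that those entries (and their parent dentries) cannot be freed. The function names and the pin_every/max_pins parameters are made up for illustration.

```python
import os


def walk_and_pin(root, pin_every=100, max_pins=500):
    """lstat() everything under root and hold some directories open.

    Returns (entries_seen, pinned_fds). The open descriptors keep the
    corresponding dentries and inodes pinned until release() is called.
    """
    seen = 0
    pinned = []
    for dirpath, dirnames, filenames in os.walk(root, onerror=lambda e: None):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished or is unreadable; ignore it
            seen += 1
            if seen % pin_every == 0 and len(pinned) < max_pins:
                try:
                    pinned.append(os.open(dirpath, os.O_RDONLY))
                except OSError:
                    pass
    return seen, pinned


def release(pinned):
    """Close the descriptors returned by walk_and_pin()."""
    for fd in pinned:
        os.close(fd)
```

Keep max_pins below the process file descriptor limit, and run it over several large filesystems before requesting the hugepages.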

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



