<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">


<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">




<title>Re: [Lustre-discuss] Lustre buffer cache causes large system overhead.</title>


</head>


<body>


Watch for swapping now. Turning zone reclaim on can cause the machine to swap if the memory use goes outside of the NUMA node.<br>


<br>


You don't have much memory, though (which IMHO is the real issue),<br>


so this may not affect you.<br>
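<br>
One rough way to watch for that is to sample the swap counters in /proc/vmstat before and after a run.<br>
The sketch below is only an illustration: it assumes a Linux client that exposes the pswpin and pswpout counters,<br>
and the helper names (parse_vmstat, swap_delta, read_vmstat) are made up for this example.<br>
<pre>
```python
def parse_vmstat(text):
    """Return a dict mapping each /proc/vmstat counter name to its integer value."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Each valid line is "name value"; anything else is skipped.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Pages swapped (in, out) between two parsed /proc/vmstat snapshots."""
    swapped_in = after.get("pswpin", 0) - before.get("pswpin", 0)
    swapped_out = after.get("pswpout", 0) - before.get("pswpout", 0)
    return (swapped_in, swapped_out)


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())


# Hypothetical usage on the client, around a benchmark run:
#   first = read_vmstat()
#   ... run the job ...
#   print(swap_delta(first, read_vmstat()))
```
</pre>
If both numbers stay at zero across a run, the node did not swap during that interval.<br>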


<br>


-----Original Message-----<br>


<b>From: </b>Dragseth Roy Einar [<a href="mailto:roy.dragseth@uit.no">roy.dragseth@uit.no</a>]<br>


<b>Sent: </b>Friday, August 23, 2013 03:09 PM Central Standard Time<br>


<b>To: </b>lustre-discuss@lists.lustre.org<br>


<b>Subject: </b>Re: [Lustre-discuss] Lustre buffer cache causes large system overhead.<br>


<br>




<p><font size="2">Thanks for the suggestion!  It didn't help, but as I read the documentation on<br>


vfs_cache_pressure in the kernel docs I noticed the next parameter,<br>


zone_reclaim_mode, which looked like it might be worth fiddling with.  And what<br>


do you know, changing it from 0 to 1 made the system overhead vanish<br>


immediately!<br>


<br>


I must admit I do not completely understand why this helps, but it seems to do<br>


the trick in my case.  We'll put<br>


<br>


vm.zone_reclaim_mode = 1<br>


<br>


into /etc/sysctl.conf from now on.<br>
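<br>
For anyone checking their own nodes: the kernel documentation describes zone_reclaim_mode as a bitmask<br>
(1 = zone reclaim on, 2 = zone reclaim writes dirty pages out, 4 = zone reclaim swaps pages).<br>
Below is a minimal sketch for decoding a value; the function names are hypothetical, and the /proc path<br>
is assumed to be present, which is only the case on NUMA-enabled Linux kernels.<br>
<pre>
```python
# Bit meanings as listed in the kernel's vm sysctl documentation.
ZONE_RECLAIM_BITS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def describe_zone_reclaim_mode(value):
    """Return the list of behaviours enabled by a zone_reclaim_mode value."""
    enabled = []
    for bit, text in sorted(ZONE_RECLAIM_BITS.items()):
        # Integer division plus modulo tests whether this bit is set.
        if (value // bit) % 2 == 1:
            enabled.append(text)
    return enabled or ["zone reclaim off"]


def read_zone_reclaim_mode(path="/proc/sys/vm/zone_reclaim_mode"):
    """Read the current setting from the running kernel (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


# Hypothetical usage:
#   print(describe_zone_reclaim_mode(read_zone_reclaim_mode()))
```
</pre>
With the value 1, only the first bit is set, so the write-out and swap behaviours stay disabled.<br>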


<br>


Thanks to all for the hints and comments on this.<br>


<br>


A nice weekend to everyone, mine for sure is going to be...<br>


r.<br>


<br>


<br>


On Friday 23. August 2013 09.36.34 Scott Nolin wrote:<br>


> You might also try increasing the vfs_cache_pressure.<br>


><br>


> This will reclaim inode and dentry caches faster. Maybe that's the<br>


> problem, not page caches.<br>


><br>


> To be clear - I have no deep insight into Lustre's use of the client<br>


> cache, but you said you have lots of small files, which if lustre uses<br>


> the cache system like other filesystems means it may be inodes/dentries.<br>


> Filling up the page cache with files like you did in your other tests<br>


> wouldn't have the same effect. Just my guess here.<br>


><br>


> We had some experience years ago with the opposite sort of problem. We<br>


> have a big ftp server, and we want to *keep* inode/dentry data in the<br>


> linux cache, as there are often stupid numbers of files in directories.<br>


> Files were always flowing through the server, so the page cache would<br>


> force out the inode cache. Was surprised to find with linux there's no<br>


> ability to set a fixed inode cache size - the best you can do is<br>


> "suggest" with the cache pressure tunable.<br>


><br>


> Scott<br>


><br>


> On 8/23/2013 6:29 AM, Dragseth Roy Einar wrote:<br>


> > I tried to change swappiness from 0 to 95 but it did not have any impact on<br>


> > the system overhead.<br>


> ><br>


> > r.<br>


> ><br>


> > On Thursday 22. August 2013 15.38.37 Dragseth Roy Einar wrote:<br>


> >> No, I cannot detect any swap activity on the system.<br>


> >><br>


> >> r.<br>


> >><br>


> >> On Thursday 22. August 2013 09.21.33 you wrote:<br>


> >>> Is this slowdown due to increased swap activity?  If "yes", then try<br>


> >>> lowering the "swappiness" value.  This will sacrifice buffer cache space<br>


> >>> to<br>


> >>> lower swap activity.<br>


> >>><br>


> >>> Take a look at <a href="http://en.wikipedia.org/wiki/Swappiness">http://en.wikipedia.org/wiki/Swappiness</a>.<br>


> >>><br>


> >>> Roger S.<br>


> >>><br>


> >>> On 08/22/2013 08:51 AM, Roy Dragseth wrote:<br>


> >>>> We have just discovered that a large buffer cache generated from<br>


> >>>> traversing a lustre file system will cause a significant system<br>


> >>>> overhead<br>


> >>>> for applications with high memory demands.  We have seen a 50% slowdown<br>


> >>>> or worse for applications.  Even High Performance Linpack, which has no<br>


> >>>> file IO whatsoever is affected.  The only remedy seems to be to empty<br>


> >>>> the<br>


> >>>> buffer cache from memory by running "echo 3 > /proc/sys/vm/drop_caches"<br>


> >>>><br>


> >>>> Any hints on how to improve the situation are greatly appreciated.<br>


> >>>><br>


> >>>><br>


> >>>> System setup:<br>


> >>>> Client: Dual socket Sandy Bridge, with 32GB ram and infiniband<br>


> >>>> connection<br>


> >>>> to lustre server.  CentOS 6.4, with kernel 2.6.32-358.11.1.el6.x86_64<br>


> >>>> and<br>


> >>>> lustre v2.1.6 rpms downloaded from whamcloud download site.<br>


> >>>><br>


> >>>> Lustre: 1 MDS and 4 OSS running Lustre 2.1.3 (also from whamcloud<br>


> >>>> site).<br>


> >>>> Each OSS has 12 OST, total 1.1 PB storage.<br>


> >>>><br>


> >>>> How to reproduce:<br>


> >>>><br>


> >>>> Traverse the lustre file system until the buffer cache is large enough.<br>


> >>>> In our case we run<br>


> >>>><br>


> >>>>    find . -type f -print0 | xargs -0 cat > /dev/null<br>


> >>>><br>


> >>>> on the client until the buffer cache reaches ~15-20GB.  (The lustre<br>


> >>>> file<br>


> >>>> system has lots of small files so this takes up to an hour.)<br>


> >>>><br>


> >>>> Kill the find process and start a single node parallel application, we<br>


> >>>> use<br>


> >>>> HPL (high performance linpack).  We run on all 16 cores on the system<br>


> >>>> with 1GB ram per core (a normal run should complete in appr. 150<br>


> >>>> seconds.)  The system monitoring shows a 10-20% system cpu overhead and<br>


> >>>> the HPL run takes more than 200 secs.  After running "echo 3 ><br>


> >>>> /proc/sys/vm/drop_caches" the system performance goes back to normal<br>


> >>>> with<br>


> >>>> a run time at 150 secs.<br>


> >>>><br>


> >>>> I've created an infographic from our ganglia graphs for the above<br>


> >>>> scenario.<br>


> >>>><br>


> >>>> <a href="https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead.png">https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead.png</a><br>


> >>>><br>


> >>>> Attached is an excerpt from perf top indicating that the kernel routine<br>


> >>>> taking the most time is _spin_lock_irqsave if that means anything to<br>


> >>>> anyone.<br>


> >>>><br>


> >>>><br>


> >>>> Things tested:<br>


> >>>><br>


> >>>> It does not seem to matter if we mount lustre over infiniband or<br>


> >>>> ethernet.<br>


> >>>><br>


> >>>> Filling the buffer cache with files from an NFS filesystem does not<br>


> >>>> degrade<br>


> >>>> performance.<br>


> >>>><br>


> >>>> Filling the buffer cache with one large file does not give degraded<br>


> >>>> performance. (tested with iozone)<br>


> >>>><br>


> >>>><br>


> >>>> Again, any hints on how to proceed are greatly appreciated.<br>


> >>>><br>


> >>>><br>


> >>>> Best regards,<br>


> >>>> Roy.<br>


> >>>><br>


> >>>><br>


> >>>><br>


> >>>> _______________________________________________<br>


> >>>> Lustre-discuss mailing list<br>


> >>>> Lustre-discuss@lists.lustre.org<br>


> >>>> <a href="http://lists.lustre.org/mailman/listinfo/lustre-discuss">http://lists.lustre.org/mailman/listinfo/lustre-discuss</a><br>


--<br>


<br>


  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.<br>


              phone:+47 77 64 41 07, fax:+47 77 64 41 00<br>


        Roy Dragseth, Team Leader, High Performance Computing<br>


         Direct call: +47 77 64 62 56. email: roy.dragseth@uit.no<br>




</font></p>


</body>


</html>