[Lustre-discuss] Lustre buffer cache causes large system overhead.

Dragseth Roy Einar roy.dragseth at uit.no
Fri Aug 23 13:08:49 PDT 2013


Thanks for the suggestion!  It didn't help, but while reading the kernel
documentation on vfs_cache_pressure I noticed the next parameter,
zone_reclaim_mode, which looked like it might be worth fiddling with.  And what
do you know, changing it from 0 to 1 made the system overhead vanish
immediately!

I must admit I do not completely understand why this helps, but it seems to do 
the trick in my case.  We'll put 

vm.zone_reclaim_mode = 1 

into /etc/sysctl.conf from now on.
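
For reference, applying the setting on a running node and making it persistent
is just standard sysctl usage (a minimal sketch, nothing Lustre-specific):

    # Apply immediately on the running client (no reboot needed):
    sysctl -w vm.zone_reclaim_mode=1

    # Persist across reboots:
    echo "vm.zone_reclaim_mode = 1" >> /etc/sysctl.conf
    sysctl -p                              # reload /etc/sysctl.conf

    # Verify the current value:
    cat /proc/sys/vm/zone_reclaim_mode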

Thanks to all for the hints and comments on this.

A nice weekend to everyone, mine for sure is going to be...
r.


On Friday 23. August 2013 09.36.34 Scott Nolin wrote:
> You might also try increasing the vfs_cache_pressure.
> 
> This will reclaim inode and dentry caches faster. Maybe that's the
> problem, not page caches.
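
(A minimal sketch of the tunable being suggested here; the default is 100 and
the value below is only an example, not a recommendation:)

    # Show the current value:
    sysctl vm.vfs_cache_pressure

    # Values above 100 make the kernel reclaim dentry/inode caches more
    # aggressively relative to the page cache:
    sysctl -w vm.vfs_cache_pressure=200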
> 
> To be clear - I have no deep insight into Lustre's use of the client
> cache, but you said you have lots of small files, which, if Lustre uses
> the cache system like other filesystems do, means it may be inodes/dentries.
> Filling up the page cache with files like you did in your other tests
> wouldn't have the same effect.  Just my guess here.
> 
> We had some experience years ago with the opposite sort of problem. We
> have a big ftp server, and we want to *keep* inode/dentry data in the
> linux cache, as there are often stupid numbers of files in directories.
> Files were always flowing through the server, so the page cache would
> force out the inode cache.  I was surprised to find that with Linux there's
> no way to set a fixed inode cache size - the best you can do is
> "suggest" with the cache pressure tunable.
> 
> Scott
> 
> On 8/23/2013 6:29 AM, Dragseth Roy Einar wrote:
> > I tried to change swappiness from 0 to 95 but it did not have any impact
> > on the system overhead.
> > 
> > r.
> > 
> > On Thursday 22. August 2013 15.38.37 Dragseth Roy Einar wrote:
> >> No, I cannot detect any swap activity on the system.
> >> 
> >> r.
> >> 
> >> On Thursday 22. August 2013 09.21.33 you wrote:
> >>> Is this slowdown due to increased swap activity?  If "yes", then try
> >>> lowering the "swappiness" value.  This will sacrifice buffer cache space
> >>> to lower swap activity.
> >>> 
> >>> Take a look at http://en.wikipedia.org/wiki/Swappiness.
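
(Checking for swap activity and lowering swappiness are both one-liners; the
value 10 below is just an illustration:)

    # The si/so columns should stay at 0 if the machine is not swapping:
    vmstat 1 5

    # Prefer reclaiming cache over swapping out application pages:
    sysctl -w vm.swappiness=10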
> >>> 
> >>> Roger S.
> >>> 
> >>> On 08/22/2013 08:51 AM, Roy Dragseth wrote:
> >>>> We have just discovered that a large buffer cache generated from
> >>>> traversing a Lustre file system will cause significant system overhead
> >>>> for applications with high memory demands.  We have seen a 50% slowdown
> >>>> or worse.  Even High Performance Linpack, which has no file I/O
> >>>> whatsoever, is affected.  The only remedy seems to be to empty the
> >>>> buffer cache by running "echo 3 > /proc/sys/vm/drop_caches".
> >>>> 
> >>>> Any hints on how to improve the situation are greatly appreciated.
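
(The flush sequence spelled out; the preceding sync is the usual advice from
the kernel documentation, so that dirty pages are written back first:)

    sync                                  # write back dirty pages first
    echo 3 > /proc/sys/vm/drop_caches     # 1 = page cache, 2 = dentries+inodes, 3 = both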
> >>>> 
> >>>> 
> >>>> System setup:
> >>>> Client: dual-socket Sandy Bridge with 32 GB RAM and an InfiniBand
> >>>> connection to the Lustre server.  CentOS 6.4, with kernel
> >>>> 2.6.32-358.11.1.el6.x86_64 and Lustre v2.1.6 RPMs downloaded from the
> >>>> Whamcloud download site.
> >>>> 
> >>>> Lustre: 1 MDS and 4 OSSes running Lustre 2.1.3 (also from the Whamcloud
> >>>> site).  Each OSS has 12 OSTs, for a total of 1.1 PB of storage.
> >>>> 
> >>>> How to reproduce:
> >>>> 
> >>>> Traverse the Lustre file system until the buffer cache is large enough.
> >>>> In our case we run
> >>>> 
> >>>>    find . -type f -print0 | xargs -0 cat > /dev/null
> >>>> 
> >>>> on the client until the buffer cache reaches ~15-20 GB.  (The Lustre
> >>>> file system has lots of small files, so this takes up to an hour.)
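
(While the traversal runs, the cache growth can be watched from a second
shell; the 60-second interval is arbitrary:)

    watch -n 60 "grep -E '^(Cached|Buffers|Slab):' /proc/meminfo"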
> >>>> 
> >>>> Kill the find process and start a single-node parallel application;
> >>>> we use HPL (High Performance Linpack).  We run on all 16 cores of the
> >>>> system with 1 GB of RAM per core (a normal run completes in approx.
> >>>> 150 seconds).  The system monitoring shows 10-20% system CPU overhead
> >>>> and the HPL run takes more than 200 seconds.  After running "echo 3 >
> >>>> /proc/sys/vm/drop_caches" the system performance goes back to normal,
> >>>> with a run time of 150 seconds.
> >>>> 
> >>>> I've created an infographic from our ganglia graphs for the above
> >>>> scenario.
> >>>> 
> >>>> https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead.png
> >>>> 
> >>>> Attached is an excerpt from perf top indicating that the kernel routine
> >>>> taking the most time is _spin_lock_irqsave, if that means anything to
> >>>> anyone.
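
(Roughly how such a profile can be gathered; the exact options used for the
attached excerpt are not stated, so this is just the generic recipe:)

    perf top                          # live system-wide profile
    perf record -a -g -- sleep 30     # or record a 30-second window
    perf report --sort symbol | head -n 40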
> >>>> 
> >>>> 
> >>>> Things tested:
> >>>> 
> >>>> It does not seem to matter whether we mount Lustre over InfiniBand or
> >>>> Ethernet.
> >>>> 
> >>>> Filling the buffer cache with files from an NFS file system does not
> >>>> degrade performance.
> >>>> 
> >>>> Filling the buffer cache with one large file does not degrade
> >>>> performance (tested with IOzone).
> >>>> 
> >>>> 
> >>>> Again, any hints on how to proceed are greatly appreciated.
> >>>> 
> >>>> 
> >>>> Best regards,
> >>>> Roy.
> >>>> 
> >>>> 
> >>>> 
> >>>> _______________________________________________
> >>>> Lustre-discuss mailing list
> >>>> Lustre-discuss at lists.lustre.org
> >>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
-- 

  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
	      phone:+47 77 64 41 07, fax:+47 77 64 41 00
        Roy Dragseth, Team Leader, High Performance Computing
	 Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no


