[Lustre-discuss] Lustre buffer cache causes large system overhead.

Dragseth Roy Einar roy.dragseth at uit.no
Sat Aug 24 01:08:06 PDT 2013


The kernel docs for zone_reclaim_mode indicate that a value of 0 makes sense 
on dedicated file servers like the MDS/OSS, as fetching cached data from 
another NUMA domain is much faster than going all the way to disk.  For 
clients that need the memory for computations, a value of 1 seems to be the 
way to go, as (I guess) it reduces the cross-domain traffic.
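
For anyone who wants to experiment, a minimal sketch of checking and switching 
the mode at runtime (standard sysctl/numactl tools, nothing Lustre-specific 
assumed):

    # show the NUMA layout and the current reclaim mode
    numactl --hardware
    sysctl vm.zone_reclaim_mode

    # compute client: reclaim the local zone before spilling to other nodes
    sysctl -w vm.zone_reclaim_mode=1

    # dedicated MDS/OSS: let the cache spread across NUMA domains
    sysctl -w vm.zone_reclaim_mode=0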

r.

On Friday 23. August 2013 13.59.44 Patrick Shopbell wrote:
> Hi all -
> I have watched this thread with much interest, and now I am even
> more interested/confused.  :-)
> 
> Several months back, we had a very substantial slowdown on our
> MDS box. Interactive use of the box was very sluggish, even
> though the load was quite low. This was eventually solved by
> setting the opposite value for the variable in question:
> 
> vm.zone_reclaim_mode = 0
> 
> And it was equally dramatic in its solution of our problem - the MDS
> started responding normally immediately afterwards. We went ahead
> and set the value to zero on all of our NUMA machines. (We are
> running Lustre 2.3.)
> 
> Clearly, I need to do some reading on Lustre and its various caching
> issues. This has been a quite interesting discussion.
> 
> Thanks everyone for such a great list.
> --
> Patrick
> 
> *--------------------------------------------------------------------*
> 
> | Patrick Shopbell               Department of Astronomy             |
> | pls at astro.caltech.edu          Mail Code 249-17                    |
> | (626) 395-4097                 California Institute of Technology  |
> | (626) 568-9352  (FAX)          Pasadena, CA 91125                 |
> | WWW: http://www.astro.caltech.edu/~pls/                            |
> 
> *--------------------------------------------------------------------*
> 
> On 8/23/13 1:08 PM, Dragseth Roy Einar wrote:
> > Thanks for the suggestion!  It didn't help, but as I read the
> > documentation on vfs_cache_pressure in the kernel docs I noticed the next
> > parameter, zone_reclaim_mode, which looked like it might be worth
> > fiddling with.  And what do you know, changing it from 0 to 1 made the
> > system overhead vanish immediately!
> > 
> > I must admit I do not completely understand why this helps, but it seems
> > to do the trick in my case.  We'll put
> > 
> > vm.zone_reclaim_mode = 1
> > 
> > into /etc/sysctl.conf from now on.
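> > 
> > For the record, a minimal sketch of applying it immediately and making it
> > persistent (assuming root on a stock EL6 box):
> > 
> >     # apply now
> >     sysctl -w vm.zone_reclaim_mode=1
> > 
> >     # keep it across reboots
> >     echo "vm.zone_reclaim_mode = 1" >> /etc/sysctl.conf
> >     sysctl -p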
> > 
> > Thanks to all for the hints and comments on this.
> > 
> > A nice weekend to everyone, mine for sure is going to be...
> > r.
> > 
> > On Friday 23. August 2013 09.36.34 Scott Nolin wrote:
> >> You might also try increasing the vfs_cache_pressure.
> >> 
> >> This will reclaim inode and dentry caches faster. Maybe that's the
> >> problem, not page caches.
> >> 
> >> To be clear - I have no deep insight into Lustre's use of the client
> >> cache, but you said you have lots of small files, which, if Lustre uses
> >> the cache system like other filesystems do, means it may be
> >> inodes/dentries.  Filling up the page cache with files like you did in
> >> your other tests wouldn't have the same effect.  Just my guess here.
> >> 
> >> We had some experience years ago with the opposite sort of problem. We
> >> have a big ftp server, and we want to *keep* inode/dentry data in the
> >> linux cache, as there are often stupid numbers of files in directories.
> >> Files were always flowing through the server, so the page cache would
> >> force out the inode cache.  I was surprised to find that with Linux
> >> there's no way to set a fixed inode cache size - the best you can do is
> >> "suggest" with the cache pressure tunable.
> >> 
> >> Scott
> >> 
> >> On 8/23/2013 6:29 AM, Dragseth Roy Einar wrote:
> >>> I tried to change swappiness from 0 to 95 but it did not have any
> >>> impact on the system overhead.
> >>> 
> >>> r.
> >>> 
> >>> On Thursday 22. August 2013 15.38.37 Dragseth Roy Einar wrote:
> >>>> No, I cannot detect any swap activity on the system.
> >>>> 
> >>>> r.
> >>>> 
> >>>> On Thursday 22. August 2013 09.21.33 you wrote:
> >>>>> Is this slowdown due to increased swap activity?  If "yes", then try
> >>>>> lowering the "swappiness" value.  This will sacrifice buffer cache
> >>>>> space to lower swap activity.
> >>>>> 
> >>>>> Take a look at http://en.wikipedia.org/wiki/Swappiness.
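> >>>>> 
> >>>>> A quick sketch of what to look at (plain vmstat and sysctl, nothing
> >>>>> site-specific assumed):
> >>>>> 
> >>>>>     # si/so columns show pages swapped in/out per interval
> >>>>>     vmstat 5
> >>>>> 
> >>>>>     # lower swappiness so the kernel prefers dropping cache to swapping
> >>>>>     sysctl -w vm.swappiness=10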
> >>>>> 
> >>>>> Roger S.
> >>>>> 
> >>>>> On 08/22/2013 08:51 AM, Roy Dragseth wrote:
> >>>>>> We have just discovered that a large buffer cache generated from
> >>>>>> traversing a lustre file system will cause a significant system
> >>>>>> overhead for applications with high memory demands.  We have seen a
> >>>>>> 50% slowdown or worse for applications.  Even High Performance
> >>>>>> Linpack, which has no file IO whatsoever, is affected.  The only
> >>>>>> remedy seems to be to empty the buffer cache from memory by running
> >>>>>> "echo 3 > /proc/sys/vm/drop_caches".
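> >>>>>> 
> >>>>>> For reference, a minimal sketch of that workaround (syncing first is
> >>>>>> a common precaution so dirty pages are written back before the caches
> >>>>>> are dropped; needs root):
> >>>>>> 
> >>>>>>     sync
> >>>>>>     echo 3 > /proc/sys/vm/drop_caches   # page cache + dentries/inodes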
> >>>>>> 
> >>>>>> Any hints on how to improve the situation are greatly appreciated.
> >>>>>> 
> >>>>>> 
> >>>>>> System setup:
> >>>>>> Client: dual-socket Sandy Bridge with 32GB RAM and an infiniband
> >>>>>> connection to the lustre server.  CentOS 6.4, with kernel
> >>>>>> 2.6.32-358.11.1.el6.x86_64 and lustre v2.1.6 rpms downloaded from the
> >>>>>> whamcloud download site.
> >>>>>> 
> >>>>>> Lustre: 1 MDS and 4 OSSes running Lustre 2.1.3 (also from the
> >>>>>> whamcloud site).  Each OSS has 12 OSTs, 1.1 PB storage in total.
> >>>>>> 
> >>>>>> How to reproduce:
> >>>>>> 
> >>>>>> Traverse the lustre file system until the buffer cache is large
> >>>>>> enough.  In our case we run
> >>>>>> 
> >>>>>>     find . -type f -print0 | xargs -0 cat > /dev/null
> >>>>>> 
> >>>>>> on the client until the buffer cache reaches ~15-20GB.  (The lustre
> >>>>>> file system has lots of small files so this takes up to an hour.)
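> >>>>>> 
> >>>>>> (To watch the cache grow while the traversal runs, something like
> >>>>>> 
> >>>>>>     watch -n 30 'grep -E "^(MemFree|Buffers|Cached)" /proc/meminfo'
> >>>>>> 
> >>>>>> is enough.)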
> >>>>>> 
> >>>>>> Kill the find process and start a single-node parallel application;
> >>>>>> we use HPL (High Performance Linpack).  We run on all 16 cores of the
> >>>>>> system with 1GB ram per core (a normal run should complete in approx.
> >>>>>> 150 seconds).  The system monitoring shows 10-20% system cpu overhead
> >>>>>> and the HPL run takes more than 200 secs.  After running "echo 3 >
> >>>>>> /proc/sys/vm/drop_caches" the system performance goes back to normal
> >>>>>> with a run time of 150 secs.
> >>>>>> 
> >>>>>> I've created an infographic from our ganglia graphs for the above
> >>>>>> scenario:
> >>>>>> 
> >>>>>> https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead.png
> >>>>>> 
> >>>>>> Attached is an excerpt from perf top indicating that the kernel
> >>>>>> routine taking the most time is _spin_lock_irqsave, if that means
> >>>>>> anything to anyone.
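> >>>>>> 
> >>>>>> For anyone who wants to reproduce the profile, a sketch with the
> >>>>>> stock perf tools (system-wide sample while HPL is running):
> >>>>>> 
> >>>>>>     perf top                      # live view of the hottest symbols
> >>>>>>     perf record -a -g sleep 30    # or record a 30s system-wide profile
> >>>>>>     perf report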
> >>>>>> 
> >>>>>> 
> >>>>>> Things tested:
> >>>>>> 
> >>>>>> It does not seem to matter if we mount lustre over infiniband or
> >>>>>> ethernet.
> >>>>>> 
> >>>>>> Filling the buffer cache with files from an NFS filesystem does not
> >>>>>> degrade performance.
> >>>>>> 
> >>>>>> Filling the buffer cache with one large file does not degrade
> >>>>>> performance either (tested with iozone).
> >>>>>> 
> >>>>>> 
> >>>>>> Again, any hints on how to proceed are greatly appreciated.
> >>>>>> 
> >>>>>> 
> >>>>>> Best regards,
> >>>>>> Roy.
> >>>>>> 
> >>>>>> 
> >>>>>> 
> 
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
-- 

  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
	      phone:+47 77 64 41 07, fax:+47 77 64 41 00
        Roy Dragseth, Team Leader, High Performance Computing
	 Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no


