[Lustre-discuss] Lustre buffer cache causes large system overhead.

Brian O'Connor briano at sgi.com
Fri Aug 23 18:58:59 PDT 2013


Watch for swapping now. Turning zone reclaim on can cause the machine to swap if a process's memory use grows beyond its local NUMA node.

That said, you don't have much memory (which IMHO is the real issue), so this may not affect you.
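
If you want to keep an eye on it, the standard tools should be enough;
something along these lines (the 5 second interval is only an example):

  # si/so columns going non-zero means the node has started swapping
  vmstat 5

  # per-node counters; a growing numa_miss/numa_foreign count means
  # allocations are spilling over to the remote NUMA node
  numastat

  # quick look at current swap usage
  free -g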

-----Original Message-----
From: Dragseth Roy Einar [roy.dragseth at uit.no]
Sent: Friday, August 23, 2013 03:09 PM Central Standard Time
To: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] Lustre buffer cache causes large system overhead.


Thanks for the suggestion!  It didn't help, but while reading up on
vfs_cache_pressure in the kernel docs I noticed the next parameter,
zone_reclaim_mode, which looked like it might be worth fiddling with.  And
what do you know, changing it from 0 to 1 made the system overhead vanish
immediately!

I must admit I do not completely understand why this helps, but it seems to do
the trick in my case.  We'll put

vm.zone_reclaim_mode = 1

into /etc/sysctl.conf from now on.
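
In case it is useful to others: applying it both at runtime and
persistently is just the standard sysctl procedure (the commands below
assume root):

  # take effect immediately
  sysctl -w vm.zone_reclaim_mode=1

  # survive a reboot: with the line above in /etc/sysctl.conf, reload it
  sysctl -p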

Thanks to all for the hints and comments on this.

A nice weekend to everyone, mine for sure is going to be...
r.


On Friday 23. August 2013 09.36.34 Scott Nolin wrote:
> You might also try increasing the vfs_cache_pressure.
>
> This will reclaim inode and dentry caches faster. Maybe that's the
> problem, not page caches.
>
> To be clear - I have no deep insight into Lustre's use of the client
> cache, but you said you have lots of small files, which, if Lustre uses
> the cache system like other filesystems do, means the culprit may be
> inodes/dentries. Filling up the page cache with files like you did in
> your other tests wouldn't have the same effect. Just my guess here.
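>
> If you want to test that theory, something along these lines should do
> (200 is only an example value; the default is 100, and higher means the
> kernel reclaims dentries/inodes more aggressively):
>
>    # how much of the slab is dentries and inodes right now
>    grep -E 'dentry|inode' /proc/slabinfo
>
>    # make the kernel more willing to drop them
>    sysctl -w vm.vfs_cache_pressure=200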
>
> We had some experience years ago with the opposite sort of problem. We
> have a big FTP server, and we want to *keep* inode/dentry data in the
> Linux cache, as there are often stupid numbers of files in directories.
> Files were always flowing through the server, so the page cache would
> force out the inode cache. I was surprised to find there's no way in
> Linux to set a fixed inode cache size - the best you can do is
> "suggest" with the cache pressure tunable.
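>
> ("Suggesting" that they be kept is the same tunable with a low value,
> for example
>
>    sysctl -w vm.vfs_cache_pressure=10
>
> Setting it to 0 means never reclaim dentries/inodes at all, which the
> kernel docs warn can lead to out-of-memory conditions.)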
>
> Scott
>
> On 8/23/2013 6:29 AM, Dragseth Roy Einar wrote:
> > I tried to change swappiness from 0 to 95, but it did not have any impact
> > on the system overhead.
> >
> > r.
> >
> > On Thursday 22. August 2013 15.38.37 Dragseth Roy Einar wrote:
> >> No, I cannot detect any swap activity on the system.
> >>
> >> r.
> >>
> >> On Thursday 22. August 2013 09.21.33 you wrote:
> >>> Is this slowdown due to increased swap activity?  If "yes", then try
> >>> lowering the "swappiness" value.  This will sacrifice buffer cache space
> >>> to
> >>> lower swap activity.
> >>>
> >>> Take a look at http://en.wikipedia.org/wiki/Swappiness.
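> >>>
> >>> A quick way to check the current value and try a lower one (60 is the
> >>> usual default; 10 below is only an example - lower means the kernel
> >>> prefers dropping page cache over swapping anonymous memory):
> >>>
> >>>    cat /proc/sys/vm/swappiness
> >>>    sysctl -w vm.swappiness=10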
> >>>
> >>> Roger S.
> >>>
> >>> On 08/22/2013 08:51 AM, Roy Dragseth wrote:
> >>>> We have just discovered that a large buffer cache generated from
> >>>> traversing a Lustre file system causes significant system overhead
> >>>> for applications with high memory demands.  We have seen a 50% slowdown
> >>>> or worse for applications.  Even High Performance Linpack, which does no
> >>>> file IO whatsoever, is affected.  The only remedy seems to be to empty
> >>>> the buffer cache by running "echo 3 > /proc/sys/vm/drop_caches".
> >>>>
> >>>> Any hints on how to improve the situation are greatly appreciated.
> >>>>
> >>>>
> >>>> System setup:
> >>>> Client: Dual socket Sandy Bridge, with 32GB ram and infiniband
> >>>> connection
> >>>> to lustre server.  CentOS 6.4, with kernel 2.6.32-358.11.1.el6.x86_64
> >>>> and
> >>>> lustre v2.1.6 rpms downloaded from whamcloud download site.
> >>>>
> >>>> Lustre: 1 MDS and 4 OSS running Lustre 2.1.3 (also from whamcloud
> >>>> site).
> >>>> Each OSS has 12 OST, total 1.1 PB storage.
> >>>>
> >>>> How to reproduce:
> >>>>
> >>>> Traverse the Lustre file system until the buffer cache is large enough.
> >>>> In our case we run
> >>>>
> >>>>    find . -type f -print0 | xargs -0 cat > /dev/null
> >>>>
> >>>> on the client until the buffer cache reaches ~15-20GB.  (The Lustre file
> >>>> system has lots of small files, so this takes up to an hour.)
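> >>>>
> >>>> (Whether the memory ends up as page cache or as inode/dentry slab can
> >>>> be checked while the find runs, for example with
> >>>>
> >>>>    grep -E 'Cached|SReclaimable' /proc/meminfo
> >>>>
> >>>> where Cached is page cache and SReclaimable is mostly the dentry/inode
> >>>> slabs.)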
> >>>>
> >>>> Kill the find process and start a single-node parallel application; we
> >>>> use HPL (High Performance Linpack).  We run on all 16 cores of the system
> >>>> with 1GB RAM per core (a normal run should complete in approx. 150
> >>>> seconds).  The system monitoring shows 10-20% system CPU overhead, and
> >>>> the HPL run takes more than 200 seconds.  After running "echo 3 >
> >>>> /proc/sys/vm/drop_caches" the system performance goes back to normal,
> >>>> with a run time of 150 seconds.
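> >>>>
> >>>> (For anyone repeating this: running "sync" before the drop is a good
> >>>> idea, since drop_caches only discards clean pages:
> >>>>
> >>>>    sync
> >>>>    echo 3 > /proc/sys/vm/drop_caches
> >>>>
> >>>> The cache size can be compared before and after with "free -g".)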
> >>>>
> >>>> I've created an infographic from our ganglia graphs for the above
> >>>> scenario.
> >>>>
> >>>> https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead.png
> >>>>
> >>>> Attached is an excerpt from perf top indicating that the kernel routine
> >>>> taking the most time is _spin_lock_irqsave, if that means anything to
> >>>> anyone.
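> >>>>
> >>>> (A call-graph profile would show who is taking that lock; something
> >>>> like
> >>>>
> >>>>    perf top -g
> >>>>
> >>>> or a short recording with "perf record -a -g sleep 30" followed by
> >>>> "perf report" should list the callers of _spin_lock_irqsave.)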
> >>>>
> >>>>
> >>>> Things tested:
> >>>>
> >>>> It does not seem to matter whether we mount Lustre over InfiniBand or
> >>>> Ethernet.
> >>>>
> >>>> Filling the buffer cache with files from an NFS filesystem does not
> >>>> degrade performance.
> >>>>
> >>>> Filling the buffer cache with one large file does not degrade
> >>>> performance (tested with iozone).
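> >>>>
> >>>> (A single-large-file test of that kind might look like, for example,
> >>>>
> >>>>    iozone -i 0 -i 1 -s 20g -r 1m -f /path/on/lustre/iozone.tmp
> >>>>
> >>>> i.e. a sequential write and read of one ~20GB file; the path and sizes
> >>>> here are only placeholders.)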
> >>>>
> >>>>
> >>>> Again, any hints on how to proceed are greatly appreciated.
> >>>>
> >>>>
> >>>> Best regards,
> >>>> Roy.
> >>>>
> >>>>
> >>>>
--

  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
              phone:+47 77 64 41 07, fax:+47 77 64 41 00
        Roy Dragseth, Team Leader, High Performance Computing
         Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss