[Lustre-discuss] Lustre buffer cache causes large system overhead.
Dragseth Roy Einar
roy.dragseth at uit.no
Sat Aug 24 01:08:06 PDT 2013
The kernel docs for zone_reclaim_mode indicate that a value of 0 makes sense
on dedicated file servers like an MDS/OSS, since fetching cached data from
another NUMA domain is much faster than going all the way to disk. For
clients that need the memory for computations, a value of 1 seems to be the
way to go, as (I guess) it reduces the cross-domain traffic.
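For reference, a quick way to inspect and flip the setting (a minimal sketch
assuming the standard procfs path; the commented sysctl lines need root):

```shell
# Read the current NUMA reclaim policy (0 = fall back to other NUMA
# nodes for page-cache allocations; 1 = reclaim local pages first).
mode=$(cat /proc/sys/vm/zone_reclaim_mode 2>/dev/null || echo unavailable)
echo "zone_reclaim_mode=$mode"

# To change it at runtime (as root):
#   sysctl -w vm.zone_reclaim_mode=1
# To make it persistent across reboots, add to /etc/sysctl.conf:
#   vm.zone_reclaim_mode = 1
```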
r.
On Friday 23. August 2013 13.59.44 Patrick Shopbell wrote:
> Hi all -
> I have watched this thread with much interest, and now I am even
> more interested/confused. :-)
>
> Several months back, we had a very substantial slowdown on our
> MDS box. Interactive use of the box was very sluggish, even
> though the load was quite low. This was eventually solved by
> setting the opposite value for the variable in question:
>
> vm.zone_reclaim_mode = 0
>
> And it was equally dramatic in its solution of our problem - the MDS
> started responding normally immediately afterwards. We went ahead
> and set the value to zero on all of our NUMA machines. (We are
> running Lustre 2.3.)
>
> Clearly, I need to do some reading on Lustre and its various caching
> issues. This has been a quite interesting discussion.
>
> Thanks everyone for such a great list.
> --
> Patrick
>
> *--------------------------------------------------------------------*
>
> | Patrick Shopbell Department of Astronomy |
> | pls at astro.caltech.edu Mail Code 249-17 |
> | (626) 395-4097 California Institute of Technology |
> | (626) 568-9352 (FAX) Pasadena, CA 91125 |
> | WWW: http://www.astro.caltech.edu/~pls/ |
>
> *--------------------------------------------------------------------*
>
> On 8/23/13 1:08 PM, Dragseth Roy Einar wrote:
> > Thanks for the suggestion! It didn't help, but as I read the
> > documentation on vfs_cache_pressure in the kernel docs I noticed the next
> > parameter, zone_reclaim_mode, which looked like it might be worth
> > fiddling with. And what do you know, changing it from 0 to 1 made the
> > system overhead vanish immediately!
> >
> > I must admit I do not completely understand why this helps, but it seems
> > to do the trick in my case. We'll put
> >
> > vm.zone_reclaim_mode = 1
> >
> > into /etc/sysctl.conf from now on.
> >
> > Thanks to all for the hints and comments on this.
> >
> > A nice weekend to everyone, mine for sure is going to be...
> > r.
> >
> > On Friday 23. August 2013 09.36.34 Scott Nolin wrote:
> >> You might also try increasing the vfs_cache_pressure.
> >>
> >> This will reclaim inode and dentry caches faster. Maybe that's the
> >> problem, not page caches.
> >>
> >> To be clear - I have no deep insight into Lustre's use of the client
> >> cache, but you said you have lots of small files, which, if Lustre uses
> >> the cache system like other filesystems do, means it may be
> >> inodes/dentries. Filling up the page cache with files as you did in
> >> your other tests wouldn't have the same effect. Just my guess here.
> >>
> >> We had some experience years ago with the opposite sort of problem. We
> >> have a big FTP server, and we want to *keep* inode/dentry data in the
> >> Linux cache, as there are often stupid numbers of files in directories.
> >> Files were always flowing through the server, so the page cache would
> >> force out the inode cache. I was surprised to find that Linux has no
> >> way to set a fixed inode cache size - the best you can do is "suggest"
> >> with the cache-pressure tunable.
> >>
> >> Scott
> >>
> >> On 8/23/2013 6:29 AM, Dragseth Roy Einar wrote:
> >>> I tried changing swappiness from 0 to 95 but it did not have any
> >>> impact on the system overhead.
> >>>
> >>> r.
> >>>
> >>> On Thursday 22. August 2013 15.38.37 Dragseth Roy Einar wrote:
> >>>> No, I cannot detect any swap activity on the system.
> >>>>
> >>>> r.
> >>>>
> >>>> On Thursday 22. August 2013 09.21.33 you wrote:
> >>>>> Is this slowdown due to increased swap activity? If "yes", then try
> >>>>> lowering the "swappiness" value. This will sacrifice buffer cache
> >>>>> space to lower swap activity.
> >>>>>
> >>>>> Take a look at http://en.wikipedia.org/wiki/Swappiness.
> >>>>>
> >>>>> Roger S.
> >>>>>
> >>>>> On 08/22/2013 08:51 AM, Roy Dragseth wrote:
> >>>>>> We have just discovered that a large buffer cache generated by
> >>>>>> traversing a Lustre file system causes significant system overhead
> >>>>>> for applications with high memory demands. We have seen a 50%
> >>>>>> slowdown or worse. Even High Performance Linpack, which does no
> >>>>>> file I/O whatsoever, is affected. The only remedy seems to be to
> >>>>>> empty the buffer cache by running
> >>>>>> "echo 3 > /proc/sys/vm/drop_caches".
> >>>>>>
> >>>>>> Any hints on how to improve the situation are greatly appreciated.
> >>>>>>
> >>>>>>
> >>>>>> System setup:
> >>>>>> Client: dual-socket Sandy Bridge with 32 GB RAM and an InfiniBand
> >>>>>> connection to the Lustre servers. CentOS 6.4 with kernel
> >>>>>> 2.6.32-358.11.1.el6.x86_64 and Lustre v2.1.6 RPMs downloaded from
> >>>>>> the Whamcloud download site.
> >>>>>>
> >>>>>> Lustre: 1 MDS and 4 OSSes running Lustre 2.1.3 (also from the
> >>>>>> Whamcloud site). Each OSS has 12 OSTs, 1.1 PB storage in total.
> >>>>>>
> >>>>>> How to reproduce:
> >>>>>>
> >>>>>> Traverse the Lustre file system until the buffer cache is large
> >>>>>> enough. In our case we run
> >>>>>>
> >>>>>> find . -type f -print0 | xargs -0 cat > /dev/null
> >>>>>>
> >>>>>> on the client until the buffer cache reaches ~15-20 GB. (The Lustre
> >>>>>> file system has lots of small files, so this takes up to an hour.)
> >>>>>>
> >>>>>> Kill the find process and start a single-node parallel application;
> >>>>>> we use HPL (High Performance Linpack). We run on all 16 cores of
> >>>>>> the system with 1 GB RAM per core (a normal run should complete in
> >>>>>> approx. 150 seconds). The system monitoring shows 10-20% system CPU
> >>>>>> overhead and the HPL run takes more than 200 seconds. After running
> >>>>>> "echo 3 > /proc/sys/vm/drop_caches" the system performance goes
> >>>>>> back to normal, with a run time of 150 seconds.
> >>>>>>
> >>>>>> I've created an infographic from our ganglia graphs for the above
> >>>>>> scenario.
> >>>>>>
> >>>>>> https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead.png
> >>>>>>
> >>>>>> Attached is an excerpt from perf top indicating that the kernel
> >>>>>> routine taking the most time is _spin_lock_irqsave, if that means
> >>>>>> anything to anyone.
> >>>>>>
> >>>>>>
> >>>>>> Things tested:
> >>>>>>
> >>>>>> It does not seem to matter if we mount lustre over infiniband or
> >>>>>> ethernet.
> >>>>>>
> >>>>>> Filling the buffer cache with files from an NFS filesystem does not
> >>>>>> degrade
> >>>>>> performance.
> >>>>>>
> >>>>>> Filling the buffer cache with one large file does not give degraded
> >>>>>> performance. (tested with iozone)
> >>>>>>
> >>>>>>
> >>>>>> Again, any hints on how to proceed is greatly appreciated.
> >>>>>>
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Roy.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Lustre-discuss mailing list
> >>>>>> Lustre-discuss at lists.lustre.org
> >>>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
--
The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no