[Lustre-discuss] Lustre buffer cache causes large system overhead.

Dragseth Roy Einar roy.dragseth at uit.no
Thu Aug 22 08:38:16 PDT 2013


Yes, we have also started emptying the BC on job startup, but it doesn't seem 
to cover all cases.  We see similar symptoms for applications using netcdf even 
if we drop the BC at job startup.  The application writes a netcdf file at 
300-500 MB/s for 3-5 secs; then, after the IO is done, the client spends 
100% of its time in _spin_lock_irqsave for up to a minute.  The data has 
clearly left the client, as no IB traffic is detected during or after the 
spin_lock period, until the application has completed a new time step and 
writes a new data chunk.  The application uses approx. 1.2 GB per core, so the 
scenario is quite similar to the synthetic one I reported.
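
For reference, the emptying we do at job startup is essentially a one-liner 
in the batch prolog.  A minimal sketch, assuming a Slurm-style prolog script 
(the exact hook and path depend on your batch system):

  #!/bin/sh
  # Illustrative prolog snippet: flush dirty pages, then ask the
  # kernel to drop the page cache, dentries and inodes, so a stale
  # Lustre buffer cache does not compete with the job's memory.
  sync
  echo 3 > /proc/sys/vm/drop_caches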

r.


On Thursday 22. August 2013 07.40.01 you wrote:
> FWIW, we have seen the same issues with Lustre 1.8.x and slightly older
> RHEL6 kernel.  We do the "echo" as part of our slurm prolog/epilog scripts.
> Not a fix but a workaround before/after jobs run.  No swap activity, but
> very large buffer cache in use.
> 
> Tim
> 
> -----Original Message-----
> From: lustre-discuss-bounces at lists.lustre.org
> [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of Roger Sersted
> Sent: Thursday, August 22, 2013 7:22 AM
> To: lustre-discuss at lists.lustre.org
> Cc: Roy Dragseth
> Subject: Re: [Lustre-discuss] Lustre buffer cache causes large system
> overhead.
> 
> 
> 
> 
> Is this slowdown due to increased swap activity?  If "yes", then try
> lowering the "swappiness" value.  This will sacrifice buffer cache space to
> lower swap activity.
> 
> Take a look at http://en.wikipedia.org/wiki/Swappiness.
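> 
> A quick way to inspect and lower it (a sketch; 10 is just an example
> value, the RHEL6 default is 60):
> 
>   cat /proc/sys/vm/swappiness
>   sysctl -w vm.swappiness=10   # or: echo 10 > /proc/sys/vm/swappiness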
> 
> Roger S.
> 
> On 08/22/2013 08:51 AM, Roy Dragseth wrote:
> > We have just discovered that a large buffer cache generated from
> > traversing a Lustre file system will cause significant system
> > overhead for applications with high memory demands.  We have seen a
> > 50% slowdown or worse for applications.  Even High Performance
> > Linpack, which has no file IO whatsoever, is affected.  The only
> > remedy seems to be to empty the buffer cache by running "echo 3 >
> > /proc/sys/vm/drop_caches".
> > 
> > Any hints on how to improve the situation are greatly appreciated.
> > 
> > 
> > System setup:
> > Client: Dual-socket Sandy Bridge, with 32 GB RAM and an InfiniBand
> > connection to the Lustre servers.  CentOS 6.4, with kernel
> > 2.6.32-358.11.1.el6.x86_64 and Lustre v2.1.6 rpms downloaded from
> > the Whamcloud download site.
> > 
> > Lustre: 1 MDS and 4 OSSes running Lustre 2.1.3 (also from the
> > Whamcloud site).  Each OSS has 12 OSTs, 1.1 PB storage in total.
> > 
> > How to reproduce:
> > 
> > Traverse the Lustre file system until the buffer cache is large
> > enough.  In our case we run
> > 
> >   find . -type f -print0 | xargs -0 cat > /dev/null
> > 
> > on the client until the buffer cache reaches ~15-20 GB.  (The Lustre
> > file system has lots of small files, so this takes up to an hour.)
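> > 
> > (One way to watch the cache grow while the find runs, e.g.:
> > 
> >   watch -n 10 'grep -E "^(Buffers|Cached):" /proc/meminfo'
> > 
> > or simply "free -g" and look at the cached column.)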
> > 
> > Kill the find process and start a single-node parallel application;
> > we use HPL (High Performance Linpack).  We run on all 16 cores of
> > the system with 1 GB RAM per core (a normal run should complete in
> > approx. 150 seconds).  The system monitoring shows 10-20% system CPU
> > overhead, and the HPL run takes more than 200 secs.  After running
> > "echo 3 > /proc/sys/vm/drop_caches" the system performance goes back
> > to normal, with a run time of 150 secs.
> > 
> > I've created an infographic from our ganglia graphs for the above
> > scenario.
> > 
> > https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead.png
> > 
> > Attached is an excerpt from perf top indicating that the kernel
> > routine taking the most time is _spin_lock_irqsave, if that means
> > anything to anyone.
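> > 
> > (For anyone who wants to reproduce the profile: simply running
> > 
> >   perf top
> > 
> > on the client during the HPL run should show the same symbol at the
> > top of the list.)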
> > 
> > 
> > Things tested:
> > 
> > It does not seem to matter whether we mount Lustre over InfiniBand
> > or Ethernet.
> > 
> > Filling the buffer cache with files from an NFS file system does not
> > degrade performance.
> > 
> > Filling the buffer cache with one large file does not degrade
> > performance either (tested with iozone).
> > 
> > 
> > Again, any hints on how to proceed are greatly appreciated.
> > 
> > 
> > Best regards,
> > Roy.
> > 
> > 
> > 
> 
-- 

  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
	      phone:+47 77 64 41 07, fax:+47 77 64 41 00
        Roy Dragseth, Team Leader, High Performance Computing
	 Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no


