[Lustre-discuss] Lustre buffer cache causes large system overhead.

Thu Aug 22 06:51:32 PDT 2013

We have just discovered that a large buffer cache generated from traversing a 
Lustre file system causes significant system overhead for applications 
with high memory demands.  We have seen slowdowns of 50% or worse for 
applications.  Even High Performance Linpack, which does no file I/O 
whatsoever, is affected.  The only remedy seems to be to empty the buffer cache 
from memory by running "echo 3 > /proc/sys/vm/drop_caches".

Any hints on how to improve the situation are greatly appreciated.

System setup:
Client: Dual-socket Sandy Bridge with 32 GB RAM and an InfiniBand connection to 
the Lustre server.  CentOS 6.4 with kernel 2.6.32-358.11.1.el6.x86_64 and 
Lustre v2.1.6 RPMs downloaded from the Whamcloud download site.

Lustre: 1 MDS and 4 OSSes running Lustre 2.1.3 (also from the Whamcloud site).  
Each OSS has 12 OSTs, for a total of 1.1 PB of storage.

How to reproduce:

Traverse the lustre file system until the buffer cache is large enough.  In our 
case we run

 find . -type f -print0 | xargs -0 cat > /dev/null

on the client until the buffer cache reaches ~15-20GB.  (The lustre file system 
has lots of small files so this takes up to an hour.)
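
A minimal sketch of how the cache size could be watched while the traversal 
runs.  The function names are made up for illustration, and it assumes that the 
"Cached:" field of /proc/meminfo is the figure of interest (the buffer cache 
size quoted above may have been read from a different tool):

```shell
#!/bin/sh
# Print the page cache size in kB from a meminfo-style file.
# Assumption: the "Cached:" field is the number being watched.
# The optional argument lets the function be pointed at a saved copy.
cached_kb() {
    awk '$1 == "Cached:" { print $2 }' "${1:-/proc/meminfo}"
}

# Succeed (exit status 0) once the cache holds at least $1 GB.
# $2 optionally names the meminfo-style file to read.
cache_at_least_gb() {
    [ "$(cached_kb "${2:-/proc/meminfo}")" -ge $(( $1 * 1024 * 1024 )) ]
}
```

Polling with something like "until cache_at_least_gb 15; do sleep 60; done" 
would then signal when it is time to stop the traversal.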

Kill the find process and start a single-node parallel application; we use HPL 
(High Performance Linpack).  We run on all 16 cores of the system with 1 GB RAM 
per core (a normal run should complete in approximately 150 seconds).  The 
system monitoring shows a 10-20% system CPU overhead and the HPL run takes more 
than 200 seconds.  After running "echo 3 > /proc/sys/vm/drop_caches" the system 
performance returns to normal, with a run time of 150 seconds.
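
The before/after comparison could be scripted roughly as follows.  This is only 
a sketch: elapsed_secs and slowdown_pct are hypothetical helper names, and the 
actual HPL launch line is not shown here because it depends on the local MPI 
setup.

```shell
#!/bin/sh
# Run a command and print its wall-clock time in whole seconds.
# The command's own output is discarded.
elapsed_secs() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $(( end - start ))
}

# Print the increase in run time, in percent, of a degraded run ($2)
# relative to a baseline run ($1).  Both arguments are in seconds.
slowdown_pct() {
    awk -v base="$1" -v slow="$2" \
        'BEGIN { printf "%.1f\n", (slow - base) / base * 100 }'
}
```

With the numbers above, "slowdown_pct 150 200" gives 33.3, i.e. a 200 second 
run is about a third longer than the 150 second baseline.  Timing one run with 
the cache full, dropping the caches as root, and timing a second run would 
supply the two arguments.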

I've created an infographic from our ganglia graphs for the above scenario.

https://dl.dropboxusercontent.com/u/23468442/misc/lustre_bc_overhead.png

Attached is an excerpt from perf top.  It indicates that the kernel routine 
taking the most time is _spin_lock_irqsave, if that means anything to anyone.
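
To get a feel for how much of the profile is kernel time overall, the kernel 
entries in a saved copy of the perf top output can be summed.  A small sketch, 
assuming the column layout of the attached excerpt (percentage, object, 
"[k]" or "[.]", symbol); kernel_share is a made-up name:

```shell
#!/bin/sh
# Sum the percentages of all kernel ("[k]") entries in perf top
# output read from standard input.  Lines that do not have "[k]"
# in the third column (user-space symbols, the header) are skipped.
kernel_share() {
    awk '$3 == "[k]" { sub(/%/, "", $1); total += $1 }
         END { printf "%.2f\n", total }'
}
```

Usage would be along the lines of "kernel_share < perf_top.txt", where 
perf_top.txt is a placeholder for wherever the output was saved.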

Things tested:

It does not seem to matter if we mount lustre over infiniband or ethernet.

Filling the buffer cache with files from an NFS filesystem does not degrade 
performance.

Filling the buffer cache with one large file does not degrade performance 
either (tested with iozone).

Again, any hints on how to proceed are greatly appreciated.

Best regards,
Roy.

-- 

  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
	      phone:+47 77 64 41 07, fax:+47 77 64 41 00
        Roy Dragseth, Team Leader, High Performance Computing
	 Direct call: +47 77 64 62 56. email: roy.dragseth at uit.no
-------------- next part --------------
Samples: 6M of event 'cycles', Event count (approx.): 634546877255
 62.19%  libmkl_avx.so                [.] mkl_blas_avx_dgemm_kernel_0
 13.30%  mca_btl_sm.so                [.] mca_btl_sm_component_progress
  8.80%  libmpi.so.1.0.3              [.] opal_progress
  5.29%  [kernel]                     [k] _spin_lock_irqsave
  1.41%  libmkl_avx.so                [.] mkl_blas_avx_dgemm_copyan
  1.17%  mca_pml_ob1.so               [.] mca_pml_ob1_progress
  0.88%  libmkl_avx.so                [.] mkl_blas_avx_dtrsm_ker_ruu_a4_b8
  0.41%  [kernel]                     [k] compaction_alloc
  0.38%  [kernel]                     [k] _spin_lock_irq
  0.36%  mca_pml_ob1.so               [.] opal_progress@plt
  0.33%  xhpl                         [.] HPL_dlaswp06T
  0.28%  libmkl_avx.so                [.] mkl_blas_avx_dgemm_copybt
  0.24%  mca_pml_ob1.so               [.] mca_pml_ob1_send
  0.18%  [kernel]                     [k] _spin_lock
  0.17%  [kernel]                     [k] __mem_cgroup_commit_charge
  0.16%  [kernel]                     [k] mem_cgroup_lru_del_list
  0.16%  [kernel]                     [k] putback_lru_page
  0.16%  [kernel]                     [k] __mem_cgroup_uncharge_common
  0.15%  xhpl                         [.] HPL_dlatcpy
  0.15%  xhpl                         [.] HPL_dlaswp01T
  0.15%  [kernel]                     [k] clear_page_c
  0.15%  xhpl                         [.] HPL_dlaswp10N
  0.13%  [kernel]                     [k] list_del
  0.13%  [kernel]                     [k] free_hot_cold_page
  0.13%  [kernel]                     [k] free_pcppages_bulk
  0.13%  [kernel]                     [k] release_pages
  0.13%  mca_pml_ob1.so               [.] mca_pml_ob1_recv
  0.12%  [kernel]                     [k] ____pagevec_lru_add
  0.12%  [kernel]                     [k] copy_user_generic_string
  0.12%  [kernel]                     [k] compact_zone
  0.10%  xhpl                         [.] __intel_ssse3_rep_memcpy
  0.10%  [kernel]                     [k] __list_add
  0.10%  [kernel]                     [k] lookup_page_cgroup
  0.09%  [kernel]                     [k] mem_cgroup_end_migration
  0.08%  [kernel]                     [k] mem_cgroup_prepare_migration
  0.08%  [kernel]                     [k] get_pageblock_flags_group
  0.08%  [kernel]                     [k] page_waitqueue
  0.07%  [kernel]                     [k] migrate_pages
  0.07%  [kernel]                     [k] __wake_up_bit
  0.07%  [kernel]                     [k] get_page
  0.07%  [kernel]                     [k] unlock_page
  0.07%  [kernel]                     [k] mem_cgroup_lru_add_list
  0.06%  [kernel]                     [k] page_fault
  0.06%  [kernel]                     [k] __alloc_pages_nodemask
  0.06%  [kernel]                     [k] put_page
  0.06%  [kernel]                     [k] compact_checklock_irqsave