[lustre-discuss] ZFS backed OSS out of memory

Alexander I Kulyavtsev aik at fnal.gov
Thu Jun 23 15:06:34 PDT 2016


1) https://github.com/zfsonlinux/zfs/issues/2581
suggests a few things to monitor in /proc.  Searching for OOM at https://github.com/zfsonlinux/zfs/issues gives more hints on where to look.
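For example, a minimal watch loop over the ARC and slab counters (the paths below are the usual ZoL locations; adjust if your install differs):

   # rough sketch: watch ARC size/target and kernel slab usage every 5 seconds
   watch -n 5 '
     grep -E "^(size|c|c_max) " /proc/spl/kstat/zfs/arcstats
     grep -E "^(Slab|SReclaimable|SUnreclaim)" /proc/meminfo
   '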

I guess the OOM is not necessarily caused by zfs/spl.
Do you have lustre mounted on the OSS with some process writing to it (memory pressure)?

2)
> http://lustre.ornl.gov/ecosystem-2016/documents/tutorials/Stearman-LLNL-ZFS.pdf
Last three pages.
2a) it may be worth setting in /etc/modprobe.d/zfs.conf:
   options zfs zfs_prefetch_disable=1

2b) did you set metaslab_debug_unload? It increases memory consumption. You can check the current values of both parameters as shown below.
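A quick way to see what the running module actually uses (standard ZoL sysfs paths; names may differ slightly between releases):

   # read the live module parameters (0 is the default for both)
   cat /sys/module/zfs/parameters/zfs_prefetch_disable
   cat /sys/module/zfs/parameters/metaslab_debug_unload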

Can you correlate the OOM with some type of activity (read; write; scrub; snapshot delete)?
Do you actually re-read the same data? ARC only helps on the second read.
A 64GB in-memory ARC seems like a lot when you also have L2ARC on SSD.
lustre does not use the zfs slog IIRC.
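If you decide to cap the ARC, the usual knob is zfs_arc_max in /etc/modprobe.d/zfs.conf; the 32GB below is only an illustration, pick a value that fits your workload:

   # hypothetical example: cap ARC at 32 GiB (value is in bytes)
   options zfs zfs_arc_max=34359738368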

3) do you have the option to upgrade zfs?
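To see what you are running now (assuming the modules expose a version attribute and an RPM-based install like the RHEL 6 setup described):

   # report loaded module versions and installed packages
   cat /sys/module/zfs/version /sys/module/spl/version
   rpm -qa | egrep 'zfs|spl|lustre'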

4) you may set up monitoring and feed zfs and lustre stats into influxdb (on a monitoring node) with telegraf (on each OSS). Both are available at influxdata.com. I keep the DB on SSD. Plot the data with grafana, or query influxdb directly. 
> # fgrep plugins /etc/opt/telegraf/telegraf.conf
> ...
> [plugins]
> [[plugins.cpu]]
> [[plugins.disk]]
> [[plugins.io]]
> [[plugins.mem]]
> [[plugins.swap]]
> [[plugins.system]]
> [[plugins.zfs]]
> [[plugins.lustre2]]


5) drop caches with echo 3 > /proc/sys/vm/drop_caches.  If it helps, add it to cron to avoid OOM kills, e.g. the sketch below.
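A minimal cron sketch, assuming an /etc/cron.d style entry and a 15-minute interval (both arbitrary choices):

   # /etc/cron.d/drop-caches -- flush dirty data, then drop page cache, dentries and inodes
   */15 * * * * root /bin/sync && echo 3 > /proc/sys/vm/drop_caches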

Alex.

> Folks,
> 
> I've done my fair share of googling and run across some good information on ZFS backed Lustre tuning including this:
> 
> http://lustre.ornl.gov/ecosystem-2016/documents/tutorials/Stearman-LLNL-ZFS.pdf
> 
> and various discussions around how to limit (or not) the ARC and clear it if needed.
> 
> That being said, here is my configuration.
> 
> RHEL 6 
> Kernel 2.6.32-504.3.3.el6.x86_64
> ZFS 0.6.3
> Lustre 2.5.3 with a couple of patches
> Single OST per OSS with 4 x RAIDZ2 4TB SAS drives
> Log and Cache on separate SSDs
> These OSSes are beefy with 128GB of memory and Dual E5-2630 v2 CPUs
> 
> About 30 OSSes in all serving mostly a standard HPC cluster over FDR IB with a sprinkle of 10G
> 
> # more /etc/modprobe.d/lustre.conf
> options lnet networks=o2ib9,tcp9(eth0)
> 
> ZFS backed MDS with same software stack.
> 
> The problem I am having is the OOM killer is whacking away at system processes on a few of the OSSes. 
> 
> "top" shows all my memory is in use with very little Cache or Buffer usage.
> 
> Tasks: 1429 total,   5 running, 1424 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.0%us,  2.9%sy,  0.0%ni, 94.0%id,  3.1%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  132270088k total, 131370888k used,   899200k free,     1828k buffers
> Swap: 61407100k total,     7940k used, 61399160k free,    10488k cached
> 
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>   47 root      RT   0     0    0    0 S 30.0  0.0 372:57.33 migration/11
> 
> I had done zero tuning so I am getting the default ARC size of 1/2 the memory.
> 
> [root@lzfs18b ~]# arcstat.py 1
>    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
> 09:11:50     0     0      0     0    0     0    0     0    0    63G   63G
> 09:11:51  6.2K  2.6K     41   206    6  2.4K   71     0    0    63G   63G
> 09:11:52   21K  4.0K     18   305    2  3.7K   34    18    0    63G   63G
> 
> The question is, if I have 128GB of RAM and ARC is only taking 63, where did the rest go and how can I get it back so that the OOM killer stops killing me?
> 
> Thanks!
> 
> Tim
> 
> 
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


