[lustre-discuss] ZFS backed OSS out of memory

Carlson, Timothy S Timothy.Carlson at pnnl.gov
Fri Jun 24 15:08:32 PDT 2016


Alex,

Answers to your questions.

1) I'll do more looking on github for OOM problems. I'll post if I find a resolution that fits my problem. I do not have Lustre mounted on the OSS.

2) We have not set any of those parameters, but they do look interesting and possibly related. The OOM seems to be related to heavy read activity from the cluster, but we haven't fully confirmed that correlation. There are no active scrubs or snapshots when we see the OOM killer fire.

3) We're looking at possible upgrade paths to see where the stars would align on a happy Lustre/ZFS combination that isn't too different from what we have now.

4) We are going to add more monitoring to our ZFS config. The standard "arcstat.py" really shows nothing interesting other than that our 64GB ARC is always full.
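
For reference, we'll probably also start dumping the raw arcstats next to arcstat.py's summary. A rough sketch of what we have in mind (field names shift a bit between ZoL releases; values are in bytes):
   grep -E '^(size|c_max|hdr_size|data_size|other_size|arc_meta_used|arc_meta_limit) ' /proc/spl/kstat/zfs/arcstats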

5) The "drop caches" has no impact. If you see my "top" output, there is nothing in the cache.

Thanks for the pointers. We'll keep investigating and probably implement a couple of the settings in (2).

Tim

-----Original Message-----
From: Alexander I Kulyavtsev [mailto:aik at fnal.gov] 
Sent: Thursday, June 23, 2016 3:07 PM
To: Carlson, Timothy S <Timothy.Carlson at pnnl.gov>
Cc: Alexander I Kulyavtsev <aik at fnal.gov>; lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] ZFS backed OSS out of memory

1) https://github.com/zfsonlinux/zfs/issues/2581
suggests a few things to monitor in /proc. Searching for OOM at https://github.com/zfsonlinux/zfs/issues gives more hints on where to look.

I guess the OOM is not necessarily caused by zfs/spl.
Do you have Lustre mounted on the OSS with some process writing to it (memory pressure)?
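
A minimal sketch of what I would sample periodically on an OSS (paths as on ZoL/SPL 0.6.x):
   grep -E 'MemFree|Buffers|Cached|Slab|SUnreclaim|Dirty' /proc/meminfo
   grep -E '^(size|c|c_max) ' /proc/spl/kstat/zfs/arcstats   # ARC target vs. actual size, in bytes
   cat /proc/spl/kmem/slab                                   # SPL slab caches, held outside the page cache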

2)
> http://lustre.ornl.gov/ecosystem-2016/documents/tutorials/Stearman-LLNL-ZFS.pdf
Last three pages.
2a) it may be worth setting in /etc/modprobe.d/zfs.conf:
   options zfs zfs_prefetch_disable=1
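
While you are editing that file, you could also cap the ARC explicitly instead of relying on the 1/2-of-RAM default. Just a sketch; the 48 GiB below is an arbitrary example, size it to your workload:
   options zfs zfs_arc_max=51539607552   # 48 GiB = 48 * 2^30 bytes
On newer ZoL releases the same knob can be changed at runtime via /sys/module/zfs/parameters/zfs_arc_max; I am not sure how well 0.6.3 honors a runtime change.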

2b) did you set metaslab_debug_unload? It increases memory consumption.
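If you are not sure, the current value can be read from sysfs (assuming your ZoL build exposes the parameter):
   cat /sys/module/zfs/parameters/metaslab_debug_unload   # 0 = metaslabs may be unloaded (default), 1 = kept loaded, more memory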

Can you correlate the OOM with some type of activity (read, write, scrub, snapshot delete)?
Do you actually re-read the same data? The ARC only helps on the second read.
A 64GB in-memory ARC seems like a lot when you also have an L2ARC on SSD.
Lustre does not use the ZFS slog, IIRC.
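Also keep in mind the L2ARC is not free: every block cached on the SSD needs a header held in RAM, which is charged against the ARC. Roughly (arcstats field names as on ZoL 0.6.x):
   grep -E '^(l2_size|l2_hdr_size|arc_meta_used) ' /proc/spl/kstat/zfs/arcstats   # bytes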

3) do you have the option to upgrade ZFS?

4) you may set up monitoring and feed ZFS and Lustre stats to InfluxDB (on a monitoring node) with telegraf (on each OSS). Both are at influxdata.com. I have the DB on SSD. Plot the data with Grafana, or query InfluxDB directly. 
> # fgrep plugins /etc/opt/telegraf/telegraf.conf ...
> [plugins]
> [[plugins.cpu]]
> [[plugins.disk]]
> [[plugins.io]]
> [[plugins.mem]]
> [[plugins.swap]]
> [[plugins.system]]
> [[plugins.zfs]]
> [[plugins.lustre2]]
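
Once data is flowing you can query it directly with the influx CLI; a rough sketch, assuming the default "telegraf" database (measurement names vary with the telegraf version, SHOW MEASUREMENTS tells you what you actually have):
   influx -database telegraf -execute 'SHOW MEASUREMENTS'
   influx -database telegraf -execute 'SELECT * FROM /zfs/ WHERE time > now() - 1h LIMIT 10'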


5) drop caches with echo 3 > /proc/sys/vm/drop_caches. If it helps, add it to cron to avoid OOM kills.
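If it does turn out to help, the cron entry can be as simple as this (hypothetical 30-minute interval, adjust to taste):
   # /etc/cron.d/drop-caches -- run as root every 30 minutes
   */30 * * * * root /bin/sync && /bin/echo 3 > /proc/sys/vm/drop_caches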

Alex.

> Folks,
> 
> I've done my fair share of googling and have run across some good information on ZFS-backed Lustre tuning, including this:
> 
> http://lustre.ornl.gov/ecosystem-2016/documents/tutorials/Stearman-LLNL-ZFS.pdf
> 
> and various discussions around how to limit (or not) the ARC and clear it if needed.
> 
> That being said, here is my configuration.
> 
> RHEL 6
> Kernel 2.6.32-504.3.3.el6.x86_64
> ZFS 0.6.3
> Lustre 2.5.3 with a couple of patches
> Single OST per OSS with 4 x RAIDZ2 of 4TB SAS drives; log and cache on separate SSDs.
> These OSSes are beefy with 128GB of memory and dual E5-2630 v2 CPUs.
> 
> About 30 OSSes in all, serving mostly a standard HPC cluster over FDR IB with a sprinkle of 10G.
> 
> # more /etc/modprobe.d/lustre.conf
> options lnet networks=o2ib9,tcp9(eth0)
> 
> ZFS backed MDS with same software stack.
> 
> The problem I am having is that the OOM killer is whacking away at system processes on a few of the OSSes.
> 
> "top" shows all my memory is in use with very little Cache or Buffer usage.
> 
> Tasks: 1429 total,   5 running, 1424 sleeping,   0 stopped,   0 zombie
> Cpu(s):  0.0%us,  2.9%sy,  0.0%ni, 94.0%id,  3.1%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  132270088k total, 131370888k used,   899200k free,     1828k buffers
> Swap: 61407100k total,     7940k used, 61399160k free,    10488k cached
> 
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>   47 root      RT   0     0    0    0 S 30.0  0.0 372:57.33 migration/11
> 
> I had done zero tuning, so I am getting the default ARC size of 1/2 of memory.
> 
> [root at lzfs18b ~]# arcstat.py 1
>    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
> 09:11:50     0     0      0     0    0     0    0     0    0    63G   63G
> 09:11:51  6.2K  2.6K     41   206    6  2.4K   71     0    0    63G   63G
> 09:11:52   21K  4.0K     18   305    2  3.7K   34    18    0    63G   63G
> 
> The question is: if I have 128GB of RAM and the ARC is only taking 63GB, where did the rest go, and how can I get it back so that the OOM killer stops killing me?
> 
> Thanks!
> 
> Tim
> 
> 
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


