[lustre-discuss] Lustre 2.10.3 on ZFS - slow read performance

Alex Vodeyko alex.vodeyko at gmail.com
Tue Mar 27 12:55:23 PDT 2018


Hi,

I'm setting up a new Lustre test system with the following hardware config:
- 2x servers (dual E5-2650v3, 128GB RAM): one MGS/MDS, one OSS
- 1x HGST 4U60G2 JBOD with 60x 10TB HUH721010AL5204 drives (4k
physical, 512b logical sector size), connected to the OSS via an LSI 9300-8e

Lustre 2.10.3 servers/clients (CentOS 7.4), ZFS 0.7.5 and also 0.7.7

Initially I planned to use either 2 zpools with three 8+2 vdevs each,
or 1 zpool with six 8+2 vdevs.

I created the zpool with:
"zpool create -o multihost=on   -O canmount=off  -O recordsize=1024K
-O compression=off  -o cachefile=none  -o ashift=12  l2oss1 raidz2
d0..d9 raidz2 d10..d19 raidz2 d20..d29"
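The OST was then formatted on top of this pool roughly as follows (the
fsname, index, MGS NID and dataset name below are placeholders, not
necessarily the exact values used):

  mkfs.lustre --ost --backfstype=zfs --fsname=l1 --index=0 \
    --mgsnode=<mgs_nid> l2oss1/ost0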

Benchmarking showed poor read performance:
1) obdfilter-survey
Obdfilter-survey for case=disk from oss1
ost  1 sz 163840000K rsz 1024K obj    1 thr    1 write 2092.43 [
785.76, 3200.18] rewrite 2154.08 [ 800.56, 3033.77] read  525.03 [
70.00, 2048.87]
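
The survey above was invoked roughly along these lines (the target name
and the size/object/thread ranges here are illustrative, not the exact
values used):

  targets="l1-OST0000" size=163840 nobjlo=1 nobjhi=1 thrlo=1 thrhi=32 \
    case=disk obdfilter-survey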

2) IOR single thread (./ior -a POSIX -F -rw -e -b 128g -t 1m -i 1 -o /l1/tmp/):
access    bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------    ---------  ----------  ---------  --------  --------  --------  --------  ----
write     679.45     134217728   1024.00    0.001474  192.91    0.000845  192.91    0
read      195.23     134217728   1024.00    0.001215  671.38    0.000871  671.38    0
remove    -          -           -          -         -         -         20.52     0

3) iozone also showed ~2 GB/s writes and ~0.8 GB/s reads
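The iozone run was just a single-threaded sequential write+read of one
large file, something like this (file path and sizes are illustrative):

  iozone -i 0 -i 1 -e -r 1m -s 128g -f /l1/tmp/iozone.tmp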

So reads are 3-5 times slower than writes.

Then I tried other zpool configs: two 8+2 raidz2 vdevs, four 13+2
raidz2 vdevs, six 8+2 raidz2 vdevs, etc. - same problem, reads are
~5 times slower than writes. The only exception is a small pool
(e.g. a single 8+2 raidz2 vdev) - then reads are closer to writes.
I tried both zfs-0.7.5 and zfs-0.7.7 - neither helped.
I also tried a simple striped pool with just 15 drives (creation
command sketched below) - did not help:
(Obdfilter-survey for case=disk from oss1
ost  1 sz 167936000K rsz 1024K obj    1 thr    1 write 2982.54
[2586.11, 3079.11] rewrite 2875.67 [1416.86, 3203.13] read  159.22 [
61.99,  481.96])
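
The striped pool was created the same way as the raidz2 pool above,
just without the raidz2 groups (drive aliases are illustrative):

  zpool create -o multihost=on -O canmount=off -O recordsize=1024K \
    -O compression=off -o cachefile=none -o ashift=12 l2oss1 d0 d1 ... d14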

Example "zpool iostat -v 5" during reads:

              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
l2oss1   976G   135T    184     15   183M  60.8K
  sdb       65.6G  9.00T     11      0  11.2M  2.40K
  sdc       64.9G  9.00T     12      1  12.4M  6.40K
  sdd       65.0G  9.00T     12      0  13.0M  3.20K
  sde       64.8G  9.00T     12      0  12.2M  3.20K
  sdf       64.7G  9.00T     12      1  12.4M  4.80K
  sdg       64.8G  9.00T     12      0  12.8M  2.40K
  sdh       66.1G  9.00T     12      0  12.6M  2.40K
  sdi       64.6G  9.00T     12      1  11.8M  5.60K
  sdj       64.6G  9.00T     11      0  11.2M  4.00K
  sdk       64.6G  9.00T     12      0  12.8M  2.40K
  sdaa      66.1G  9.00T     11      0  11.4M  4.00K
  sdab      65.0G  9.00T     11      1  11.8M  5.60K
  sdac      64.8G  9.00T     12      1  12.4M  6.40K
  sdad      65.6G  9.00T     12      0  12.6M  4.00K
  sdae      65.3G  9.00T     12      0  12.6M  4.00K
----------  -----  -----  -----  -----  -----  -----

"top" shows only 8 "z_rd_int" processes - and only one "z_rd_int"
running (while there were 32 running z_wr_iss processes during write,
wait io = 2-4% in both reads/writes)
Tried with prefetch_disable=1 - didnot help.
Tried vdevs from different drives (even on separate expanders) - didnot help.
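For clarity, prefetch was disabled via the ZFS module parameter, i.e.:

  echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
  cat /sys/module/zfs/parameters/zfs_prefetch_disable    # reads back as 1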

sgpdd-survey showed all drives in the range of 220-240 MB/s.
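For reference, sgpdd-survey was run against the raw sg devices roughly
like this (the device list and the size/region/thread ranges are
illustrative, not the exact ones used):

  size=8192 crglo=1 crghi=16 thrlo=1 thrhi=64 \
    scsidevs="/dev/sg2 /dev/sg3 /dev/sg4" sgpdd-survey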

I also tried leaving only one SAS channel to the JBOD, with no multipath and no vdev aliases - did not help.

So I'm completely stuck.
Could you please help?

Thank you very much in advance,
Alex

