[lustre-discuss] Avoiding system cache when using ssd pfl extent

Andreas Dilger adilger at whamcloud.com
Fri May 20 16:02:07 PDT 2022

In this particular case I can answer your question in detail.

Before SFAOS 12.1 (IIRC), the /sys/block/*/queue/rotational setting was set from userspace at mount time via a udev script, and the Lustre detection of "rotational=0" could be racy.  Newer versions of SFAOS (12.1+) set the rotational state in the SCSI VPD page and this is detected directly by the kernel.

For EXAScaler systems that may be running older SFAOS releases, there was a patch made (included in 2.12.6-ddn72/EXA5.2.5) that revalidates the rotational device state occasionally in case it has been modified after mount time, and uses that to update the read_cache_enable and writethrough_cache_enable tunables *if they have not been explicitly set*.
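
As a rough illustration of the relationship described in this thread (the caches default to 0 on a non-rotational device and to 1 on a rotational one), the sketch below prints the rotational flag of each block device and the cache default that flag would imply. It is not the Lustre detection code; SYSFS_BLOCK and expected_cache_default are hypothetical names used only for this example.

```shell
#!/bin/sh
# SYSFS_BLOCK is a hypothetical override so the function can also be pointed
# at a test directory; by default it reads the real sysfs tree.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

# Print the cache default implied by a device's rotational flag:
# rotational=1 -> caches enabled (1), rotational=0 -> caches disabled (0).
expected_cache_default() {
    dev="$1"
    flag_file="$SYSFS_BLOCK/$dev/queue/rotational"
    [ -r "$flag_file" ] || { echo "cannot read $flag_file" >&2; return 1; }
    rotational="$(cat "$flag_file")"
    case "$rotational" in
        0) echo 0 ;;
        1) echo 1 ;;
        *) echo "unexpected value: $rotational" >&2; return 1 ;;
    esac
}

# Report every block device visible on this server.
for path in "$SYSFS_BLOCK"/*/queue/rotational; do
    [ -r "$path" ] || continue
    dev="${path#"$SYSFS_BLOCK"/}"
    dev="${dev%%/*}"
    printf '%s rotational=%s expected_cache_enable=%s\n' \
        "$dev" "$(cat "$path")" "$(expected_cache_default "$dev")"
done
```

Comparing this output with "lctl get_param osd-ldiskfs.*.*cache*enable" on the same server shows whether the tunables match what the current rotational flag implies.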

Until you update to a newer EXA and/or SFAOS, you can explicitly tune osd-ldiskfs.*.read_cache_enable=0 and ...writethrough_cache_enable=0, using a wildcard "*" if all of the OSTs/MDTs are flash based.  If you have a hybrid NVMe/HDD system, you can explicitly select a subset of OST/MDT devices to disable the caches.
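
A minimal sketch of the explicit tuning described above. It only prints the lctl set_param commands so they can be reviewed before being run on the OSS/MDS. The helper name disable_osd_cache_cmds and the target name testfs-OST0000 are hypothetical; substitute the target names that "lctl get_param osd-ldiskfs.*.*cache*enable" reports on your servers.

```shell
#!/bin/sh
# Print (not run) the lctl commands that disable both OSD caches.
# $1 is the target pattern: "*" for all OSTs/MDTs on this server, or a
# single target name on a hybrid NVMe/HDD system.
disable_osd_cache_cmds() {
    targets="${1:-*}"
    for tunable in read_cache_enable writethrough_cache_enable; do
        printf "lctl set_param 'osd-ldiskfs.%s.%s=0'\n" "$targets" "$tunable"
    done
}

# All-flash server: every ldiskfs target.
disable_osd_cache_cmds '*'

# Hybrid server: only the flash-backed target (hypothetical name).
disable_osd_cache_cmds 'testfs-OST0000'
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; piping the output to sh on the server would apply the settings.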

Cheers, Andreas

On May 20, 2022, at 02:49, Åke Sandgren <ake.sandgren at hpc2n.umu.se> wrote:
On 5/20/22 09:53, Andreas Dilger via lustre-discuss wrote:
To elaborate a bit on Patrick's answer, there is no mechanism to do this on the *client*, because the performance difference between client RAM and server storage is still fairly significant, especially if the application is doing sub-page read or write operations.
However, on the *server* the OSS and MDS will *not* put data from flash storage into the page cache, because using the kernel page cache has a measurable overhead, and (at least in our testing) the performance of NVMe IOPS is actually better *without* the page cache because more CPU is available to handle RPCs.  This is controlled on the server with osd-ldiskfs.*.{read_cache_enable,writethrough_cache_enable}, which default to 0 if the block device is non-rotational and to 1 if it is rotational.

Then my question is, what is it checking to determine non-rotational?

On our systems the NVMe disks have read/writethrough_cache_enable = 1 (DDN SFA400NVXE) with
/dev/sde on /lustre/stor10/ost0000 (NVMe)
cat /sys/block/sde/queue/rotational
lctl get_param osd-ldiskfs.*.*cache*enable

EXAScaler SFA CentOS 5.2.3-r5

Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: ake at hpc2n.umu.se  Mobile: +46 70 7716134  Fax: +46 90-580 14
WWW: http://www.hpc2n.umu.se
lustre-discuss mailing list
lustre-discuss at lists.lustre.org

Andreas Dilger
Lustre Principal Architect
