[lustre-discuss] How to speed up Lustre

Andreas Dilger adilger at whamcloud.com
Wed Jul 6 15:57:28 PDT 2022


On Jul 6, 2022, at 14:50, Thomas Roth <t.roth at gsi.de> wrote:

Yes, I got it.
But Marion states that they switched
> to a PFL arrangement, where the first 64k lives on flash OST's (mounted on our metadata servers), and the remainder of larger files lives on HDD OST's.

So, how do you specify a particular OST (or group of OSTs) in a PFL?
The OST equivalent of the "-L mdt" part?

With SSDs and HDDs making up the OSTs, I would have guessed OST pools, but I'm only aware of an "lfs setstripe" that puts all of a file into a pool. How do I put the first few kB of a file in pool A and the rest in pool B?

To create named subsets of OSTs (e.g. flash vs. disk), you need to create OST pools and then specify the pool name for the PFL components of the file:

mgs# lctl pool_new fsname.nvme
mgs# lctl pool_add fsname.nvme OST[0x0000-0x0004]
mgs# lctl pool_new fsname.disk
mgs# lctl pool_add fsname.disk OST[0x0005-0x000f]

client# lfs setstripe -E 1M -c 1 --pool nvme -E 1G --pool disk -E 16G -c 4 -E eof -c 32 <dir>

The parameters from each prior component are inherited unless replaced, so "-c 1" is inherited by the second component, and "--pool disk" is inherited by the third and fourth components, so they do not need to be specified each time.  Since the PFL components are fully independent file layouts, all of the parameters (stripe count, stripe size, pool name) can also be specified separately for each component.
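
For completeness, the pool membership and the resulting default layout on the directory can be checked afterwards (using the pool and directory names from the example above):

mgs# lctl pool_list fsname
mgs# lctl pool_list fsname.nvme
client# lfs getstripe -d <dir>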

Note that an overly large PFL layout (more than 3-4 components) can be counter-productive, as it may cause the layout of even small files to spill into an external xattr block and consume extra space and IOPS on the MDT.

Cheers, Andreas




Cheers
Thomas


On 7/6/22 21:42, Andreas Dilger wrote:
Thomas,
where the file data is stored depends entirely on the PFL layout used for the filesystem or parent directory.
For DoM files, you need to specify a DoM component, like:
    lfs setstripe -E 64K -L mdt -E 1G -c 1 -E 16G -c 4 -E eof -c 32 <dir>
so the first 64KB will be put onto the MDT where the file is created, the remaining 1GB onto a single OST, the next 15GB striped across 4 OSTs, and the rest of the file striped across (up to) 32 OSTs.
64KB is the minimum DoM component size, but if the files are smaller (e.g. 3KB) they will only allocate space on the MDT in multiples of 4KB blocks.  However, the default ldiskfs MDT formatting only leaves about 1 KB of space per inode, which would quickly run out unless DoM is restricted to specific directories with small files, or if the MDT is formatted with enough free space to accommodate this usage.  This is less of an issue with ZFS MDTs, but DoM files will still consume space much more quickly and reduce the available inode count by a factor of 16-64 more quickly than without DoM.
It is strongly recommended to use Lustre 2.15 with DoM to benefit from the automatic MDT space balancing, otherwise the MDT usage may become imbalanced if the admin (or users) do not actively manage the MDT selection for new user/project/job directories with "lfs mkdir -i".
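As a small sketch (the mount point, directory names, and MDT index below are only placeholders), DoM can be limited to a dedicated small-file directory, new top-level directories can be placed on a specific MDT, and MDT inode/space usage can be watched with "lfs df":
    lfs setstripe -E 64K -L mdt -E eof -c 1 /mnt/fsname/smallfiles
    lfs mkdir -i 1 /mnt/fsname/project2
    lfs df -i /mnt/fsname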
Cheers, Andreas
On Jul 6, 2022, at 10:48, Thomas Roth via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
Hi Marion,
I do not fully understand how to "mount flash OSTs on a metadata server"
- You have a couple of SSDs, you assemble these into one block device and format it with "mkfs.lustre --ost ..."? And then mount it just as any other OST (roughly as sketched below)?
- PFL then puts the first 64k on these OSTs and the rest of all files on the HDD-based OSTs?
So, no magic on the MDS?
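Something like this, I suppose (fsname, OST index, MGS NID, and device here are only placeholders):
oss# mkfs.lustre --ost --fsname=fsname --index=16 --mgsnode=192.168.1.1@tcp /dev/nvme0n1
oss# mount -t lustre /dev/nvme0n1 /mnt/fsname-ost16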
I'm asking because we are considering something similar, but we would not have these flash OSTs on the MDS hardware; they would be on separate OSS servers.
Regards,
Thomas
On 23/02/2022 04.35, Marion Hakanson via lustre-discuss wrote:
Hi again,
karagol at aselsan.com.tr said:
I was thinking that DoM is a built-in feature and that it can be enabled/disabled
online for certain directories. What do you mean by a reformat being needed to convert
to DoM (or away from it)? I think just the metadata target size is important.
When we first turned on DoM, it's likely that our Lustre system was old
enough to need to be reformatted in order to support it.  Our flash
storage RAID configuration also needed to be expanded, but the system
was not yet in production so a reformat was no big deal at the time.
So perhaps your system will not be subject to this requirement (other
than expanding your MDT flash somehow).
karagol at aselsan.com.tr said:
I also thought about creating a flash OST on the metadata server, but I was not sure
what to install on the metadata server for this purpose. Can a metadata server be an
OSS server at the same time? If it is possible, I would prefer a flash OST on the
metadata server instead of DoM, because our metadata target size is small and it
seems I would have to do risky operations to expand it.
Yes, our metadata servers are also OSS's at the same time.  The flash
OST's are separate volumes (and drives) from the MDT's, so less scary (:-).
karagol at aselsan.com.tr said:
IMHO, because of the lower RPC traffic, DoM shows better performance than a flash
OST. Am I right?
The documentation does say that using DoM for small files will produce
less RPC traffic than using OST's for small files.
But as I said earlier, for us, the amount of flash needed to support DoM
was a lot higher than with the flash OST approach (we have a high percentage,
by number, of small files).
I'll also note that we had a wish to mostly "set and forget" the layout
for our Lustre filesystem.  We have not figured out a way to predict
or control where small files (or large ones) are going to end up, so
trying to craft optimal layouts in particular directories for particular
file sizes has turned out to not be feasible for us.  PFL has been a
win for us here, for that reason.
Our conclusion was that in order to take advantage of the performance
improvements of DoM, you need enough money for lots of flash, or you need
enough staff time to manage the DoM layouts to fit into that flash.
We have neither of those conditions, and we find that using PFL and
flash OST's for small files is working very well for us.
Regards,
Marion
From: Taner KARAGÖL <karagol at aselsan.com.tr>
To: Marion Hakanson <hakansom at ohsu.edu>
CC: lustre-discuss at lists.lustre.org
Date: Tue, 22 Feb 2022 04:53:03 +0000
UNCLASSIFIED
Thank you for sharing your experience.
I was thinking that DoM is a built-in feature and that it can be enabled/disabled online for certain directories. What do you mean by a reformat being needed to convert to DoM (or away from it)? I think just the metadata target size is important.
I also thought about creating a flash OST on the metadata server, but I was not sure what to install on the metadata server for this purpose. Can a metadata server be an OSS server at the same time? If it is possible, I would prefer a flash OST on the metadata server instead of DoM, because our metadata target size is small and it seems I would have to do risky operations to expand it.
IMHO, because of the lower RPC traffic, DoM shows better performance than a flash OST. Am I right?
Best Regards;
From: Marion Hakanson <hakansom at ohsu.edu>
Sent: Thursday, February 17, 2022 8:20 PM
To: Taner KARAGÖL <karagol at aselsan.com.tr>
Cc: lustre-discuss at lists.lustre.org
Subject: Re: [lustre-discuss] How to speed up Lustre
We started with DoM on our new Lustre system a couple years ago.
  - Converting to DoM (or away from it) is a full-reformat operation.
  - DoM uses a fixed amount of metadata space (64k minimum for us) for every file, even those smaller than 64k.
Basically, DoM uses a lot of flash metadata space, more than we planned for, and more than we could afford.
We ended up switching to a PFL arrangement, where the first 64k lives on flash OST's (mounted on our metadata servers), and the remainder of larger files lives on HDD OST's.  This is working very well for our small-file workloads, and uses less flash space than the DoM configuration did.
Since you don't already have DoM in effect, it may be possible that you could add flash OST's, configure a PFL, and then use "lfs migrate" to re-layout existing files into the new OST's.  Your mileage may vary, so be safe!
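Something along these lines, as a rough sketch (the directory, file, and pool names are made up, and it assumes your Lustre version's "lfs migrate" accepts the same layout options as "lfs setstripe"):
  client# lfs setstripe -E 64K -c 1 -S 64K --pool flash -E eof -c 4 -S 1M --pool disk /mnt/fsname/data
  client# lfs migrate -E 64K -c 1 -S 64K --pool flash -E eof -c 4 -S 1M --pool disk /mnt/fsname/data/bigfile
The lfs_migrate helper script can walk whole directory trees of existing files; test on non-critical data first.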
Regards,
Marion
On Feb 14, 2022, at 03:32, Taner KARAGÖL via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:

UNCLASSIFIED
Hi Everybody;
We have a performance problem with small files on our HPC system (120 compute nodes). All of our OST targets are classic spinning HDDs. To speed things up, I want to configure Data on MDT (DoM). Our metadata target has SSD disks.
The underlying file systems are ZFS (for both the OSTs and the MDT).
Lustre version: 2.12.5
ZFS version: 0.7.13
Our Lustre file system size is 720 TB (2 OSS servers, 1 enclosure with 6 zpools); the metadata file system size is 2.1 TB (1 enclosure and 1 metadata target).
What are your opinions on speeding up this setup? I want to configure DoM, but I am concerned about the metadata target size. My questions:
  1.  How can I increase the metadata target size? The metadata enclosure has empty slots. Is there a way to increase the size online/offline?
  2.  Is it possible to migrate big files from DoM to OST targets completely? Of course, online migration. (So I think I can free metadata space for new small files.)
Best Regards;
Taner
--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 2.291
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
State Secretary / Staatssekretär Dr. Volkmar Dietz
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud






