[lustre-discuss] How to speed up Lustre

Wed Jul 6 23:56:52 PDT 2022

I haven't tried it, but the man page for "lfs setstripe" explains it under --pool.
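
As an untested sketch of what that could look like: assuming two OST pools already exist under the hypothetical names "flash" and "hdd", and assuming --pool/-p may be given per component in the same way -c is, the helper below only assembles an "lfs setstripe" command line (it does not run it). The directory path, pool names, and stripe counts are placeholders, not taken from Marion's setup.

```python
import shlex

def pfl_pool_command(directory, components):
    """Build (but do not run) an lfs setstripe command for a PFL layout.

    components is a list of (end, pool, stripe_count) tuples, where end is
    a component end such as "64K" or "eof", as in Andreas's example.
    """
    argv = ["lfs", "setstripe"]
    for end, pool, stripe_count in components:
        # Each -E starts a new component; -p and -c then apply to it.
        argv += ["-E", end, "-p", pool, "-c", str(stripe_count)]
    argv.append(directory)
    return argv

# Hypothetical pools and directory: first 64K on flash, the rest on HDD.
cmd = pfl_pool_command(
    "/mnt/lustre/smallfiles",
    [("64K", "flash", 1), ("eof", "hdd", 4)],
)
print(shlex.join(cmd))
```

Please check the printed command against the man page before running it on a real filesystem.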

Cheers,
Hans Henrik

On 06.07.2022 22.50, Thomas Roth via lustre-discuss wrote:
> Yes, I got it.
> But Marion states that they switched
> > to a PFL arrangement, where the first 64k lives on flash OST's
> > (mounted on our metadata servers), and the remainder of larger files
> > lives on HDD OST's.
>
> So, how do you specify particular OSTs (or a group of OSTs) in a PFL?
> Is there an OST equivalent of the "-L mdt" part?
>
> With SSDs and HDDs making up the OSTs, I would have guessed OST pools, 
> but I'm only aware of an "lfs setstripe" that puts all of my file into 
> a pool. How do I put the first few kB of a file in pool A and the rest 
> in pool B?
>
>
> Cheers
> Thomas
>
>
> On 7/6/22 21:42, Andreas Dilger wrote:
>> Thomas,
>> where the file data is stored depends entirely on the PFL layout used 
>> for the filesystem or parent directory.
>>
>> For DoM files, you need to specify a DoM component, like:
>>
>>      lfs setstripe -E 64K -L mdt -E 1G -c 1 -E 16G -c 4 -E eof -c 32 <dir>
>>
>> so the first 64KB will be put onto the MDT where the file is created, 
>> the rest of the first 1GB onto a single OST, the next 15GB striped 
>> across 4 OSTs, and the rest of the file striped across (up to) 32 OSTs.
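
To make the extents concrete, here is a small sketch that maps a byte offset to the component of that layout it would fall into. It only models the component boundaries from the command above, treating each component end as exclusive; it does not talk to Lustre, and the names LAYOUT and component_for are made up for illustration.

```python
KIB, GIB = 1024, 1024 ** 3

# (component end in bytes, where the data goes); None stands for "eof".
LAYOUT = [
    (64 * KIB, "MDT (DoM)"),
    (1 * GIB, "1 OST"),
    (16 * GIB, "4 OSTs"),
    (None, "up to 32 OSTs"),
]

def component_for(offset, layout=LAYOUT):
    """Return (start, end, target) of the component containing offset."""
    start = 0
    for end, target in layout:
        # A component covers [start, end); the last one runs to eof.
        if end is None or offset < end:
            return start, end, target
        start = end

for off in (0, 64 * KIB, 1 * GIB, 20 * GIB):
    print(off, component_for(off)[2])
```

So a 3KB file never leaves the MDT, while a 2GB file has data in the first three components.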
>>
>> 64KB is the minimum DoM component size, but if the files are smaller 
>> (e.g. 3KB) they will only allocate space on the MDT in multiples of 
>> 4KB blocks.  However, the default ldiskfs MDT formatting only leaves 
>> about 1 KB of space per inode, which would quickly run out unless DoM 
>> is restricted to specific directories with small files, or if the MDT 
>> is formatted with enough free space to accommodate this usage.  This 
>> is less of an issue with ZFS MDTs, but DoM files will still consume 
>> space much more quickly and reduce the available inode count by a 
>> factor of 16-64 compared to an MDT without DoM.
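
A rough way to estimate this, taking the numbers above at face value: the sketch assumes a 64KB DoM component and data space allocated in 4KB blocks (the ldiskfs case Andreas describes; ZFS will differ), and it ignores the inode itself. dom_mdt_bytes is a hypothetical helper, not a Lustre tool.

```python
DOM_COMPONENT = 64 * 1024   # DoM component size from the example layout
BLOCK = 4 * 1024            # assumed MDT allocation unit (ldiskfs block)

def dom_mdt_bytes(file_size, dom_size=DOM_COMPONENT, block=BLOCK):
    """Estimate the MDT data space one file uses under DoM."""
    on_mdt = min(file_size, dom_size)   # only the first component is on the MDT
    blocks = -(-on_mdt // block)        # ceiling division to whole blocks
    return blocks * block

# Example file sizes: empty, 3KB, 40KB, 10MB.
sizes = [0, 3 * 1024, 40 * 1024, 10 * 1024 * 1024]
print([dom_mdt_bytes(s) for s in sizes])
```

Summing this over a sample of real file sizes gives a first guess at how much MDT space a DoM layout would need.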
>>
>> It is strongly recommended to use Lustre 2.15 with DoM to benefit 
>> from the automatic MDT space balancing, otherwise the MDT usage may 
>> become imbalanced if the admin (or users) do not actively manage the 
>> MDT selection for new user/project/job directories with "lfs mkdir -i".
>>
>> Cheers, Andreas
>>
>> On Jul 6, 2022, at 10:48, Thomas Roth via lustre-discuss 
>> <lustre-discuss at lists.lustre.org> wrote:
>>
>> Hi Marion,
>>
>> I do not fully understand how to "mount flash OSTs on a metadata server":
>> - You have a couple of SSDs, you assemble these into one block device 
>> and format it with "mkfs.lustre --ost ..."? And then mount it just 
>> like any other OST?
>> - PFL then puts the first 64k on these OSTs and the rest of all files 
>> on the HDD-based OSTs?
>> So, no magic on the MDS?
>>
>> I'm asking because we are considering something similar, but we would 
>> not have these flash-OSTs in the MDS-hardware but on separate OSS 
>> servers.
>>
>>
>> Regards,
>> Thomas
>>
>> On 23/02/2022 04.35, Marion Hakanson via lustre-discuss wrote:
>> Hi again,
>>
>> karagol at aselsan.com.tr said:
>> > I was thinking that DoM is a built-in feature that can be enabled/disabled
>> > online for certain directories. What do you mean by reformatting to convert
>> > to DoM (or away from it)? I think only the metadata target size is
>> > important.
>>
>> When we first turned on DoM, it's likely that our Lustre system was old
>> enough to need to be reformatted in order to support it.  Our flash
>> storage RAID configuration also needed to be expanded, but the system
>> was not yet in production so a reformat was no big deal at the time.
>> So perhaps your system will not be subject to this requirement (other
>> than expanding your MDT flash somehow).
>>
>> karagol at aselsan.com.tr said:
>> > I also thought about creating a flash OST on the metadata server, but I
>> > was not sure what to install on the metadata server for this purpose. Can
>> > a metadata server be an OSS at the same time? If that is possible, I would
>> > prefer a flash OST on the metadata server instead of DoM. Because our
>> > metadata target is small, it seems I would have to do risky operations to
>> > expand it.
>>
>> Yes, our metadata servers are also OSS's at the same time.  The flash
>> OST's are separate volumes (and drives) from the MDT's, so less scary (:-).
>>
>> karagol at aselsan.com.tr said:
>> > imho, because of the reduced RPC traffic, DoM performs better than a
>> > flash OST. Am I right?
>>
>> The documentation does say that using DoM for small files will produce
>> less RPC traffic than using OST's for small files.
>>
>> But as I said earlier, for us, the amount of flash needed to support DoM
>> was a lot higher than with the flash OST approach (we have a high
>> percentage, by number, of small files).
>>
>> I'll also note that we had a wish to mostly "set and forget" the layout
>> for our Lustre filesystem.  We have not figured out a way to predict
>> or control where small files (or large ones) are going to end up, so
>> trying to craft optimal layouts in particular directories for particular
>> file sizes has turned out to not be feasible for us.  PFL has been a
>> win for us here, for that reason.
>>
>> Our conclusion was that in order to take advantage of the performance
>> improvements of DoM, you need enough money for lots of flash, or you need
>> enough staff time to manage the DoM layouts to fit into that flash.
>> We have neither of those conditions, and we find that using PFL and
>> flash OST's for small files is working very well for us.
>>
>> Regards,
>> Marion
>>
>> From: Taner KARAGÖL <karagol at aselsan.com.tr>
>> To: Marion Hakanson <hakansom at ohsu.edu>
>> CC: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
>> Date: Tue, 22 Feb 2022 04:53:03 +0000
>>
>> UNCLASSIFIED
>>
>> Thank you for sharing your experience.
>>
>> I was thinking that DoM is a built-in feature that can be 
>> enabled/disabled online for certain directories. What do you mean 
>> by reformatting to convert to DoM (or away from it)? I think only 
>> the metadata target size is important.
>>
>> I also thought about creating a flash OST on the metadata server, but 
>> I was not sure what to install on the metadata server for this purpose. 
>> Can a metadata server be an OSS at the same time? If that is possible, 
>> I would prefer a flash OST on the metadata server instead of DoM. 
>> Because our metadata target is small, it seems I would have to do 
>> risky operations to expand it.
>>
>> imho, because of the reduced RPC traffic, DoM performs better than a 
>> flash OST. Am I right?
>>
>> Best Regards;
>>
>>
>> From: Marion Hakanson <hakansom at ohsu.edu>
>> Sent: Thursday, February 17, 2022 8:20 PM
>> To: Taner KARAGÖL <karagol at aselsan.com.tr>
>> Cc: lustre-discuss at lists.lustre.org
>> Subject: Re: [lustre-discuss] How to speed up Lustre
>>
>> We started with DoM on our new Lustre system a couple years ago.
>>    - Converting to DoM (or away from it) is a full-reformat operation.
>>    - DoM uses a fixed amount of metadata space (64k minimum for us) 
>> for every file, even those smaller than 64k.
>>
>> Basically, DoM uses a lot of flash metadata space, more than we 
>> planned for, and more than we could afford.
>>
>> We ended up switching to a PFL arrangement, where the first 64k lives 
>> on flash OST's (mounted on our metadata servers), and the remainder 
>> of larger files lives on HDD OST's.  This is working very well for 
>> our small-file workloads, and uses less flash space than the DoM 
>> configuration did.
>>
>> Since you don't already have DoM in effect, it may be possible that 
>> you could add flash OST's, configure a PFL, and then use "lfs 
>> migrate" to re-layout existing files into the new OST's. Your mileage 
>> may vary, so be safe!
>>
>> Regards,
>>
>> Marion
>>
>>
>>
>> On Feb 14, 2022, at 03:32, Taner KARAGÖL via lustre-discuss 
>> <lustre-discuss at lists.lustre.org> wrote:
>> UNCLASSIFIED
>>
>> Hi Everybody;
>>
>> We have a performance problem with small files on our HPC system (120 
>> compute nodes). All of our OSS targets are classic spinning HDDs. To 
>> speed things up, I want to configure Data on Metadata (DoM). Our 
>> metadata target has SSD disks.
>>
>> The underlying file systems are ZFS (for both OSS and metadata).
>> Lustre version: 2.12.5
>> ZFS version: 0.7.13
>>
>> Our Lustre file system size is 720TB (2 OSS servers, 1 enclosure with 
>> 6 zpools); the metadata file system size is 2.1TB (1 enclosure and 1 
>> metadata target).
>>
>> What are your opinions on how to speed up this setup? I want to 
>> configure DoM but I am concerned about the metadata size. My questions:
>>
>>    1.  How can I increase the metadata size? The metadata enclosure has 
>> empty slots. Is there a way to increase the size online/offline?
>>    2.  Is it possible to migrate big files from DoM to OSS targets 
>> completely? Online migration, of course. (That way I think I can free 
>> metadata space for new small files.)
>>
>> Best Regards;
>> Taner
>> ________________________________
>> Attention:
>>
>> This e-mail message is privileged and confidential. If you are not 
>> the intended recipient please delete the message and notify the 
>> sender. E-mails to and from the company are monitored for operational 
>> reasons and in accordance with lawful business practices. Any views 
>> or opinions presented are solely those of the author and do not 
>> necessarily represent the views of the company.
>>
>> ________________________________
>>
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>>
>> -- 
>> --------------------------------------------------------------------
>> Thomas Roth
>> Department: Informationstechnologie
>> Location: SB3 2.291
>>
>>
>> GSI Helmholtzzentrum für Schwerionenforschung GmbH
>> Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
>>
>> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
>> Managing Directors / Geschäftsführung:
>> Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
>> Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
>> State Secretary / Staatssekretär Dr. Volkmar Dietz
>>
>> Cheers, Andreas
>> -- 
>> Andreas Dilger
>> Lustre Principal Architect
>> Whamcloud
>>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org