[lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?
Cameron Harr
harr1 at llnl.gov
Wed Jan 10 12:17:36 PST 2024
On 1/10/24 11:59, Thomas Roth via lustre-discuss wrote:
> Actually we had MDTs on software raid-1 *connecting two JBODs* for
> quite some time - worked surprisingly well and stable.
I'm glad it's working for you!
>
> Hmm, if you have your MDTs on a zpool of mirrors, aka raid-10, wouldn't
> going towards raidz2 increase data safety - something you don't need if
> the SSDs never fail anyway? Doesn't raidz2 protect against failure of
> *any* two disks, whereas in a pool of mirrors the second failure could
> destroy one mirror?
>
With raidz2 you can lose any two disks in the raid group, but there are
also a lot more drives that can fail. With mirrors, there's a 1:1
replacement ratio with essentially no rebuild time. Of course, that
assumes the 2 drives you lost weren't the 2 drives in the same mirror,
but we consider that low-probability. ZFS is also smart enough to (try
to) suspend the pool if it loses too many devices. And the striped
mirrors may see better performance than raidz2.
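To put a rough number on that low-probability case: in a pool of M two-way mirrors, once one drive has failed, only one of the remaining 2M-1 drives is its partner. A minimal sketch, assuming the 3x 2-drive mirror layout described elsewhere in this thread:

```shell
# Chance that a second, independent drive failure hits the partner of
# the already-failed drive (destroying one mirror): 1 in (2*M - 1).
# M = number of 2-way mirrors in the pool (assumed layout).
awk -v M=3 'BEGIN { printf "%.1f%%\n", 100 / (2 * M - 1) }'   # -> 20.0%
```

So with three mirrors, roughly one in five double-failures lands on the same mirror; wider pools of mirrors push that number down further.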
>
> Regards
> Thomas
>
> On 1/9/24 20:57, Cameron Harr via lustre-discuss wrote:
>> Thomas,
>>
>> We value management over performance and have knowingly left
>> performance on the floor in the name of standardization, robustness,
>> management, etc; while still maintaining our performance targets. We
>> are a heavy ZFS-on-Linux (ZoL) shop so we never considered MD-RAID,
>> which, IMO, is very far behind ZoL in enterprise storage features.
>>
>> As Jeff mentioned, we have done some tuning (and if you haven't
>> noticed there are *a lot* of possible ZFS parameters) to further
>> improve performance and are at a good place performance-wise.
>>
>> Cameron
>>
>> On 1/8/24 10:33, Jeff Johnson wrote:
>>> Today nvme/mdraid/ldiskfs will beat nvme/zfs on MDS IOPS, but you can
>>> close the gap somewhat with tuning: zfs ashift/recordsize and special
>>> allocation class vdevs. While the IOPS performance favors
>>> nvme/mdraid/ldiskfs, there are tradeoffs. The snapshot/backup abilities
>>> of ZFS and the security it provides to the most critical function in a
>>> Lustre file system shouldn't be undervalued. From personal experience,
>>> I'd much rather deal with zfs in the event of a seriously jackknifed
>>> MDT than mdraid/ldiskfs and both zfs and mdraid/ldiskfs are preferable
>>> to trying to unscramble a vendor blackbox hwraid volume. ;-)
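The knobs Jeff mentions could look roughly like this; the pool name, device names, and recordsize value are assumptions, not a tested recipe:

```shell
# Hypothetical: create the MDT pool with 4 KiB sectors (ashift=12)
# and a special allocation class vdev, as mentioned above.
zpool create -o ashift=12 mdtpool \
      mirror /dev/nvme0n1 /dev/nvme1n1 \
      special mirror /dev/nvme2n1 /dev/nvme3n1

# recordsize is set per dataset; the value here is illustrative.
zfs set recordsize=16k mdtpool
```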
>>>
>>> When zfs directio lands and is fully integrated into Lustre the
>>> performance differences *should* be negligible.
>>>
>>> Just my $.02 worth
>>>
>>> On Mon, Jan 8, 2024 at 8:23 AM Thomas Roth via lustre-discuss
>>> <lustre-discuss at lists.lustre.org> wrote:
>>>> Hi Cameron,
>>>>
>>>> did you run a performance comparison between ZFS and mdadm-raid on
>>>> the MDTs?
>>>> I'm currently doing some tests, and the results favor software
>>>> raid, in particular when it comes to IOPS.
>>>>
>>>> Regards
>>>> Thomas
>>>>
>>>> On 1/5/24 19:55, Cameron Harr via lustre-discuss wrote:
>>>>> This doesn't answer your question about ldiskfs on zvols, but
>>>>> we've been running MDTs on ZFS on NVMe in production for a couple
>>>>> years (and on SAS SSDs for many years prior). Our current
>>>>> production MDTs using NVMe consist of one zpool/node made up of 3x
>>>>> 2-drive mirrors, but we've been experimenting lately with using
>>>>> raidz3 and possibly even raidz2 for MDTs since SSDs have been
>>>>> pretty reliable for us.
>>>>>
>>>>> Cameron
>>>>>
>>>>> On 1/5/24 9:07 AM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology,
>>>>> Inc.] via lustre-discuss wrote:
>>>>>> We are in the process of retiring two long-standing LFSs (about
>>>>>> 8 years old), which we built and managed ourselves. Both use ZFS
>>>>>> and have the MDTs on SSDs in a JBOD that requires the kind of
>>>>>> software-based management you describe, in our case ZFS pools
>>>>>> built on multipath devices. The MDT in one LFS is ZFS and the MDT
>>>>>> in the other is ldiskfs but uses ZFS and a zvol as you describe
>>>>>> - we build the ldiskfs MDT on top of the zvol. Generally, this
>>>>>> has worked well for us, with one big caveat. If you look for my
>>>>>> posts to this list and the ZFS list you'll find more details.
>>>>>> The short version is that we utilize ZFS snapshots and clones to
>>>>>> do backups of the metadata. We've run into situations where the
>>>>>> backup process stalls, leaving a clone hanging around. We've
>>>>>> experienced a situation a couple of times where the clone and the
>>>>>> primary zvol get swapped, effectively rolling back our metadata
>>>>>> to the point when the clone was created. I have tried,
>>>>>> unsuccessfully, to recreate
>>>>>> that in a test environment. So if you do that kind of setup,
>>>>>> make sure you have good monitoring in place to detect if your
>>>>>> backups/clones stall. We've kept up with lustre and ZFS updates
>>>>>> over the years and are currently on lustre 2.14 and ZFS 2.1.
>>>>>> We've seen the gap between our ZFS MDT and ldiskfs performance
>>>>>> shrink to the point where they are pretty much on par with each
>>>>>> other now. I think our ZFS MDT performance could be better with more
>>>>>> hardware and software tuning but our small team hasn't had the
>>>>>> bandwidth to tackle that.
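The monitoring Darby suggests could start as simply as listing leftover clones after each backup window; the pool name here is a placeholder:

```shell
# Datasets with a non-empty 'origin' property are clones; any that
# linger after a backup run may indicate a stalled backup process.
zfs list -H -o name,origin -r mdtpool | awk '$2 != "-"'
```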
>>>>>>
>>>>>> Our newest LFS is vendor-provided and uses NVMe MDTs. I'm not at
>>>>>> liberty to talk about the proprietary way those devices are
>>>>>> managed. However, the metadata performance is SO much better
>>>>>> than on our older LFSs, for a lot of reasons, and I'd highly
>>>>>> recommend NVMe for your MDTs.
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org>
>>>>>> on behalf of Thomas Roth via lustre-discuss
>>>>>> <lustre-discuss at lists.lustre.org>
>>>>>> Reply-To: Thomas Roth <t.roth at gsi.de>
>>>>>> Date: Friday, January 5, 2024 at 9:03 AM
>>>>>> To: Lustre Diskussionsliste <lustre-discuss at lists.lustre.org>
>>>>>> Subject: [EXTERNAL] [BULK] [lustre-discuss] MDS hardware - NVME?
>>>>>>
>>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>>
>>>>>> we are considering NVMe storage for the next MDS.
>>>>>>
>>>>>>
>>>>>> As I understand it, NVMe disks are bundled in software rather
>>>>>> than by a hardware RAID controller.
>>>>>> This would be done using Linux software RAID (mdadm), correct?
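For reference, the mdadm route Thomas asks about might be sketched like this; device names, fsname, and the MGS NID are assumptions:

```shell
# Hypothetical: assemble four NVMe namespaces into RAID-10,
# then format the array as an ldiskfs MDT.
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkfs.lustre --mdt --backfstype=ldiskfs --fsname=testfs --index=0 \
      --mgsnode=mgs@tcp /dev/md0
```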
>>>>>>
>>>>>>
>>>>>> We have some experience with ZFS, which we use on our OSTs.
>>>>>> But I would like to stick with ldiskfs for the MDTs - and a
>>>>>> zpool with a zvol on top, which is then formatted with ldiskfs,
>>>>>> is too much voodoo...
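The zvol-on-zpool construction being dismissed here (and used on one of the systems described elsewhere in this thread) would look roughly like this; pool name, devices, size, and Lustre parameters are placeholders:

```shell
# Hypothetical: mirrored zpool, a zvol carved out of it, and
# ldiskfs (via mkfs.lustre) on top of the zvol.
zpool create mdtpool mirror /dev/nvme0n1 /dev/nvme1n1
zfs create -V 4T mdtpool/mdt0
mkfs.lustre --mdt --backfstype=ldiskfs --fsname=testfs --index=0 \
      --mgsnode=mgs@tcp /dev/zvol/mdtpool/mdt0
```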
>>>>>>
>>>>>>
>>>>>> How is this handled elsewhere? Any experiences?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> The available devices are quite large. If I create a raid-10 out
>>>>>> of 4 disks of, e.g., 7 TB each, my MDT will be 14 TB - already
>>>>>> close to the 16 TB limit.
>>>>>> So there is no need for a box with lots of U.3 slots.
>>>>>>
>>>>>>
>>>>>> But for MDS operations, we will still need a powerful dual-CPU
>>>>>> system with lots of RAM.
>>>>>> Should the NVMe devices then be distributed between the CPUs?
>>>>>> Is there a way to specify this in a call for tender?
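On delivered hardware, the CPU attachment of each NVMe device can be checked via sysfs; this is a quick verification sketch, not tender language:

```shell
# Print the NUMA node each NVMe controller hangs off; -1 means
# the platform reports no locality information.
for d in /sys/class/nvme/nvme*; do
    printf '%s: NUMA node %s\n' "$(basename "$d")" "$(cat "$d/device/numa_node")"
done
```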
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------
>>>>>> Thomas Roth
>>>>>>
>>>>>>
>>>>>> GSI Helmholtzzentrum für Schwerionenforschung GmbH
>>>>>> Planckstraße 1, 64291 Darmstadt, Germany,
>>>>>> http://www.gsi.de/
>>>>>>
>>>>>>
>>>>>> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB
>>>>>> 1528
>>>>>> Managing Directors / Geschäftsführung:
>>>>>> Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
>>>>>> Chairman of the Supervisory Board / Vorsitzender des
>>>>>> GSI-Aufsichtsrats:
>>>>>> State Secretary / Staatssekretär Dr. Volkmar Dietz
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> lustre-discuss mailing list
>>>>>> lustre-discuss at lists.lustre.org
>>>>>> <mailto:lustre-discuss at lists.lustre.org>
>>>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>
>>>
>