[lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] darby.vicker-1 at nasa.gov
Mon Jan 8 07:23:21 PST 2024


Our setup has a single JBOD connected to two servers, but the JBOD has dual controllers.  Each server connects to both controllers for redundancy, so there are four connections to each server.  This gives us a paired HA setup in which either node can take over the OSTs/MDTs of its peer.  Some specifics on our hardware:

Supermicro twin servers:
https://www.supermicro.com/products/archive/system/sys-6027tr-d71frf

JBOD:
https://www.supermicro.com/products/archive/chassis/sc946ed-r2kjbod

Each server in a pair can “zpool import” all of the pools from either node in that pair.  Here is an excerpt from our ldev.conf file:


#local  foreign/-  label       [md|zfs:]device-path   [journal-path]/- [raidtab]

# primary hpfs-fsl (aka /nobackup) lustre file system
hpfs-fsl-mds0.fsl.jsc.nasa.gov  hpfs-fsl-mds1.fsl.jsc.nasa.gov  hpfs-fsl-MDT0000  zfs:mds0-0/meta-fsl

hpfs-fsl-oss00.fsl.jsc.nasa.gov hpfs-fsl-oss01.fsl.jsc.nasa.gov hpfs-fsl-OST0000  zfs:oss00-0/ost-fsl
hpfs-fsl-oss00.fsl.jsc.nasa.gov hpfs-fsl-oss01.fsl.jsc.nasa.gov hpfs-fsl-OST000c  zfs:oss00-1/ost-fsl

hpfs-fsl-oss01.fsl.jsc.nasa.gov hpfs-fsl-oss00.fsl.jsc.nasa.gov hpfs-fsl-OST0001  zfs:oss01-0/ost-fsl
hpfs-fsl-oss01.fsl.jsc.nasa.gov hpfs-fsl-oss00.fsl.jsc.nasa.gov hpfs-fsl-OST000d  zfs:oss01-1/ost-fsl



If you wanted to fail oss01’s OSTs over to oss00, you’d do a “service lustre stop” on oss01 followed by a “service lustre start foreign” on oss00.  This setup has been stable and has served us well for a long time.  Our servers are reliable enough that we never set up automated failover via Corosync or anything similar.
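
Roughly, the manual failover sequence looks like this (hostnames are taken from the ldev.conf excerpt above; the exact init-script names depend on your distribution, so treat this as a sketch rather than our literal runbook):

# on the node being taken down: unmount its targets and export its pools
[root@hpfs-fsl-oss01]# service lustre stop

# on the surviving peer: import the "foreign" pools listed in the second
# column of ldev.conf and mount those OSTs locally
[root@hpfs-fsl-oss00]# service lustre start foreign

# sanity checks on the surviving peer
[root@hpfs-fsl-oss00]# zpool list    # oss01-0 and oss01-1 should now be imported here
[root@hpfs-fsl-oss00]# lctl dl       # oss01's OSTs should now be listed as local devices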



From: Vinícius Ferrão <ferrao at versatushpc.com.br>
Date: Sunday, January 7, 2024 at 12:06 PM
To: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" <darby.vicker-1 at nasa.gov>
Cc: Thomas Roth <t.roth at gsi.de>, Lustre Diskussionsliste <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?

Hi Vicker, may I ask if you have any kind of HA on this setup?

If yes, I’m interested in how the ZFS pools would migrate from one server to another in case of failure. I’m considering the typical Lustre deployment where you have two servers attached to two JBODs using a multipath SAS topology with crossed cables: |X|.

I can easily understand how that works when you have hardware RAID running on the JBOD and SAS HBAs on the servers, but for an all-software solution I’m unsure how that would work effectively.

Thank you.


On 5 Jan 2024, at 14:07, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:

We are in the process of retiring two long-standing LFSs (about 8 years old), which we built and managed ourselves.  Both use ZFS and have the MDTs on SSDs in a JBOD that requires the kind of software-based management you describe, in our case ZFS pools built on multipath devices.  The MDT in one is ZFS; the MDT in the other LFS is ldiskfs but uses ZFS and a zvol as you describe - we build the ldiskfs MDT on top of the zvol.  Generally this has worked well for us, with one big caveat.  If you look for my posts to this list and the ZFS list you'll find more details.  The short version is that we use ZFS snapshots and clones to do backups of the metadata.  We've run into situations where the backup process stalls, leaving a clone hanging around.  A couple of times the clone and the primary zvol got swapped, effectively rolling back our metadata to the point when the clone was created.  I have tried, unsuccessfully, to recreate that in a test environment.  So if you do that kind of setup, make sure you have good monitoring in place to detect whether your backups/clones stall.

We've kept up with Lustre and ZFS updates over the years and are currently on Lustre 2.14 and ZFS 2.1.  We've seen the gap between our ZFS MDT and ldiskfs performance shrink to the point where the two are pretty much on par with each other now.  I think our ZFS MDT performance could be better with more hardware and software tuning, but our small team hasn't had the bandwidth to tackle that.
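
For what it's worth, the backup flow is conceptually along these lines (the snapshot and clone names here are made up, borrowing the pool name from the ldev.conf excerpt above, and this is not our actual script), which shows where a stalled run can leave a clone behind:

# snapshot the live MDT dataset and clone it as a consistent backup source
zfs snapshot mds0-0/meta-fsl@backup-20240108
zfs clone mds0-0/meta-fsl@backup-20240108 mds0-0/meta-fsl-bkp

# ... back up from the clone (zfs send, tar of the mounted clone, etc.) ...

# clean up; if the backup stalls before this point, the clone stays around
zfs destroy mds0-0/meta-fsl-bkp
zfs destroy mds0-0/meta-fsl@backup-20240108

# monitoring check: any dataset with a non-empty origin is a leftover clone
zfs list -H -o name,origin | awk '$2 != "-"'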

Our newest LFS is vendor provided and uses NVMe MDTs.  I'm not at liberty to talk about the proprietary way those devices are managed.  However, the metadata performance is SO much better than on our older LFSs, for a lot of reasons, and I'd highly recommend NVMe for your MDTs.

-----Original Message-----
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Thomas Roth via lustre-discuss <lustre-discuss at lists.lustre.org>
Reply-To: Thomas Roth <t.roth at gsi.de>
Date: Friday, January 5, 2024 at 9:03 AM
To: Lustre Diskussionsliste <lustre-discuss at lists.lustre.org>
Subject: [EXTERNAL] [BULK] [lustre-discuss] MDS hardware - NVME?


Dear all,


We are considering NVMe storage for our next MDS.


As I understand it, NVMe disks are aggregated in software, not by a hardware RAID controller.
This would be done using Linux software RAID, i.e. mdadm, correct?
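
Something along these lines, I assume - device names and the fsname/MGS NID below are just placeholders:

# software RAID-10 across four NVMe devices, then the ldiskfs MDT on top
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkfs.lustre --mdt --backfstype=ldiskfs --fsname=hpc --index=0 \
      --mgsnode=10.0.0.1@o2ib /dev/md0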


We have some experience with ZFS, which we use on our OSTs.
But I would like to stick with ldiskfs for the MDTs - and a zpool with a zvol on top, which is then formatted with ldiskfs, seems like too much voodoo...
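
The zvol variant would presumably look something like this (made-up pool and volume names):

# mirrored zpool on the NVMe devices, a zvol carved out of it,
# and the ldiskfs MDT built on top of the zvol
zpool create mdt0pool mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1
zfs create -V 10T mdt0pool/mdt0
mkfs.lustre --mdt --backfstype=ldiskfs --fsname=hpc --index=0 \
      --mgsnode=10.0.0.1@o2ib /dev/zvol/mdt0pool/mdt0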


How is this handled elsewhere? Any experiences?




The available devices are quite large. If I create a RAID-10 out of 4 disks of e.g. 7 TB each, my MDT will be 14 TB - already close to the 16 TB limit.
So there is no need for a box with lots of U.3 slots.


But for MDS operations we will still need a powerful dual-CPU system with lots of RAM.
Should the NVMe devices then be distributed between the CPUs?
Is there a way to specify this in a call for tender?
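
What I would want to verify on delivered hardware is something like this, i.e. that the NVMe controllers end up split across both sockets (assuming the usual Linux sysfs layout):

# NUMA node each NVMe controller is attached to
for c in /sys/class/nvme/nvme*; do
    echo "$(basename $c): numa_node=$(cat $c/device/numa_node)"
done
# CPU <-> NUMA node mapping for comparison
lscpu | grep -i NUMA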




Best regards,
Thomas


--------------------------------------------------------------------
Thomas Roth


GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, http://www.gsi.de/


Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
State Secretary / Staatssekretär Dr. Volkmar Dietz




_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


