[lustre-discuss] [EXTERNAL] [BULK] MDS hardware - NVME?
Cameron Harr
harr1 at llnl.gov
Wed Jan 10 12:17:36 PST 2024
On 1/10/24 11:59, Thomas Roth via lustre-discuss wrote:
> Actually we had MDTs on software raid-1 *connecting two JBODs* for
> quite some time - worked surprisingly well and stable.
I'm glad it's working for you!
>
> Hmm, if you have your MDTs on a zpool of mirrors, aka raid-10, wouldn't
> going towards raidz2 increase data safety - something you don't need if
> the SSDs never fail anyway? Doesn't raidz2 protect against failure of
> *any* two disks, whereas in a pool of mirrors the second failure could
> destroy one mirror?
>
With raidz2 you can lose any two disks in the raid group, but there are
also a lot more drives that can fail. With mirrors, there's a 1:1
replacement ratio with essentially no rebuild time. Of course, that
assumes the 2 drives you lost weren't the 2 drives in the same mirror,
but we consider that low-probability. ZFS is also smart enough to (try
to) suspend the pool if it loses too many devices. And the striped
mirrors may see better performance than raidz2.
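To put a rough number on that low-probability case: in a pool of M two-way mirrors, once one drive has failed, only one of the remaining 2M-1 drives is its partner. A minimal sketch, assuming the 3x 2-drive mirror layout described elsewhere in this thread:

```shell
# Chance that a second, independent drive failure hits the partner of
# the already-failed drive (destroying one mirror): 1 in (2*M - 1).
# M = number of 2-way mirrors in the pool (assumed layout).
awk -v M=3 'BEGIN { printf "%.1f%%\n", 100 / (2 * M - 1) }'   # -> 20.0%
```

So with three mirrors, roughly one in five double-failures lands on the same mirror; wider pools of mirrors push that number down further.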
>
> Regards
> Thomas
>
> On 1/9/24 20:57, Cameron Harr via lustre-discuss wrote:
>> Thomas,
>>
>> We value management over performance and have knowingly left
>> performance on the floor in the name of standardization, robustness,
>> management, etc; while still maintaining our performance targets. We
>> are a heavy ZFS-on-Linux (ZoL) shop so we never considered MD-RAID,
>> which, IMO, is very far behind ZoL in enterprise storage features.
>>
>> As Jeff mentioned, we have done some tuning (and if you haven't
>> noticed there are *a lot* of possible ZFS parameters) to further
>> improve performance and are at a good place performance-wise.
>>
>> Cameron
>>
>> On 1/8/24 10:33, Jeff Johnson wrote:
>>> Today nvme/mdraid/ldiskfs will beat nvme/zfs on MDS IOPS, but you can
>>> close the gap somewhat with tuning: zfs ashift/recordsize and special
>>> allocation class vdevs. While the IOPS performance favors
>>> nvme/mdraid/ldiskfs, there are tradeoffs. The snapshot/backup abilities
>>> of ZFS and the security it provides to the most critical function in a
>>> Lustre file system shouldn't be undervalued. From personal experience,
>>> I'd much rather deal with zfs in the event of a seriously jackknifed
>>> MDT than mdraid/ldiskfs and both zfs and mdraid/ldiskfs are preferable
>>> to trying to unscramble a vendor blackbox hwraid volume. ;-)
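The knobs Jeff mentions could look roughly like this; the pool name, device names, and recordsize value are assumptions, not a tested recipe:

```shell
# Hypothetical: create the MDT pool with 4 KiB sectors (ashift=12)
# and a special allocation class vdev, as mentioned above.
zpool create -o ashift=12 mdtpool \
      mirror /dev/nvme0n1 /dev/nvme1n1 \
      special mirror /dev/nvme2n1 /dev/nvme3n1

# recordsize is set per dataset; the value here is illustrative.
zfs set recordsize=16k mdtpool
```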
>>>
>>> When zfs directio lands and is fully integrated into Lustre the
>>> performance differences *should* be negligible.
>>>
>>> Just my $.02 worth
>>>
>>> On Mon, Jan 8, 2024 at 8:23 AM Thomas Roth via lustre-discuss
>>> <lustre-discuss at lists.lustre.org> wrote:
>>>> Hi Cameron,
>>>>
>>>> did you run a performance comparison between ZFS and mdadm-raid on
>>>> the MDTs?
>>>> I'm currently doing some tests, and the results favor software
>>>> raid, in particular when it comes to IOPS.
>>>>
>>>> Regards
>>>> Thomas
>>>>
>>>> On 1/5/24 19:55, Cameron Harr via lustre-discuss wrote:
>>>>> This doesn't answer your question about ldiskfs on zvols, but
>>>>> we've been running MDTs on ZFS on NVMe in production for a couple
>>>>> years (and on SAS SSDs for many years prior). Our current
>>>>> production MDTs using NVMe consist of one zpool/node made up of 3x
>>>>> 2-drive mirrors, but we've been experimenting lately with using
>>>>> raidz3 and possibly even raidz2 for MDTs since SSDs have been
>>>>> pretty reliable for us.
>>>>>
>>>>> Cameron
>>>>>
>>>>> On 1/5/24 9:07 AM, Vicker, Darby J. (JSC-EG111)[Jacobs Technology,
>>>>> Inc.] via lustre-discuss wrote:
>>>>>> We are in the process of retiring two long-standing LFSs (about
>>>>>> 8 years old), which we built and managed ourselves. Both use ZFS
>>>>>> and have the MDTs on SSDs in a JBOD that requires the kind of
>>>>>> software-based management you describe, in our case ZFS pools
>>>>>> built on multipath devices. The MDT in one LFS is ZFS and the MDT
>>>>>> in the other is ldiskfs but uses ZFS and a zvol as you describe
>>>>>> - we build the ldiskfs MDT on top of the zvol. Generally, this
>>>>>> has worked well for us, with one big caveat. If you look for my
>>>>>> posts to this list and the ZFS list you'll find more details.
>>>>>> The short version is that we utilize ZFS snapshots and clones to
>>>>>> do backups of the metadata. We've run into situations where the
>>>>>> backup process stalls, leaving a clone hanging around. We've
>>>>>> experienced a situation a couple of times where the clone and the
>>>>>> primary zvol get swapped, effectively rolling back our metadata
>>>>>> to the point when the clone was created. I have tried,
>>>>>> unsuccessfully, to recreate
>>>>>> that in a test environment. So if you do that kind of setup,
>>>>>> make sure you have good monitoring in place to detect if your
>>>>>> backups/clones stall. We've kept up with lustre and ZFS updates
>>>>>> over the years and are currently on lustre 2.14 and ZFS 2.1.
>>>>>> We've seen the gap between our ZFS MDT and ldiskfs performance
>>>>>> shrink to the point where they are pretty much on par with each
>>>>>> other now. I think our ZFS MDT performance could be better with more
>>>>>> hardware and software tuning but our small team hasn't had the
>>>>>> bandwidth to tackle that.
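The monitoring Darby suggests could start as simply as listing leftover clones after each backup window; the pool name here is a placeholder:

```shell
# Datasets with a non-empty 'origin' property are clones; any that
# linger after a backup run may indicate a stalled backup process.
zfs list -H -o name,origin -r mdtpool | awk '$2 != "-"'
```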
>>>>>>
>>>>>> Our newest LFS is vendor-provided and uses NVMe MDTs. I'm not at
>>>>>> liberty to talk about the proprietary way those devices are
>>>>>> managed. However, the metadata performance is SO much better
>>>>>> than on our older LFSs, for a lot of reasons, and I'd highly
>>>>>> recommend NVMe for your MDTs.
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org>
>>>>>> on behalf of Thomas Roth via lustre-discuss
>>>>>> <lustre-discuss at lists.lustre.org>
>>>>>> Reply-To: Thomas Roth <t.roth at gsi.de>
>>>>>> Date: Friday, January 5, 2024 at 9:03 AM
>>>>>> To: Lustre Diskussionsliste <lustre-discuss at lists.lustre.org>
>>>>>> Subject: [EXTERNAL] [BULK] [lustre-discuss] MDS hardware - NVME?
>>>>>>
>>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>>
>>>>>> we are considering NVMe storage for the next MDS.
>>>>>>
>>>>>>
>>>>>> As I understand it, NVMe disks are bundled in software rather
>>>>>> than by a hardware RAID controller.
>>>>>> This would be done using Linux software RAID (mdadm), correct?
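For reference, the mdadm route Thomas asks about might be sketched like this; device names, fsname, and the MGS NID are assumptions:

```shell
# Hypothetical: assemble four NVMe namespaces into RAID-10,
# then format the array as an ldiskfs MDT.
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkfs.lustre --mdt --backfstype=ldiskfs --fsname=testfs --index=0 \
      --mgsnode=mgs@tcp /dev/md0
```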
>>>>>>
>>>>>>
>>>>>> We have some experience with ZFS, which we use on our OSTs.
>>>>>> But I would like to stick with ldiskfs for the MDTs - and a
>>>>>> zpool with a zvol on top, which is then formatted with ldiskfs,
>>>>>> is too much voodoo...
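The zvol-on-zpool construction being dismissed here (and used on one of the systems described elsewhere in this thread) would look roughly like this; pool name, devices, size, and Lustre parameters are placeholders:

```shell
# Hypothetical: mirrored zpool, a zvol carved out of it, and
# ldiskfs (via mkfs.lustre) on top of the zvol.
zpool create mdtpool mirror /dev/nvme0n1 /dev/nvme1n1
zfs create -V 4T mdtpool/mdt0
mkfs.lustre --mdt --backfstype=ldiskfs --fsname=testfs --index=0 \
      --mgsnode=mgs@tcp /dev/zvol/mdtpool/mdt0
```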
>>>>>>
>>>>>>
>>>>>> How is this handled elsewhere? Any experiences?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> The available devices are quite large. If I create a raid-10 out
>>>>>> of 4 disks of, e.g., 7 TB each, my MDT will be 14 TB - already
>>>>>> close to the 16 TB limit.
>>>>>> So there is no need for a box with lots of U.3 slots.
>>>>>>
>>>>>>
>>>>>> But for MDS operations, we will still need a powerful dual-CPU
>>>>>> system with lots of RAM.
>>>>>> Should the NVMe devices then be distributed between the CPUs?
>>>>>> Is there a way to specify this in a call for tender?
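On delivered hardware, the CPU attachment of each NVMe device can be checked via sysfs; this is a quick verification sketch, not tender language:

```shell
# Print the NUMA node each NVMe controller hangs off; -1 means
# the platform reports no locality information.
for d in /sys/class/nvme/nvme*; do
    printf '%s: NUMA node %s\n' "$(basename "$d")" "$(cat "$d/device/numa_node")"
done
```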
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Best regards,
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------------------------
>>>>>> Thomas Roth
>>>>>>
>>>>>>
>>>>>> GSI Helmholtzzentrum für Schwerionenforschung GmbH
>>>>>> Planckstraße 1, 64291 Darmstadt, Germany,
>>>>>> http://www.gsi.de/
>>>>>>
>>>>>>
>>>>>> Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB
>>>>>> 1528
>>>>>> Managing Directors / Geschäftsführung:
>>>>>> Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
>>>>>> Chairman of the Supervisory Board / Vorsitzender des
>>>>>> GSI-Aufsichtsrats:
>>>>>> State Secretary / Staatssekretär Dr. Volkmar Dietz
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> lustre-discuss mailing list
>>>>>> lustre-discuss at lists.lustre.org
>>>>>> <mailto:lustre-discuss at lists.lustre.org>
>>>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>
>>>
>