[lustre-discuss] Lustre and ZFS draid

Cameron Harr harr1 at llnl.gov
Fri Feb 7 10:26:49 PST 2025


We've been using draid in production since 2020 and I think we're 
generally happy with it. We have quite a few Lustre clusters, and on the 
majority of them we run 90-drive JBODs with 1 OST/OSS node, 1 OST/pool 
and 1 pool/JBOD. We use a draid2:8d:90c:2s config and let the 
distributed spares rebuild (~2-4 hours) before replacing the 16TB 
physical disk, which then rebuilds within a day or so.
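
For anyone curious, a rough sketch of how that layout and the 
spare/replace cycle might look on the command line. The disk names are 
placeholders and assume a vdev_id.conf that maps enclosure slots to the 
L0-L89 aliases; this is not our exact procedure, just the general shape:

	# Hypothetical sketch: one draid2 vdev, 8 data disks per stripe,
	# 90 children, 2 distributed spares
	zpool create asp8 draid2:8d:90c:2s /dev/disk/by-vdev/L{0..89}

	# On a drive failure: rebuild onto a distributed spare first (fast),
	# then resilver the replacement physical disk; the spare returns to
	# AVAIL once that finishes. <new-disk> is a placeholder.
	zpool replace asp8 L42 draid2-0-0
	zpool replace asp8 L42 <new-disk>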

An important note about this configuration is that we also include NVMe 
in the pool as special allocation devices, configured to store small 
blocks up to 16K. We probably have much more NVMe space than we need 
because of the large NVMe drives (zpool list -v shows usage on each 
mirror is still low), but we're happy with the performance. A sketch of 
this setup follows the status output below.

	NAME                  STATE     READ WRITE CKSUM
	asp8                  ONLINE       0     0     0
	  draid2:8d:90c:2s-0  ONLINE       0     0     0
	    L0                ONLINE       0     0     0
	    L1                ONLINE       0     0     0
...
	    L88               ONLINE       0     0     0
	    L89               ONLINE       0     0     0
	special
	  mirror-1            ONLINE       0     0     0
	    N6                ONLINE       0     0     0
	    N7                ONLINE       0     0     0
	  mirror-2            ONLINE       0     0     0
	    N8                ONLINE       0     0     0
	    N9                ONLINE       0     0     0
	  mirror-3            ONLINE       0     0     0
	    N10               ONLINE       0     0     0
	    N11               ONLINE       0     0     0
	spares
	  draid2-0-0          AVAIL
	  draid2-0-1          AVAIL
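
As a rough sketch, the special mirrors and the small-block cutoff above 
could be set up along these lines (device names are placeholders; 
special_small_blocks is a dataset property, so in practice you would set 
it on the OST dataset rather than relying on inheritance from the pool 
root):

	# Hypothetical sketch: add NVMe mirrors as special allocation
	# devices, then route blocks of 16K or smaller (plus metadata)
	# to them
	zpool add asp8 special \
	    mirror /dev/disk/by-vdev/N6 /dev/disk/by-vdev/N7 \
	    mirror /dev/disk/by-vdev/N8 /dev/disk/by-vdev/N9 \
	    mirror /dev/disk/by-vdev/N10 /dev/disk/by-vdev/N11
	zfs set special_small_blocks=16K asp8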

On our newest systems, we have some 106-drive JBODs with 20TB drives, 
and in order to reduce the chance of multiple disk failures within a 
single draid device, we reconfigured the pools to have 2 draid devices 
per pool, though still one OST per pool and one OST per OSS. In this 
config we only have one distributed spare per draid. For significant 
write-performance reasons we also (reluctantly) started spanning pools 
across 2 JBODs. Another difference is that these systems have much less 
NVMe capacity, with just one small pair of NVMe drives per enclosure, so 
we configure them as special devices for pool metadata rather than for 
small-block storage. The config for one of those pools looks like the 
following (a creation sketch follows it):

	NAME                   STATE     READ WRITE CKSUM
	merced239              ONLINE       0     0     0
	  draid2:11d:53c:1s-0  ONLINE       0     0     0
	    L0                 ONLINE       0     0     0
	    L2                 ONLINE       0     0     0
	    L4                 ONLINE       0     0     0
...
	    L100               ONLINE       0     0     0
	    L102               ONLINE       0     0     0
	    L104               ONLINE       0     0     0
	  draid2:11d:53c:1s-1  ONLINE       0     0     0
	    U1                 ONLINE       0     0     0
	    U3                 ONLINE       0     0     0
	    U5                 ONLINE       0     0     0
...
	    U101               ONLINE       0     0     0
	    U103               ONLINE       0     0     0
	    U105               ONLINE       0     0     0
	special
	  mirror-2             ONLINE       0     0     0
	    N2                 ONLINE       0     0     0
	    N3                 ONLINE       0     0     0
	spares
	  draid2-0-0           AVAIL
	  draid2-1-0           AVAIL
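
For completeness, a sketch of how this two-vdev layout might be created, 
again with placeholder names (the L*/U* aliases stand in for slots in 
the two enclosures). Leaving special_small_blocks at its default of 0 
keeps the lone NVMe mirror metadata-only:

	# Hypothetical sketch: two draid2 vdevs (11 data disks, 53 children,
	# 1 distributed spare each), one per JBOD, plus a single NVMe
	# special mirror used only for metadata
	zpool create merced239 \
	    draid2:11d:53c:1s /dev/disk/by-vdev/L{0..104..2} \
	    draid2:11d:53c:1s /dev/disk/by-vdev/U{1..105..2} \
	    special mirror /dev/disk/by-vdev/N2 /dev/disk/by-vdev/N3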

Hope this helps,
Cameron

On 2/6/25 11:29 AM, Nehring, Shane R [ITS] wrote:
> Hello All,
>
> I didn't want to hijack the other thread today about draid, but I have been
> meaning to ask questions about it and folks' experience with it in the context
> of Lustre. Most of my questions come from not having a chance to really play
> around with draid much.
>
> Have you been generally satisfied with performance of a single draid vdev vs
> either multiple pools/osts per node or single osts on a pool spanning multiple
> raidz(2) vdev members? Is random io comparable to a span of raidz2 vdevs? I know
> one of the pain points (more from a space usage perspective as I understand it)
> is the fixed stripe width and how that impacts small files, but does small file
> io perform particularly badly on draid vs a span of raidz2 vdevs?
>
> I've got hardware on order (a couple of 60-bay JBODs and heads) that's going to
> replace some of the older OSTs in our current volume and I'm leaning toward a
> single draid pool OST per OSS. I plan to do some benchmarking of the pools in
> various configurations, but it's hard to generate a benchmark that's actually
> representative of real world usage.
>
> If you've got any insights or anecdotes regarding your experience with draid and
> Lustre I'd love to hear them!
>
> Thanks,
> Shane
>