[lustre-discuss] OSTs per OSS with ZFS

Dilger, Andreas andreas.dilger at intel.com
Thu Jul 6 16:01:25 PDT 2017


On Jul 6, 2017, at 15:43, Nathan R.M. Crawford <nrcrawfo at uci.edu> wrote:
> 
> On a somewhat-related question, what are the expected trade-offs when splitting the "striping" between ZFS (striping over vdevs) and Lustre (striping over OSTs)? 
> 
> Specific example: if one has an OSS with 40 disks and intends to use 10-disk raidz2 vdevs, how do these options compare?:
> 
> A) 4 OSTs, each on a zpool with a single raidz2 vdev,
> B) 2 OSTs, each on a zpool with two vdevs, and
> C) 1 OST, on a zpool with 4 vdevs?
> 
>   I've done some simple testing with obdfilter-survey and multiple-client file operations on some actual user data, and am leaning more toward "A". However, the differences weren't overwhelming, and I am probably neglecting some important corner cases. Handling striping pattern at the Lustre level (A) also allows tuning on a per-file basis.
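
On the testing itself: obdfilter-survey from lustre-iokit is a reasonable way to compare such layouts from the OSS side, without client or network effects. A minimal sketch, with object/thread counts that are only illustrative:

    # run on the OSS against the locally mounted OSTs (case=disk);
    # "size" is in MB; the object/thread limits are examples only
    nobjhi=4 thrhi=64 size=8192 case=disk obdfilter-survey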

As long as your filesystem isn't going to have so many OSTs that it runs into scaling limits (over 2000 OSTs currently), it is typically true that having more independent OSTs (case A) gives somewhat better aggregate throughput when driven by many clients/threads, because the ZFS transaction commits are independent.  In case C, by contrast, every transaction has all of the disks waiting for the one disk that is slightly more busy/slow than the others, akin to jitter in parallel compute jobs.  Also, more independent OSTs reduce the amount of data lost in a catastrophic failure of one OST.
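
For concreteness, cases A and C map onto pool layouts roughly like the following, with each pool then formatted as one OST; the pool, dataset, filesystem, and device names are placeholders:

    # Case A: four pools, each a single 10-disk raidz2 vdev
    # (repeat for ost1..ost3 with the remaining disks)
    zpool create ost0 raidz2 /dev/disk/by-vdev/d{0..9}
    mkfs.lustre --ost --backfstype=zfs --fsname=testfs --index=0 \
        --mgsnode=mgs@o2ib ost0/ost0

    # Case C: one pool with four raidz2 vdevs, formatted as a single OST
    zpool create ost0 \
        raidz2 /dev/disk/by-vdev/d{0..9} \
        raidz2 /dev/disk/by-vdev/d{10..19} \
        raidz2 /dev/disk/by-vdev/d{20..29} \
        raidz2 /dev/disk/by-vdev/d{30..39}
    mkfs.lustre --ost --backfstype=zfs --fsname=testfs --index=0 \
        --mgsnode=mgs@o2ib ost0/ost0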

That said, there are also drawbacks to having more OSTs as in case A.  It fragments your free space, and with a large number of clients the filesystem has to be more conservative in its space allocation as it fills, since the free space is split across 4x as many OSTs as in case C.  Because each OST is smaller, you are also more likely to run out of space on any single OST.  Finally, a single-stripe file cannot get as good performance from 4 separate OSTs as it can from one single large OST, assuming you are not already limited by the OSS network bandwidth (in which case you may as well just go with case C, because the "extra performance" is unusable and you are just adding configuration/maintenance overhead).
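
With case A, striping at the Lustre level can still spread a large file across all four OSTs to recover the aggregate bandwidth; a minimal sketch, with an illustrative directory and stripe settings:

    # new files created under this directory get 4 stripes of 4 MiB each
    lfs setstripe -c 4 -S 4M /mnt/testfs/scratch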

Having at least 3 VDEVs in a single pool also slightly improves ZFS space efficiency and robustness, and fewer/larger OSTs reduce configuration management complexity and admin overhead, so if the performance is roughly the same I'd be inclined toward fewer/larger OSTs.  If the performance is dramatically different and your system is not already very large, then the added OSTs may be worthwhile.

Cheers, Andreas

> On Mon, Jul 3, 2017 at 1:15 AM, Dilger, Andreas <andreas.dilger at intel.com> wrote:
>> We have seen performance improvements with multiple zpools/OSTs per OSS. However, with only 5x NVMe devices per OSS you don't have many choices in terms of redundancy, unless you are not using any redundancy at all, just raw bandwidth?
>> 
>> The other thing to consider is how the network bandwidth compares to the NVMe bandwidth.  With similar test systems using NVMe devices without redundancy we've seen multiple GB/s, so if you aren't using an OPA/IB network then the network will likely be your bottleneck.  Even if TCP is fast enough, the CPU overhead and data copies will probably kill the performance.
>> 
>> In the end, you can probably test a few configs to see which one gives the best performance: a mirror, a single RAID-Z, two RAID-Z pools on half-sized partitions, five no-redundancy zpools with one VDEV each, or a single no-redundancy zpool with five VDEVs (a sketch of two of these layouts follows the quoted thread below).
>> 
>> Cheers, Andreas
>> 
>> PS - there is initial snapshot functionality in the 2.10 release.
>> 
>> > On Jul 2, 2017, at 10:07, Brian Andrus <toomuchit at gmail.com> wrote:
>> >
>> > All,
>> >
>> > We have been having some discussion about the best practices when creating OSTs with ZFS.
>> >
>> > The basic question is: What is the best ratio of OSTs per OSS when using ZFS?
>> > It is easy enough to do a single OST with all disks and have reliable data protection provided by ZFS. It may be an even better option once snapshots of Lustre filesystems become a feature as well.
>> >
>> > However, multiple OSTs can mean more stripes and faster reads/writes. I have seen some tests done quite some time ago that may no longer be valid with more recent Lustre updates.
>> >
>> > We have a test system where each OSS has 5 NVMe drives. We can build one ZFS file system from all of them, or separate them into 5 pools (which would forgo some of the features of ZFS).
>> >
>> > Any prior experience/knowledge/suggestions would be appreciated.
>> >
>> > Brian Andrus
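
For the NVMe test configurations mentioned above, two of the candidate layouts would look roughly like this; device names are placeholders:

    # single RAID-Z pool over all five NVMe devices
    zpool create ostpool raidz /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1

    # single no-redundancy pool with five independent vdevs (pure striping)
    zpool create ostpool /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1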

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation

More information about the lustre-discuss mailing list