[lustre-devel] Fwd: [PATCH 0/1] Rosebush, a new hash table

Thu Feb 22 13:04:34 PST 2024

Time to start replacing the use of rhashtable in the code? :-)

> Begin forwarded message:
> 
> From: "Matthew Wilcox (Oracle)" <willy at infradead.org>
> Subject: [PATCH 0/1] Rosebush, a new hash table
> Date: February 22, 2024 at 1:37:23 PM MST
> To: linux-kernel at vger.kernel.org
> Cc: "Matthew Wilcox (Oracle)" <willy at infradead.org>, Thomas Graf <tgraf at suug.ch>, Herbert Xu <herbert at gondor.apana.org.au>, netdev at vger.kernel.org, linux-fsdevel at vger.kernel.org, maple-tree at lists.infradead.org, rcu at vger.kernel.org
> 
> Rosebush is a resizing, scalable, cache-aware, RCU optimised hash table.
> I've written a load of documentation about how it works, mostly in
> Documentation/core-api/rosebush.rst but some is dotted through the
> rosebush.c file too.
> 
> You can see this code as a further exploration of the "Linked lists are
> evil" design space.  For the workloads which a hashtable is suited to,
> it has lower overhead than either the maple tree or the rhashtable.
> It cannot support ranges (so is not a replacement for the maple tree),
> but it does have per-bucket locks so is more scalable for write-heavy
> workloads.  I suspect one could reimplement the rhashtable API on top
> of the rosebush, but I am not interested in doing that work myself.
> 
> The per-object overhead is 12 bytes, as opposed to 16 bytes for the
> rhashtable and 32 bytes for the maple tree.  The constant overhead is also
> small, being just 16 bytes for the struct rosebush.  The exact amount
> of memory consumed for a given number of objects is going to depend on
> the distribution of hashes; here are some estimated consumptions for
> power-of-ten entries distributed evenly over a 32-bit hash space in the
> various data structures:
> 
> number	xarray	maple	rhash	rosebush
> 1	3472	272	280	256
> 10	32272	784	424	256
> 100	262kB	3600	1864	2080
> 1000	[1]	34576	17224	16432
> 10k	[1]	343k	168392	131344
> 100k	[1]	3.4M	1731272	2101264
> 
> As you can see, rosebush and rhashtable are close the whole way.
> Rosebush moves in larger chunks because it doubles each time; there's
> no actual need to double the bucket size, but that works well with
> the slab allocator's existing slabs.  As noted in the documentation,
> we could create our own slabs and get closer to the 12 bytes per object
> minimum consumption. [2]
> 
> Where I expect rosebush to shine is on dependent cache misses.
> I've assumed an average chain length of 10 for rhashtable in the above
> memory calculations.  That means on average a lookup would take five cache
> misses that can't be speculated.  Rosebush does a linear walk of 4-byte
> hashes looking for matches, so the CPU can usefully speculate the entire
> array of hash values (indeed, we tell it exactly how many we're going to
> look at) and then take a single cache miss fetching the correct pointer.
> Add that to the cache miss to fetch the bucket and that's just two cache
> misses rather than five.
> 
> I have not yet converted any code to use the rosebush.  The API is
> designed for use by the dcache, and I expect it will evolve as it actually
> gets used.  I think there's probably some more refactoring to be done.
> I am not aware of any bugs, but the test suite is pitifully small.
> The current algorithm grows the buckets more aggressively than the table;
> that's probably exactly the wrong thing to do for good performance.
> 
> This version is full of debugging printks.  You should probably take
> them out if you're going to try to benchmark it.  The regex '^print'
> should find them all.  Contributions welcome; I really want to get back
> to working on folios, but this felt like an urgent problem to be fixed.
> 
> [1] I stopped trying to estimate the memory costs for xarray; I couldn't
> be bothered to as it's not a serious competitor for this use case.
> 
> [2] We have ideas for improving the maple tree memory consumption for
> this kind of workload; a new node type for pivots that fit in 4 bytes and
> sparse nodes to avoid putting a NULL entry after each occupied entry.
> The maple tree really is optimised for densely packed ranges at the
> moment.
> 
> Matthew Wilcox (Oracle) (1):
>  rosebush: Add new data structure
> 
> Documentation/core-api/index.rst    |   1 +
> Documentation/core-api/rosebush.rst | 135 ++++++
> MAINTAINERS                         |   8 +
> include/linux/rosebush.h            |  41 ++
> lib/Kconfig.debug                   |   3 +
> lib/Makefile                        |   3 +-
> lib/rosebush.c                      | 707 ++++++++++++++++++++++++++++
> lib/test_rosebush.c                 | 135 ++++++
> 8 files changed, 1032 insertions(+), 1 deletion(-)
> create mode 100644 Documentation/core-api/rosebush.rst
> create mode 100644 include/linux/rosebush.h
> create mode 100644 lib/rosebush.c
> create mode 100644 lib/test_rosebush.c
> 
> --
> 2.43.0
> 
> 

Cheers, Andreas

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20240222/57d84ea4/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 873 bytes
Desc: Message signed with OpenPGP
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20240222/57d84ea4/attachment.sig>