[lustre-devel] Apparent bug in LU-8130 ptlrpc: convert conn_hash to rhashtable / Linux-commit: ac2370ac2bc5215daf78546cd8d925510065bb7f
paf at cray.com
Tue Nov 6 09:38:14 PST 2018
Sent this earlier, but with nasty formatting that got it rejected. Whoops.
It looks like the patch we landed as:
    LU-8130 ptlrpc: convert conn_hash to rhashtable
    Linux has a resizeable hashtable implementation in lib,
    so we should use that instead of having one in libcfs.
    This patch converts the ptlrpc conn_hash to use rhashtable.
    In the process we gain lockless lookup.
    As connections are never deleted until the hash table is destroyed,
    there is no need to count the reference in the hash table. There
    is also no need to enable automatic_shrinking.
introduced a bug. Ihara-san opened LU-11624 to track it.
It’s a null pointer dereference in nid_hash(); there are some more details in that ticket.
We’re seeing it at Cray as well, when testing the current Whamcloud branch.
Basically, when we fail over an MDS (or MDT) under load (i.e., with real activity on the file system), we currently hit this panic about 30-50% of the time. I assume it’s possible on OSSes as well, but we haven’t seen it there.
I haven’t done any detailed investigation, but I thought I’d bring it to your attention. Per Ihara-san in LU-11624, the crash does not happen without the commit listed above.