[lustre-discuss] Switching server LNDs / NIDs and updating clients to use two LNDs on a single IP
Jesse Stroik
jesse.stroik at ssec.wisc.edu
Wed Jan 8 11:01:29 PST 2025
Hi Lustre users,
I'm looking for a bit of a sanity check here before I go down this path.
I've been dealing with a communication problem over LNet that triggers under certain conditions on one of our clusters after upgrading. I thought we'd solved it by disabling LNet multi-rail, but that doesn't appear to be the case. Here's the report:
https://jira.whamcloud.com/browse/LU-18534
I'd like to try switching from ko2iblnd to ksocklnd. These are the data/scratch file systems for this cluster, but the cluster also accesses other file systems that are more widely used, are working normally, and will continue using @o2ib. So I'll need to set up the cluster clients with both @o2ib and @tcp interfaces on the same InfiniBand devices.
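For what it's worth, a minimal sketch of the dual-LND client setup I have in mind, assuming ib0 is the IPoIB interface carrying 172.16.23.100 (interface name and address are just placeholders for our environment):

```shell
# Configure LNet with both LNDs on the same IPoIB interface (ib0 assumed).
lnetctl lnet configure
lnetctl net add --net o2ib --if ib0
lnetctl net add --net tcp --if ib0

# Verify that the node now has both NIDs, e.g.
# 172.16.23.100@o2ib and 172.16.23.100@tcp.
lnetctl net show
```

The equivalent static form would be a networks line like `options lnet networks="o2ib(ib0),tcp(ib0)"` in modprobe configuration, if that's preferable to lnetctl.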
Here is what I'm thinking of doing:
- set up the cluster nodes with two NIDs using the same IP (e.g. 172.16.23.100@o2ib and 172.16.23.100@tcp)
- change the NIDs of the scratch and data file system configurations to @tcp by using replace_nids on the MGS
- let the clients continue mounting the other Lustre file systems via @o2ib, but update them to access scratch and data via @tcp NIDs.
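Concretely, the last two steps would look something like the following. This is a sketch under assumptions: "scratch" as the fsname, a single server NID, and the usual replace_nids precondition that the targets are stopped while the MGS is running.

```shell
# On the MGS, with the scratch/data targets unmounted
# (fsname and target indices here are assumptions):
lctl replace_nids scratch-MDT0000 172.16.23.100@tcp
lctl replace_nids scratch-OST0000 172.16.23.100@tcp

# Clients then mount scratch via the @tcp NID, while the other
# file systems keep mounting via their existing @o2ib NIDs:
mount -t lustre 172.16.23.100@tcp:/scratch /mnt/scratch
```

My understanding is replace_nids rewrites the NIDs stored in the configuration logs, so the clients would pick up @tcp from the new mount spec while other file systems are unaffected, but I'd welcome correction on that.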
Does this sound like something that should work or is it not worth attempting?
Thanks,
Jesse