[lustre-discuss] Switching server LNDs / NIDs and updating clients to use two LNDs on a single IP

Jesse Stroik jesse.stroik at ssec.wisc.edu
Wed Jan 8 11:01:29 PST 2025


Hi Lustre users,

I'm looking for a bit of a sanity check here before I go down this path.

I've been dealing with a communication problem over LNet that triggers under some conditions for one of our clusters after upgrading. I thought we'd solved it by disabling LNet Multi-Rail, but that doesn't appear to be the case. Here's the report:

https://jira.whamcloud.com/browse/LU-18534

I'd like to try switching from ko2iblnd to ksocklnd. These are the data/scratch file systems for this cluster. The cluster also accesses other file systems that are more widely used, are working normally, and will continue using @o2ib, so I will need to set up the cluster clients with both @o2ib and @tcp NIDs on the same InfiniBand devices.
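For what it's worth, here is a sketch of what that dual-NID client configuration might look like with lnetctl. The interface name ib0 and the IP are illustrative, not from any actual config:

```shell
# load LNet and bring up the configuration layer
modprobe lnet
lnetctl lnet configure

# o2ib (ko2iblnd) for the file systems that keep using RDMA
lnetctl net add --net o2ib --if ib0

# tcp (ksocklnd) over IPoIB on the same interface
lnetctl net add --net tcp --if ib0

# confirm both NIDs exist, e.g. 172.16.23.100@o2ib and 172.16.23.100@tcp
lnetctl net show
```

To make this persistent across reboots, the running configuration can be exported with `lnetctl export > /etc/lnet.conf` so the lnet service reapplies it at boot.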

Here is what I'm thinking of doing:

- set up the cluster nodes with two NIDs using the same IP (e.g. 172.16.23.100@o2ib and 172.16.23.100@tcp)
- change the NIDs of the scratch and data file system configurations to @tcp by using replace_nids on the MGS
- let the clients continue mounting other lustre file systems via @o2ib but update them to access scratch and data via @tcp NIDs.
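A hedged sketch of steps two and three, using illustrative fsname and target names (scratch, scratch-MDT0000, etc. are placeholders): replace_nids is run from the MGS while the affected targets are stopped.

```shell
# on the MGS, with the scratch targets unmounted:
lctl replace_nids scratch-MDT0000 172.16.23.100@tcp
lctl replace_nids scratch-OST0000 172.16.23.101@tcp

# clients would then mount via the @tcp NID of the MGS:
mount -t lustre 172.16.23.100@tcp:/scratch /mnt/scratch
```

The other file systems' mounts would keep their existing @o2ib MGS NIDs and be unaffected.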

Does this sound like something that should work or is it not worth attempting?

Thanks,
Jesse

