[lustre-discuss] very slow mounts with OSS node down and peer discovery enabled

Thu Oct 26 08:45:10 PDT 2023

Hello,

Recently we had an OSS node down for an extended period with hardware problems. While the node was down, mounting Lustre on a client took an extremely long time to complete (20-30 minutes). Once the file system was mounted, all operations were normal and there was no noticeable impact from the absent node.

While the client is mounting, the client's debug log shows entries like this slowly going by:

00000020:00000080:87.0:1698333195.993098:0:3801046:0:(obd_config.c:1384:class_process_config()) processing cmd: cf005
00000020:00000080:87.0:1698333195.993099:0:3801046:0:(obd_config.c:1396:class_process_config()) adding mapping from uuid 10.1.2.3@o2ib to nid 0x500000abcd123 (10.1.2.4@o2ib)

and there is a "llog_process_th" kernel thread hanging in lnet_discover_peer_locked().

Peer discovery is enabled on our clients. Disabling it on a client makes the mount complete quickly. Also, once the down OSS was fixed and powered back on, mounting completed normally again.
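
For reference, here is a minimal sketch of how peer discovery can be toggled on a client at runtime with lnetctl before mounting. The run() wrapper, the DRY_RUN variable, and the function names are hypothetical conveniences for illustration, not part of Lustre. The mount line in the comment uses placeholders rather than our real values.

```shell
#!/bin/sh
# Sketch only: toggle LNet peer discovery on a client before mounting.
# run(), DRY_RUN, and the function names are hypothetical helpers.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"    # print the command instead of executing it
    else
        "$@"
    fi
}

# 0 disables dynamic peer discovery, 1 re-enables it (requires root).
set_discovery() {
    run lnetctl set discovery "$1"
}

# Show the global LNet settings, which include the current
# discovery and transaction_timeout values.
show_lnet_globals() {
    run lnetctl global show
}

# Example (placeholders, not real values):
#   set_discovery 0 && mount -t lustre <mgsnid>:/<fsname> /mnt/lustre
```

A setting changed this way applies to the running LNet instance only, so it has to be reapplied after the lnet module is reloaded or the client reboots.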

We also found that reducing the following timeout sped up the mount by a factor of ~10:

$ lnetctl set transaction_timeout 5    # was 50 originally

Is such a dramatic slowdown normal in this situation? Is there any fix (aside from disabling peer discovery or lowering the timeout) that could speed up mounts if another OSS goes down in the future?

Lustre version (server and client): 2.15.3

Thanks, 
Thomas Bertschinger