[Lustre-devel] replacing Lustre pings with LNet Peer Health
Nic Henke
nic at cray.com
Tue May 17 07:27:43 PDT 2011
On 05/12/2011 12:27 PM, Andreas Dilger wrote:
> On May 12, 2011, at 08:57, Nic Henke wrote:
>> Just floating an idea... I'd much appreciate any feedback
>>
> One issue is that the Lustre OBD_PING RPC is not just detecting peer
> death. It is also reporting the last_committed value to the RPC
> stack, so that clients can discard RPCs that were committed on the
> server. It is also signalling to the server that this client is
> still alive, so that it doesn't get evicted. If there are LNET
> routers in a system, the LNET peer health will only report the health
> of the routers, and not of the clients or servers behind the routers,
> so this isn't going to result in a working Lustre filesystem...
>
Good point, I had missed this. Pesky "working" filesystems...
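For anyone skimming the thread, the three jobs Andreas lists for OBD_PING can be sketched as a toy model. This is only an illustration in Python, not Lustre code: `ToyServer`, `ToyClient`, `handle_ping`, `evict_stale` and `replay_queue` are hypothetical names invented here, and the transaction numbers and timeout are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Hypothetical stand-in for a Lustre target; not real Lustre code."""
    last_committed: int = 0
    # client id -> time of the most recent ping from that client
    last_seen: dict = field(default_factory=dict)

    def handle_ping(self, client_id, now):
        # The ping signals that this client is still alive.
        self.last_seen[client_id] = now
        # The reply carries last_committed back to the client.
        return self.last_committed

    def evict_stale(self, now, timeout):
        # Clients that have not pinged within `timeout` are evicted.
        stale = [c for c, t in self.last_seen.items() if now - t > timeout]
        for c in stale:
            del self.last_seen[c]
        return stale


@dataclass
class ToyClient:
    client_id: str
    # transaction numbers of RPCs kept until the server commits them
    replay_queue: list = field(default_factory=list)

    def ping(self, server, now):
        # A reply at all means the peer is not dead.
        committed = server.handle_ping(self.client_id, now)
        # Discard saved RPCs that the server has already committed.
        self.replay_queue = [t for t in self.replay_queue if t > committed]
        return committed
```

A health signal that only says "the next hop is up" covers none of the bookkeeping in `handle_ping` or `ping`, which is the gap pointed out above.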
>> Eric - I know this doesn't get us that far down the road toward
>> your new health network, but does solve a near term issue with
>> pinger rates on large systems.
>
> There would need to be at least some of the health network
> implemented in order to "pass through" the peer health on the
> routers, and also to broadcast some of the data, like last_rcvd.
Yeah, not sure how I thinko'd the LNet Router case. We'd need to add
.lnd_notify into the LNDs and have them broadcast the failures at the
router level. Not exactly ideal, and I think the use of lnd_notify has
been dropped in favor of the newer LNet Peer Health.
Cheers,
Nic