[Lustre-devel] replacing Lustre pings with LNet Peer Health

Nic Henke nic at cray.com
Tue May 17 07:27:43 PDT 2011


On 05/12/2011 12:27 PM, Andreas Dilger wrote:
> On May 12, 2011, at 08:57, Nic Henke wrote:
>> Just floating an idea... I'd much appreciate any feedback
>>

> One issue is that the Lustre OBD_PING RPC is not just detecting peer
> death.  It is also reporting the last_committed value to the RPC
> stack, so that clients can discard RPCs that were committed on the
> server.  It is also signalling to the server that this client is
> still alive, so that it doesn't get evicted.  If there are LNET
> routers in a system, the LNET peer health will only report the health
> of the routers, and not of the clients or servers behind the routers,
> so this isn't going to result in a working Lustre filesystem...
>

Good point, I had missed this. Pesky "working" filesystems...
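
For reference, the jobs Andreas lists for the ping can be modelled as a toy
sketch. This is not Lustre code: the class and method names (Client, Server,
handle_ping, evict_stale) are hypothetical, and the logic only illustrates
why peer-death detection alone would not replace OBD_PING.

```python
from dataclasses import dataclass, field


@dataclass
class Client:
    # Hypothetical replay queue: transaction number -> saved request.
    replay_queue: dict = field(default_factory=dict)

    def handle_ping_reply(self, last_committed):
        # The reply carries last_committed, so requests at or below it
        # are already on stable storage and can be discarded.
        for transno in [t for t in self.replay_queue if t <= last_committed]:
            del self.replay_queue[transno]


@dataclass
class Server:
    last_committed: int = 0
    # Hypothetical liveness table: client id -> time of last ping.
    last_heard: dict = field(default_factory=dict)

    def handle_ping(self, client_id, now):
        # The ping tells the server this client is still alive.
        self.last_heard[client_id] = now
        # Getting any reply at all tells the client the server is alive.
        return self.last_committed

    def evict_stale(self, now, timeout):
        # Clients that have been silent longer than `timeout` get evicted.
        stale = [c for c, t in self.last_heard.items() if now - t > timeout]
        for c in stale:
            del self.last_heard[c]
        return stale
```

Dropping the ping removes both the last_committed update and the
server-side liveness record, so something else would have to carry them.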

>> Eric - I know this doesn't get us that far down the road toward
>> your new health network, but does solve a near term issue with
>> pinger rates on large systems.
>
> There would need to be at least some of the health network
> implemented in order to "pass through" the peer health on the
> routers, and also to broadcast some of the data, like last_rcvd.

Yeah, not sure how I thinko'd the LNet Router case. We'd need to add 
.lnd_notify into the LNDs and have them broadcast the failures at the 
router level. Not exactly ideal, and I think the use of lnd_notify has 
been dropped in favor of the newer LNet Peer Health.
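
The routed case can be shown with a similar toy sketch. The assumption here
(taken from Andreas's description, not from the LNet source) is that peer
health only covers the first hop; the function names are hypothetical.

```python
def first_hop_health(path, alive):
    # What peer health would report in this model: only the state of
    # the directly connected peer, i.e. the router.
    return alive[path[0]]


def end_to_end_health(path, alive):
    # What Lustre actually needs to know: every node on the path,
    # including the far endpoint, is up.
    return all(alive[node] for node in path)


# A client reaching a server through one router, with the server down.
path = ["router", "server"]
alive = {"router": True, "server": False}
router_ok = first_hop_health(path, alive)    # router still looks healthy
server_ok = end_to_end_health(path, alive)   # but the server is gone
```

In this model the client keeps seeing a healthy peer while the server
behind the router is dead, which is why the failures would have to be
passed through or broadcast by the routers.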

Cheers,
Nic


