[Lustre-devel] replacing Lustre pings with LNet Peer Health

Nic Henke nic at cray.com
Tue May 17 07:30:08 PDT 2011


On 05/12/2011 12:37 PM, Christopher J. Morrone wrote:
> I think Eric's approach is the only sane way I've heard to reduce pings.
>
> Here are some issues that I see with this:
>
> 1)  For your solution to work, you require that the lnet layer take on
> pinging duties.  Usually the network, be it IB, TCP, whatever, will not
> provide any active notification of a peer failure.  To notice that a
> peer has died, the lnet LND must, you guessed it, ping.
>

Correct. I had assumed the LNDs would, or at least could, do the pinging. At
worst it would be done per peer rather than per import, which reduces the
traffic somewhat. It would also reduce the number of layers involved in
receiving a message, which saves some CPU.
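
To make the per-peer vs. per-import difference concrete, here is a rough sketch. It is not Lustre code: the `Import` tuple, the target names, and the NIDs are made up for illustration, and it assumes one ping per import (or per peer) per ping interval.

```python
from collections import namedtuple

# Hypothetical stand-in for a client import: one per target, each
# reachable through some peer NID. Not a real Lustre structure.
Import = namedtuple("Import", ["target", "peer_nid"])

def pings_per_interval(imports):
    """Return (per_import, per_peer) ping counts for one interval."""
    per_import = len(imports)                         # one ping per import
    per_peer = len({i.peer_nid for i in imports})     # one ping per distinct peer
    return per_import, per_peer

# Made-up example: two OSTs share one server node.
imports = [
    Import("OST0000", "10.0.0.1@o2ib"),
    Import("OST0001", "10.0.0.1@o2ib"),
    Import("OST0002", "10.0.0.2@o2ib"),
    Import("MDT0000", "10.0.0.3@o2ib"),
]
```

The saving only shows up when several targets sit behind the same peer; with one target per server the two counts are equal.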

> Usually the LNDs try to be smart.  They only generate their own pings if
> no traffic has been sent to the peer in a certain period of time.  So
> once you eliminate the higher-level pings, they will partly be replaced
> by lower-level pings.

Correct, and I thought that would be sufficient to provide reasonable notification.
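
The "only ping if idle" behaviour described above can be sketched roughly as follows. This is an illustration under my own assumptions, not how any particular LND implements it: the `PeerKeepalive` class, its method names, and the single `idle_timeout` parameter are all hypothetical.

```python
class PeerKeepalive:
    """Toy model of idle-triggered keepalives (hypothetical, not an LND API)."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout
        self.last_tx = {}  # peer NID -> time of last message sent to it

    def note_tx(self, nid, now):
        # Any regular traffic to the peer resets its idle timer.
        self.last_tx[nid] = now

    def peers_to_ping(self, now):
        # Only peers with no traffic for idle_timeout need a keepalive.
        due = sorted(nid for nid, t in self.last_tx.items()
                     if now - t >= self.idle_timeout)
        for nid in due:
            # The keepalive itself counts as traffic to that peer.
            self.last_tx[nid] = now
        return due
```

A busy peer never shows up in `peers_to_ping()`, so removing the higher-level pings makes previously "busy" peers idle and shifts the ping load down to this layer.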

Given the LNet router case, I think this idea is a bit DOA... unless I 
find some sort of non-gross magic :-)

Cheers,
Nic


