[Lustre-discuss] Network Package loss

Isaac Huang He.Huang at Sun.COM
Mon Nov 9 17:41:27 PST 2009


On Mon, Nov 09, 2009 at 02:48:34PM +0100, Heiko Schröter wrote:
> Hello,
> 
> we encounter peaks of up to 30% packet loss in our Gigabit network.

It would be helpful if you could elaborate on how the 30% figure was
measured.

> This is sporadic, say once every hour, lasting a few seconds. We cannot say whether it extends into minutes.
> We relate this to a very high peak load on the network.
> 
> Could the Lustre 'reconnect' messages or 'lnet_try_match_md()' errors be correlated with this?

I'm not sure which 'reconnect' messages you mean, but reconnection
attempts are usually rate-limited and backed off exponentially, so I'd
be surprised if they were overwhelming the network.
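
To illustrate why exponentially backed-off retries add little load,
here is a minimal shell sketch. It is not Lustre's reconnect code: the
function name backoff_delays and the base delay, cap, and attempt count
are hypothetical values chosen for the example.

```shell
# Illustration only, not Lustre's implementation.
# Print the wait (in seconds) before each retry: the delay doubles
# after every attempt until it reaches the cap.
# Arguments: <base delay> <max delay> <number of attempts>
backoff_delays() {
    delay=$1
    max=$2
    attempts=$3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        echo "$delay"
        delay=$((delay * 2))
        if [ "$delay" -gt "$max" ]; then
            delay=$max
        fi
        i=$((i + 1))
    done
}
```

With these made-up numbers, "backoff_delays 1 16 6" prints 1, 2, 4, 8,
16, 16 (one per line), so six attempts are spread over 47 seconds
instead of being sent back to back.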

The 'lnet_try_match_md()' errors are usually caused by buffer
management problems in Lustre services, which result in incoming
messages being dropped. If the other end resends those messages
aggressively, that could be a problem, but so far there is too little
information to tell.

> i.e. the MDS has problems matching information between the OSTs and the MGS ...
> What happens inside Lustre when it runs into the famous 'packet loss' on the network? (Are there any timeout/retry counters?)

Usually packet loss is handled by TCP. If you enable network error
console logging, you will see errors once TCP has given up
retransmitting:

    echo +neterror > /proc/sys/lnet/printk
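
Independently of Lustre, the kernel's cumulative TCP counters give a
rough view of how much retransmission is going on. A minimal sketch,
assuming a Linux host where /proc/net/snmp exposes the OutSegs and
RetransSegs columns; the function name tcp_retrans_pct is hypothetical.

```shell
# Print retransmitted TCP segments as a percentage of all segments
# sent, using the two "Tcp:" lines (header, then values) of an
# snmp-style file. Defaults to /proc/net/snmp; a different file can be
# passed as the first argument.
# Usage: tcp_retrans_pct
tcp_retrans_pct() {
    awk '
        # First "Tcp:" line: remember the position of each column name.
        $1 == "Tcp:" && !seen {
            for (i = 2; i <= NF; i++) col[$i] = i
            seen = 1
            next
        }
        # Second "Tcp:" line: look up the two counters by name.
        $1 == "Tcp:" {
            out = $(col["OutSegs"])
            re = $(col["RetransSegs"])
            if (out > 0) printf "%.2f\n", 100 * re / out
            else print "0.00"
        }
    ' "${1:-/proc/net/snmp}"
}
```

The counters are totals since boot, so the percentage is a long-term
average. To catch a sporadic peak, note the raw values before and after
one and compare the differences.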

Thanks,
Isaac


