[Lustre-discuss] [Fwd: [ofa-general] IPoIB Transmit Timeouts]

Isaac Huang He.Huang at Sun.COM
Mon Aug 17 15:36:24 PDT 2009


On Mon, Aug 17, 2009 at 12:23:35PM -0400, Charles A. Taylor wrote:
> FWIW, I posted this to ofa-general a little earlier.   Anyone else
> seeing this?    Suggestions?    I think this is an OFED 1.4.1 problem
> but they may point the finger at you guys.  :)
> 
> We've tried limiting OST threads to no avail.   It doesn't really seem
> to require a heavy load to trigger it - more or less random.

I wouldn't think it's directly caused by Lustre. The IPoIB interface
is only needed for address resolution - no Lustre traffic would end up
sitting in the IPoIB interface's TX queue.

Have you tried to stress IPoIB, without Lustre running, with a TCP/IP 
benchmark (e.g. Netperf, Iperf, NetPIPE) or simply a 'ping -f'?
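
A rough sketch of such a test run, in case it helps. PEER and IFACE are
placeholders for your own setup, the run() wrapper is only an illustrative
helper, and the durations and counts are arbitrary. By default it only
prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch: stress an IPoIB link with Lustre stopped.
# PEER and IFACE are assumptions; substitute your own values.
PEER=${PEER:-192.168.0.2}   # hypothetical IPoIB address of another node
IFACE=${IFACE:-ib1}         # IPoIB interface that logged the timeouts
DRY_RUN=${DRY_RUN:-1}       # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Flood ping (needs root), sent through the IPoIB interface.
run ping -f -c 1000000 -I "$IFACE" "$PEER"

# TCP throughput for 60 seconds; assumes 'iperf -s' is already running on PEER.
run iperf -c "$PEER" -t 60
```

While it runs, watching the kernel log for further "NETDEV WATCHDOG"
messages would show whether IPoIB alone can trigger the timeout.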

Isaac

> ......
> Aug 17 09:46:59 hpcio8 kernel: NETDEV WATCHDOG: ib1: transmit timed out
> Aug 17 09:46:59 hpcio8 kernel: ib1: transmit timeout: latency 347449 msecs
> Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, tx_head 868165770, tx_tail 868165647
> 
> The difference between the head/tail is always 123.   The send queue
> size is 128 according to...
> 
> cat /sys/module/ib_ipoib/parameters/send_queue_size 
> 128
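
The head/tail arithmetic in the quoted log can be checked with a small
shell sketch like the one below. The helper name tx_outstanding is made up
for illustration, and it assumes the counters have not wrapped around
between the tail and head values.

```shell
#!/bin/sh
# Sketch: number of sends still outstanding, taken from a "queue stopped" log line.
# tx_outstanding is a hypothetical helper, not part of OFED or the kernel.
tx_outstanding() {
    # $1: a log line containing "tx_head N, tx_tail M"
    tx_head=$(printf '%s\n' "$1" | sed -n 's/.*tx_head \([0-9][0-9]*\).*/\1/p')
    tx_tail=$(printf '%s\n' "$1" | sed -n 's/.*tx_tail \([0-9][0-9]*\).*/\1/p')
    # Assumes no counter wraparound between tail and head.
    echo $((tx_head - tx_tail))
}

line='ib1: queue stopped 1, tx_head 868165770, tx_tail 868165647'
tx_outstanding "$line"   # prints 123
```

With send_queue_size at 128, that leaves 5 entries unused at the moment the
watchdog fired.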


