[Lustre-discuss] [Fwd: [ofa-general] IPoIB Transmit Timeouts]

Craig Prescott prescott at hpc.ufl.edu
Mon Aug 17 18:16:33 PDT 2009


Isaac Huang wrote:
> On Mon, Aug 17, 2009 at 12:23:35PM -0400, Charles A. Taylor wrote:
>> FWIW, I posted this to ofa-general a little earlier.   Anyone else
>> seeing this?    Suggestions?    I think this is an OFED 1.4.1 problem
>> but they may point the finger at you guys.  :)
>>
>> We've tried limiting OST threads to no avail.   It doesn't really seem
>> to require a heavy load to trigger it - more or less random.
> 
> I wouldn't think it's directly caused by Lustre. The IPoIB interface
> is only needed for address resolution - no Lustre traffic would end up
> sitting in the IPoIB interface's TX queue.

We are using a tcp NID on the (troubled) ib1 interfaces to reach our 
non-IB hosts.

We have o2ib NIDs on ib0 (dual-port HCA) to reach the 
InfiniBand-connected hosts on the same subnet.  No problems there.

> Have you tried to stress IPoIB, without Lustre running, with a TCP/IP 
> benchmark (e.g. Netperf, Iperf, NetPIPE) or simply a 'ping -f'?

We've tried to stress IPoIB with netperf TCP_STREAM on a spare OSS (same 
hardware, same connectivity) running the same Lustre kernel.  No trouble 
so far.

Cheers,
Craig Prescott
UF HPC Center


> Isaac
> 
>> ......
>> Aug 17 09:46:59 hpcio8 kernel: NETDEV WATCHDOG: ib1: transmit timed out
>> Aug 17 09:46:59 hpcio8 kernel: ib1: transmit timeout: latency 347449 msecs
>> Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, tx_head 868165770, tx_tail 868165647
>>
>> The difference between the head/tail is always 123.   The send queue
>> size is 128 according to...
>>
>> cat /sys/module/ib_ipoib/parameters/send_queue_size 
>> 128
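
For anyone who wants to check the head/tail arithmetic across many watchdog messages, here is a rough sketch that pulls tx_head and tx_tail out of a "queue stopped" line and reports how many sends are outstanding against the send queue size. The function name, the regular expression, and the 32-bit wraparound mask are assumptions made for illustration (the mask assumes the counters are unsigned 32-bit values); the log line and the queue size of 128 are taken from the output quoted above.

```python
import re

# Log line and queue size as quoted in this thread.
LOG_LINE = ("Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, "
            "tx_head 868165770, tx_tail 868165647")
SEND_QUEUE_SIZE = 128  # from /sys/module/ib_ipoib/parameters/send_queue_size

# Hypothetical pattern matching the "queue stopped" watchdog message.
PATTERN = re.compile(r"queue stopped (\d+), tx_head (\d+), tx_tail (\d+)")


def tx_outstanding(line):
    """Return tx_head - tx_tail for a watchdog line, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    _stopped, head, tail = (int(group) for group in match.groups())
    # Assumption: the counters are unsigned 32-bit, so mask the difference
    # to get a sensible result if tx_head has wrapped past zero.
    return (head - tail) & 0xFFFFFFFF


if __name__ == "__main__":
    outstanding = tx_outstanding(LOG_LINE)
    print("outstanding sends:", outstanding)
    print("free slots:", SEND_QUEUE_SIZE - outstanding)
```

Run over a whole syslog, this would show whether the gap really is 123 every time, as reported above, or only in the samples that happened to be pasted.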