[Lustre-discuss] [Fwd: [ofa-general] IPoIB Transmit Timeouts]
Charles A. Taylor
taylor at hpc.ufl.edu
Mon Aug 17 09:23:35 PDT 2009
FWIW, I posted this to ofa-general a little earlier. Anyone else
seeing this? Suggestions? I think this is an OFED 1.4.1 problem
but they may point the finger at you guys. :)
We've tried limiting OST threads to no avail. It doesn't seem to
require a heavy load to trigger it; the timeouts appear more or less
at random.
Charlie Taylor
UF HPC Center
-------- Forwarded Message --------
From: Charles A. Taylor <taylor at hpc.ufl.edu>
To: general at lists.openfabrics.org
Cc: Craig Prescott <prescott at hpc.ufl.edu>
Subject: [ofa-general] IPoIB Transmit Timeouts
Date: Mon, 17 Aug 2009 12:10:25 -0400
We upgraded our file servers to OFED 1.4.1 last Thursday and have since
been hit with a daily ration of the following across all eight of our
servers...
Aug 17 09:46:59 hpcio8 kernel: NETDEV WATCHDOG: ib1: transmit timed out
Aug 17 09:46:59 hpcio8 kernel: ib1: transmit timeout: latency 347449 msecs
Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, tx_head 868165770, tx_tail 868165647
The difference between tx_head and tx_tail is always 123 (here,
868165770 - 868165647). The send queue size is 128 according to...
cat /sys/module/ib_ipoib/parameters/send_queue_size
128
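For anyone who wants to check the head/tail gap across a whole log rather than by eye, a rough sketch along these lines might help. The function name outstanding_sends is made up for illustration, and the modulo 2**32 is an assumption that the counters are 32-bit unsigned values, which the log output itself does not confirm.

```python
import re

# Matches the "queue stopped" line format shown in the log excerpt above.
QUEUE_RE = re.compile(
    r"(ib\d+): queue stopped (\d+), tx_head (\d+), tx_tail (\d+)"
)

def outstanding_sends(log_text):
    """Return (interface, tx_head - tx_tail) for each watchdog report.

    Hypothetical helper, not part of OFED or Lustre.
    """
    # Collapse whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    results = []
    for m in QUEUE_RE.finditer(flat):
        head, tail = int(m.group(3)), int(m.group(4))
        # Assumption: counters are 32-bit unsigned and may wrap around.
        results.append((m.group(1), (head - tail) % 2**32))
    return results

sample = """Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, tx_head 868165770,
tx_tail 868165647"""

for iface, depth in outstanding_sends(sample):
    print(iface, depth)
```

For the entry above this gives a gap of 123, which is 5 short of the send_queue_size of 128.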
From the post below, others seem to have encountered this but we have
not seen any patches or work-arounds. Has anyone solved this problem?
The servers were very stable under OFED 1.2. We are running the
Lustre-patched kernel, but we did that under OFED 1.2 + Lustre 1.6.4.2
as well, and I'm pretty sure the Lustre patches don't touch the IB
modules.
Relevant information:
=====================
CentOS 5.3
Lustre 1.8.0.1
2.6.18-128.1.6.el5_lustre.1.8.0.1smp
X86_64 (Opteron 275s)
hca_id: mthca0
    fw_ver:          4.8.200
    node_guid:       0005:ad00:0004:668c
    sys_image_guid:  0002:c900:0100:d050
    vendor_id:       0x02c9
    vendor_part_id:  25208
    hw_ver:          0xA0
    board_id:        MT_00A0000001
    phys_port_cnt:   2
        port: 1
            state:       active (4)
            max_mtu:     2048 (4)
            active_mtu:  2048 (4)
            sm_lid:      1
            port_lid:    49
            port_lmc:    0x00
        port: 2
            state:       active (4)
            max_mtu:     2048 (4)
            active_mtu:  2048 (4)
            sm_lid:      1
            port_lid:    98
            port_lmc:    0x00
Charlie Taylor
UF HPC Center
> On Wed, Jul 29, 2009 at 2:14 PM, Pradeep Satyanarayana <
> prade... at linux.vnet.ibm.com> wrote:
>
> > Hal Rosenstock wrote:
> > > Hi,
> > >
> > > I'm seeing the following messages from IPoIB:
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > NETDEV WATCHDOG: ib0: transmit timed out
> > > ib0: transmit timeout: latency 1374 msecs
> > > ib0: queue stopped 1, tx_head 140245691, tx_tail 140245565
> > >
> > > What are the possible (and most likely) causes of post_send failures ? I
> > > went through the code for all the errors (some at the driver level) but
> > > none popped out at me.
> > >
> >
> > Is it possible that the receiver is overwhelmed and hence the tx_ring is
> > full?
>