[Lustre-discuss] lustre errors when system stressed; bad hardware?
Isaac Huang
He.Huang at Sun.COM
Thu Aug 27 14:37:51 PDT 2009
On Wed, Aug 26, 2009 at 06:52:24PM -0700, Abe Ingersoll wrote:
> ......
> kiblnd_tx_complete()) Tx -> 10.168.22.104@o2ib cookie 0xc8dd6 sending 1
> waiting 1: failed 12
12 == IB_WC_RETRY_EXC_ERR, which usually indicates faulty links in the
network, or some other application (such as an MPI application) hogging
network resources at Lustre's expense. We once observed such
errors at times when there was no I/O at all: a bad MPI implementation was
resending so aggressively upon RNR that even the tiny bit of
keepalive traffic from Lustre would end up with IB_WC_RETRY_EXC_ERR.
Diagnostics from OFED and the fabric should point you to any faulty
hardware, and setting up IB QoS should prevent Lustre from being hurt
badly by other traffic.
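
For anyone scanning logs for the same pattern, here is a minimal sketch that maps the number after "failed" in a kiblnd_tx_complete() message to its ib_wc_status name. The helper name decode_tx_failure and the regular expression are my own illustration, not part of Lustre or OFED, and the status table is the kernel's enum ib_wc_status order as I recall it from include/rdma/ib_verbs.h, so please check it against the headers of the kernel you actually run.

```python
import re

# Assumption: declaration order of the kernel's enum ib_wc_status
# (include/rdma/ib_verbs.h), so list index == numeric status value.
IB_WC_STATUS = [
    "IB_WC_SUCCESS", "IB_WC_LOC_LEN_ERR", "IB_WC_LOC_QP_OP_ERR",
    "IB_WC_LOC_EEC_OP_ERR", "IB_WC_LOC_PROT_ERR", "IB_WC_WR_FLUSH_ERR",
    "IB_WC_MW_BIND_ERR", "IB_WC_BAD_RESP_ERR", "IB_WC_LOC_ACCESS_ERR",
    "IB_WC_REM_INV_REQ_ERR", "IB_WC_REM_ACCESS_ERR", "IB_WC_REM_OP_ERR",
    "IB_WC_RETRY_EXC_ERR", "IB_WC_RNR_RETRY_EXC_ERR",
    "IB_WC_LOC_RDD_VIOL_ERR", "IB_WC_REM_INV_RD_REQ_ERR",
    "IB_WC_REM_ABORT_ERR", "IB_WC_INV_EECN_ERR", "IB_WC_INV_EEC_STATE_ERR",
    "IB_WC_FATAL_ERR", "IB_WC_RESP_TIMEOUT_ERR", "IB_WC_GENERAL_ERR",
]

# Hypothetical pattern for the message quoted in this thread. It accepts
# both "addr@net" and the list archive's "addr at net" rendering of the NID.
TX_FAILED = re.compile(
    r"Tx -> (?P<addr>\S+?)(?:@| at )(?P<net>\S+) cookie (?P<cookie>0x[0-9a-fA-F]+)"
    r" sending (?P<sending>\d+) waiting (?P<waiting>\d+): failed (?P<code>\d+)"
)


def decode_tx_failure(text):
    """Return details of a failed Tx message, or None if text has none."""
    # Drop mail quote markers ("> ") and rejoin lines wrapped by the mailer.
    lines = [re.sub(r"^(?:>\s?)+", "", ln) for ln in text.splitlines()]
    flat = " ".join(" ".join(lines).split())
    m = TX_FAILED.search(flat)
    if m is None:
        return None
    code = int(m.group("code"))
    known = 0 <= code < len(IB_WC_STATUS)
    return {
        "peer": m.group("addr") + "@" + m.group("net"),
        "cookie": m.group("cookie"),
        "code": code,
        "status": IB_WC_STATUS[code] if known else "UNKNOWN",
    }


if __name__ == "__main__":
    sample = ("kiblnd_tx_complete()) Tx -> 10.168.22.104@o2ib cookie 0xc8dd6 "
              "sending 1 waiting 1: failed 12")
    print(decode_tx_failure(sample))
```

Counting the decoded failures per peer NID is one quick way to see whether the errors cluster on a few nodes (which would suggest a bad link) or are spread across the fabric (which would suggest congestion).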
Meanwhile, there's a potential workaround mentioned here:
https://bugzilla.lustre.org/show_bug.cgi?id=14223#c36
But it's certainly not a good solution in the long run.
Thanks,
Isaac