[Lustre-discuss] lustre errors when system stressed; bad hardware?

Thu Aug 27 14:37:51 PDT 2009

On Wed, Aug 26, 2009 at 06:52:24PM -0700, Abe Ingersoll wrote:
>    ......
>    kiblnd_tx_complete()) Tx -> 10.168.22.104@o2ib cookie 0xc8dd6 sending 1
>    waiting 1: failed 12

12 == IB_WC_RETRY_EXC_ERR, which usually indicates either faulty links
in the network or some other application (such as an MPI application)
consuming network resources at Lustre's expense. We once saw such
errors at times when there was no I/O at all: a bad MPI implementation
was resending so aggressively upon RNR that even the small amount of
keepalive traffic from Lustre ended up with IB_WC_RETRY_EXC_ERR.
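
For reference, here is a minimal sketch of how the number after "failed"
can be mapped back to a name. It assumes the value is a plain index into
enum ib_wc_status as declared in the kernel's include/rdma/ib_verbs.h
(0 = IB_WC_SUCCESS), and that the message has the "failed N" form shown
in the quoted log. The helper name decode_tx_failure and the regular
expression are only illustrative; they are not part of Lustre or OFED.

```python
import re

# Names in the order of enum ib_wc_status (include/rdma/ib_verbs.h),
# so the list index equals the numeric status code.
IB_WC_STATUS = [
    "IB_WC_SUCCESS",
    "IB_WC_LOC_LEN_ERR",
    "IB_WC_LOC_QP_OP_ERR",
    "IB_WC_LOC_EEC_OP_ERR",
    "IB_WC_LOC_PROT_ERR",
    "IB_WC_WR_FLUSH_ERR",
    "IB_WC_MW_BIND_ERR",
    "IB_WC_BAD_RESP_ERR",
    "IB_WC_LOC_ACCESS_ERR",
    "IB_WC_REM_INV_REQ_ERR",
    "IB_WC_REM_ACCESS_ERR",
    "IB_WC_REM_OP_ERR",
    "IB_WC_RETRY_EXC_ERR",
    "IB_WC_RNR_RETRY_EXC_ERR",
    "IB_WC_LOC_RDD_VIOL_ERR",
    "IB_WC_REM_INV_RD_REQ_ERR",
    "IB_WC_REM_ABORT_ERR",
    "IB_WC_INV_EECN_ERR",
    "IB_WC_INV_EEC_STATE_ERR",
    "IB_WC_FATAL_ERR",
    "IB_WC_RESP_TIMEOUT_ERR",
    "IB_WC_GENERAL_ERR",
]

# Assumed message shape: "... sending 1 waiting 1: failed 12"
FAILED_RE = re.compile(r"failed (\d+)")


def decode_tx_failure(log_text):
    """Return (code, name) for the first 'failed N' in log_text, or None."""
    match = FAILED_RE.search(log_text)
    if match is None:
        return None
    code = int(match.group(1))
    # Codes beyond the table are reported as UNKNOWN rather than guessed.
    name = IB_WC_STATUS[code] if code < len(IB_WC_STATUS) else "UNKNOWN"
    return code, name


if __name__ == "__main__":
    line = ("kiblnd_tx_complete()) Tx -> 10.168.22.104@o2ib cookie 0xc8dd6 "
            "sending 1 waiting 1: failed 12")
    print(decode_tx_failure(line))
```

Note that 13 (IB_WC_RNR_RETRY_EXC_ERR) is a different status from 12, so
it is worth checking which of the two the log actually reports.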

Diagnostics from OFED and the fabric should point you to any faulty
hardware, and setting up IB QoS should keep other applications on the
fabric from hurting Lustre badly.

Meanwhile, there's a potential workaround mentioned here:
https://bugzilla.lustre.org/show_bug.cgi?id=14223#c36

But it's certainly not a good solution in the long run.

Thanks,
Isaac