[lustre-devel] [PATCH 04/28] lustre: ptlrpc: Do not assert when bd_nob_transferred != 0

Sun Nov 4 13:29:47 PST 2018

> >> On Sun, Oct 14 2018, James Simmons wrote:
> >> 
> >> > From: Doug Oucharek <dougso at me.com>
> >> >
> >> > There is a case in the routine ptlrpc_register_bulk() where we were
> >> > asserting if bd_nob_transferred != 0 when not resending.  There is
> >> > evidence that network errors can create a situation where
> >> > this does happen. So we should not be asserting!
> >> >
> >> > This patch changes that assert to an error return code of -EIO.
> >> >
> >> > Signed-off-by: Doug Oucharek <dougso at me.com>
> >> > WC-bug-id: https://jira.whamcloud.com/browse/LU-9828
> >> > Reviewed-on: https://review.whamcloud.com/28491
> >> > Reviewed-by: Dmitry Eremin <dmitry.eremin at intel.com>
> >> > Reviewed-by: Sonia Sharma <sharmaso at whamcloud.com>
> >> > Reviewed-by: Oleg Drokin <green at whamcloud.com>
> >> > Signed-off-by: James Simmons <jsimmons at infradead.org>
> >> > ---
> >> >  drivers/staging/lustre/lustre/ptlrpc/niobuf.c | 8 ++++++--
> >> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >> >
> >> > diff --git a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> >> > index 27eb1c0..7e7db24 100644
> >> > --- a/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> >> > +++ b/drivers/staging/lustre/lustre/ptlrpc/niobuf.c
> >> > @@ -139,8 +139,12 @@ static int ptlrpc_register_bulk(struct ptlrpc_request *req)
> >> >  	/* cleanup the state of the bulk for it will be reused */
> >> >  	if (req->rq_resend || req->rq_send_state == LUSTRE_IMP_REPLAY)
> >> >  		desc->bd_nob_transferred = 0;
> >> > -	else
> >> > -		LASSERT(desc->bd_nob_transferred == 0);
> >> > +	else if (desc->bd_nob_transferred != 0)
> >> > +		/* If the network failed after an RPC was sent, this condition
> >> > +		 * could happen.  Rather than assert (was here before), return
> >> > +		 * an EIO error.
> >> > +		 */
> >> > +		return -EIO;
> >> 
> >> This looks weird, and the justification is rather lame.
> >> I wonder if this is an attempt to fix the same problem that the smp_mb()
> >> in the previous patch was attempting to fix (and I'm not yet convinced
> >> that either is the correct fix).
> >
> > When the above condition happens the LASSERT ends up taking out the 
> > node with a panic which in turn kills the application running on the cluster.
> > When replaced with reporting an EIO error the node survives as well as the 
> > job. The job might fail at its IO but it wouldn't fail performing its work 
> > flow which is way more important.
> 
> Yes, a meaningless error is better than a crash, but a proper fix is
> better still.  As I said, my guess is that the memory barrier in the
> previous patch might have fixed the bug, so the LASSERT can remain.
> 
> Doug: is there any chance that this might be the case?

I got a hold of Doug and discussed this issue. So the answer is that the
original logs to track down the original problem no longer exist. So 
finding the original source of the problem can't be done at this point.
Would you be okay with a version of this patch with dump_stack() and
treat it as a debug patch. We really need to collect logs to figure out
the real problem. I will push a debug patch to OpenSFS branch since it
is more widely used.