[lustre-discuss] BAD CHECKSUM

Dilger, Andreas andreas.dilger at intel.com
Thu Dec 7 12:36:15 PST 2017


On Dec 7, 2017, at 10:37, Hans Henrik Happe <happe at nbi.dk> wrote:
> 
> Hi,
> 
> Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
> overwriting memory while it is being DMA'ed to the network?
> 
> After upgrading to 2.10.1 on the server side we started seeing this from
> a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
> errors. We have not yet established whether the application is doing
> things correctly.

If applications are using mmap IO, it is possible for a page to become inconsistent after its checksum has been computed.  However, mmap IO is normally detected by the client, and no message should be printed in that case.
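
To make the race concrete, here is a toy Python sketch of a page that changes after its checksum was taken.  This is not Lustre code: the 4096-byte page size, the function name, and the use of CRC32 are assumptions made purely for illustration.

```python
import mmap
import zlib

PAGE_SIZE = 4096  # illustrative page size, not taken from any Lustre setting


def simulate_page_race():
    """Return (checksum taken before the store, checksum of what is sent)."""
    page = mmap.mmap(-1, PAGE_SIZE)   # anonymous mapping standing in for an mmap'ed file page
    page[:] = b"A" * PAGE_SIZE        # initial page contents

    sent = zlib.crc32(page[:])        # checksum computed when the write is prepared
    page[0:1] = b"B"                  # application store lands after the checksum
    received = zlib.crc32(page[:])    # checksum of the bytes that actually go out

    page.close()
    return sent, received


sent, received = simulate_page_race()
print("checksums match:", sent == received)
```

The two values differ because the single-byte store happened between the checksum and the (simulated) transmission, which is the same shape of problem an mmap writer can cause.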

There isn't anything that the application needs to do, since the client will resend the data if there is a checksum error, but the resends do slow down the IO.  If the inconsistency is on the client, there is no cause for concern (though it would be good to figure out the root cause).
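
As a rough model of why resends cost time but not correctness, the following Python sketch retries a transfer until the receiver-side checksum matches.  The names write_with_resend and flaky_once, the retry limit, and the CRC32 choice are all hypothetical; the real client resend logic is more involved than this.

```python
import zlib


def write_with_resend(data, transfer, max_attempts=5):
    """Return the number of attempts needed for a checksum-clean transfer."""
    expected = zlib.crc32(data)                   # sender-side checksum
    for attempt in range(1, max_attempts + 1):
        received = transfer(data)                 # hypothetical transport callable
        if zlib.crc32(received) == expected:      # receiver-side verification
            return attempt
    raise IOError("checksum still bad after %d attempts" % max_attempts)


def flaky_once():
    """Build a transport that corrupts the first byte on its first call only."""
    calls = {"n": 0}

    def transfer(data):
        calls["n"] += 1
        if calls["n"] == 1:
            return bytes([data[0] ^ 0xFF]) + data[1:]
        return data

    return transfer


attempts = write_with_resend(b"hello lustre", flaky_once())
print("attempts needed:", attempts)
```

The data arrives intact in the end; the only visible effect of the bad first attempt is the extra round trip.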

It would be interesting to see what the exact error message is, since that will say whether the data became inconsistent on the client, or over the network.  If the inconsistency is over the network or on the server, then that may point to hardware issues.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation








