[lustre-discuss] BAD CHECKSUM

Patrick Farrell paf at cray.com
Thu Dec 7 13:20:35 PST 2017


I would think it's possible if the application is doing direct I/O. This
should be impossible for buffered I/O, since the checksums are all
calculated after the copies into kernel memory (the page cache) are
complete, so it doesn't matter what userspace does to its memory (at
least, it doesn't matter for the checksums).

And I'm not 100% sure it's possible for direct I/O.  I would think it is.
Someone else might be able to weigh in there - but it's definitely not
possible for buffered I/O.
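
To make the ordering argument concrete, here is a toy model in Python. It
is only a sketch, not Lustre code: the function names are made up, zlib's
CRC32 stands in for whatever checksum algorithm the client actually uses,
and the direct path assumes the checksum is computed over the
application's buffer in place (which, as noted, I am not certain of).

```python
import zlib


def send_buffered(user_buf, mutate):
    """Toy buffered path: copy first, then checksum and send the copy."""
    page_cache = bytes(user_buf)         # copy into "kernel memory"
    sent_sum = zlib.crc32(page_cache)    # checksum taken over the copy
    mutate(user_buf)                     # application scribbles on its own buffer
    wire_data = page_cache               # the transfer reads the copy
    return sent_sum, zlib.crc32(wire_data)


def send_direct(user_buf, mutate):
    """Toy direct path (assumed): checksum and transfer read the user buffer."""
    sent_sum = zlib.crc32(user_buf)      # checksum taken over live user memory
    mutate(user_buf)                     # change lands between checksum and transfer
    wire_data = bytes(user_buf)          # the transfer sees the modified data
    return sent_sum, zlib.crc32(wire_data)


def scribble(buf):
    """Hypothetical application bug: flip the bits of the first byte."""
    buf[0] ^= 0xFF


buffered_sent, buffered_received = send_buffered(bytearray(b"lustre data" * 4), scribble)
direct_sent, direct_received = send_direct(bytearray(b"lustre data" * 4), scribble)

print("buffered checksums match:", buffered_sent == buffered_received)
print("direct checksums match:", direct_sent == direct_received)
```

In this model the buffered case always matches, because nothing the
application does after the copy can reach the data being checksummed or
sent. The direct case mismatches only when the modification falls between
the checksum and the transfer.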


It would be good, as Andreas said, to see the exact message.

One other thought: While the Lustre client might resend correctly, I would
think it extremely likely that unintentionally modifying memory that is
in use for I/O represents a serious application bug, likely to lead to
incorrect operation.

Regards,
- Patrick

On 12/7/17, 2:36 PM, "lustre-discuss on behalf of Dilger, Andreas"
<lustre-discuss-bounces at lists.lustre.org on behalf of
andreas.dilger at intel.com> wrote:

>On Dec 7, 2017, at 10:37, Hans Henrik Happe <happe at nbi.dk> wrote:
>> 
>> Hi,
>> 
>> Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
>> overwriting memory while it is being DMA'ed to the network?
>> 
>> After upgrading to 2.10.1 on the server side we started seeing this from
>> a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
>> errors. We have not yet established whether the application is doing
>> things correctly.
>
>If applications are using mmap IO it is possible for the page to become
>inconsistent after the checksum has been computed.  However, mmap IO is
>normally detected by the client and no message should be printed.
>
>There isn't anything that the application needs to do, since the client
>will resend the data if there is a checksum error, but the resends do
>slow down the IO.  If the inconsistency is on the client, there is no
>cause for concern (though it would be good to figure out the root cause).
>
>It would be interesting to see what the exact error message is, since
>that will say whether the data became inconsistent on the client, or over
>the network.  If the inconsistency is over the network or on the server,
>then that may point to hardware issues.
>
>Cheers, Andreas
>--
>Andreas Dilger
>Lustre Principal Architect
>Intel Corporation
>
>_______________________________________________
>lustre-discuss mailing list
>lustre-discuss at lists.lustre.org
>http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


