[lustre-discuss] BAD CHECKSUM

Mon Dec 11 22:46:20 PST 2017

On 10-12-2017 06:07, Dilger, Andreas wrote:
> Based on the messages on the client, this isn’t related to mmap() or
> writes done by the client, since the data has the same checksum from
> before it was sent and after it got the checksum error returned from
> the server. That means the pages did not change on the client.
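
The reasoning in the paragraph above can be sketched as follows. This is only an illustration, not Lustre code: `classify_mismatch` is a hypothetical helper, and `zlib.crc32` stands in for whichever checksum algorithm the client and server actually negotiated.

```python
import zlib


def classify_mismatch(checksum_before_send: int, pages: bytes) -> str:
    """Decide where the data changed after the server reports a bad checksum.

    checksum_before_send: value the client computed before sending the data.
    pages: the client's copy of the same data, re-read after the error came back.
    """
    # Recompute over the client's pages once the server has complained.
    checksum_after_error = zlib.crc32(pages)
    if checksum_after_error == checksum_before_send:
        # The client's pages are unchanged, so the corruption happened
        # after the data left client memory.
        return "changed in transit or on server"
    # The pages themselves were modified (e.g. by mmap writes).
    return "changed on client"
```

In this thread the client log corresponds to the first branch: the checksum is the same before the send and after the error.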
>
> Possible causes include the client network card, server network card,
> memory, or possibly the OFED driver?  It could of course be something
> in Lustre/LNet, though we haven’t had any reports of anything similar. 
>
> When the checksum code was first written, it was motivated by a faulty
> Ethernet NIC that had TCP checksum offload but a bad onboard cache.
> The data was corrupted when copied onto the NIC, but the TCP checksum
> was computed on the bad data, so the checksum was “correct” when
> received by the server and didn’t cause TCP resends.
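
A small simulation of that failure mode, as a sketch under stated assumptions: the function names are hypothetical, the single bit flip is an invented stand-in for the bad onboard cache, `zlib.adler32` stands in for the offloaded TCP checksum, and `zlib.crc32` stands in for the end-to-end checksum computed in host memory.

```python
import zlib


def send_through_faulty_nic(payload: bytes):
    """Return (bytes on the wire, offloaded checksum, end-to-end checksum).

    payload must be non-empty.
    """
    # End-to-end checksum: computed by the sender over the data in host memory.
    end_to_end = zlib.crc32(payload)
    # The data is copied onto the NIC and corrupted there (hypothetical bit flip).
    on_nic = bytearray(payload)
    on_nic[0] ^= 0x01
    # Offloaded checksum: computed by the NIC over the already-corrupted bytes.
    offloaded = zlib.adler32(bytes(on_nic))
    return bytes(on_nic), offloaded, end_to_end


def receive(wire: bytes, offloaded: int, end_to_end: int):
    """Return (transport check passed, end-to-end check passed)."""
    transport_ok = zlib.adler32(wire) == offloaded
    end_to_end_ok = zlib.crc32(wire) == end_to_end
    return transport_ok, end_to_end_ok
```

The transport-level check passes because it was computed over the bad data, so no resend is triggered at that layer; only the end-to-end check notices the corruption.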
>
> Are you seeing this on multiple servers?  The client log only shows
> one server, while the server log shows multiple clients.  If it is
> only happening on one server it might point to hardware. 
Yes, we are seeing it on all servers.
> Did you also upgrade the kernel and OFED at the same time as Lustre?
> You could try building Lustre 2.10.1 on the old 2.9.0 kernel and OFED
> to see if that works properly.
We upgraded to CentOS 7.4 and are using the included OFED on the
servers. Also, we upgraded the firmware on the server IB cards. We will
check further if this combination has compatibility issues.

Cheers,
Hans Henrik
>
> Cheers, Andreas
>
> On Dec 9, 2017, at 11:09, Hans Henrik Happe <happe at nbi.dk> wrote:
>
>>
>>
>> On 09-12-2017 18:57, Hans Henrik Happe wrote:
>>> On 07-12-2017 21:36, Dilger, Andreas wrote:
>>>> On Dec 7, 2017, at 10:37, Hans Henrik Happe <happe at nbi.dk> wrote:
>>>>> Hi,
>>>>>
>>>>> Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
>>>>> overwriting memory while it is being DMA'ed to the network?
>>>>>
>>>>> After upgrading to 2.10.1 on the server side we started seeing this
>>>>> from a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients
>>>>> emit these errors. We have not yet established whether the
>>>>> application is doing things correctly.
>>>> If applications are using mmap IO it is possible for the page to
>>>> become inconsistent after the checksum has been computed.  However,
>>>> mmap IO is normally detected by the client and no message should be
>>>> printed.
>>>>
>>>> There isn't anything that the application needs to do, since the
>>>> client will resend the data if there is a checksum error, but the
>>>> resends do slow down the IO.  If the inconsistency is on the
>>>> client, there is no cause for concern (though it would be good to
>>>> figure out the root cause).
>>>>
>>>> It would be interesting to see what the exact error message is,
>>>> since that will say whether the data became inconsistent on the
>>>> client, or over the network.  If the inconsistency is over the
>>>> network or on the server, then that may point to hardware issues.
>>> I've attached logs from a server and a client.
>>
>> There was a cut n' paste error in the first set of files. This should be
>> better.
>>
>> Looks like something goes wrong over the network.
>>
>> Cheers,
>> Hans Henrik
>>
>> <client.log>
>> <server.log>

