[Lustre-discuss] ost_brw_write()

Kevin Van Maren Kevin.Vanmaren at Sun.COM
Wed Dec 31 09:34:41 PST 2008


I have previously observed cases where a NIC with RX checksum offload
would pass packets up to Linux as "good" if the Ethernet CRC was valid,
even though the UDP checksum failed. It appeared that something (the
sender?) was corrupting a byte in the payload after the UDP checksum was
calculated, but before the Ethernet CRC was calculated.
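
To make that failure mode concrete, here is a minimal Python sketch (not
taken from any driver or from Lustre). It assumes a hypothetical sender
that computes an RFC 1071 style checksum over the payload, flips one
byte, and only then computes a CRC-32 over the result. The UDP
pseudo-header is left out, and zlib.crc32 stands in for the Ethernet FCS.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement sum, as used by UDP (pseudo-header omitted)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input to a 16-bit boundary
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def send(payload: bytes, corrupt_at=None):
    """Hypothetical sender: UDP csum first, optional corruption, then CRC."""
    udp_csum = internet_checksum(payload)
    wire = bytearray(payload)
    if corrupt_at is not None:
        wire[corrupt_at] ^= 0xFF  # byte flipped after the UDP csum was taken
    frame_crc = zlib.crc32(bytes(wire))  # CRC covers the already-corrupted bytes
    return bytes(wire), udp_csum, frame_crc


def receive(wire: bytes, udp_csum: int, frame_crc: int):
    """Return (ethernet_crc_ok, udp_csum_ok) as a receiver would see them."""
    return zlib.crc32(wire) == frame_crc, internet_checksum(wire) == udp_csum


clean = receive(*send(b"lustre bulk data"))
damaged = receive(*send(b"lustre bulk data", corrupt_at=3))
print(clean, damaged)
```

This prints (True, True) (True, False): the CRC check passes in both
cases, and only the UDP checksum catches the corrupted byte. A receiver
that reports "good" on the strength of the CRC alone would hand the
damaged payload to the stack.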

So disable any NIC offloading on both the client and the server (with
ethtool) and see if the Lustre checksum errors go away.
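
As a rough sketch of what that looks like, the Python snippet below only
builds and prints the ethtool command lines rather than running them.
The interface name eth0 and the feature list (rx, tx, sg, tso) are
assumptions; which offloads exist depends on the driver and kernel, so
check the output of "ethtool -k" first.

```python
import shlex

# Offload features to turn off. These are standard `ethtool -K` keywords,
# but the list is an assumption: not every driver supports all of them.
FEATURES = ("rx", "tx", "sg", "tso")


def offload_off_command(iface: str, features=FEATURES) -> list:
    """Build the argv for `ethtool -K <iface> <feature> off ...`."""
    argv = ["ethtool", "-K", iface]
    for feature in features:
        argv += [feature, "off"]
    return argv


def show_offload_command(iface: str) -> list:
    """Build the argv for `ethtool -k <iface>`, which lists current settings."""
    return ["ethtool", "-k", iface]


# "eth0" is a placeholder interface name. Run the printed commands as root
# on both the client and the server.
for argv in (show_offload_command("eth0"), offload_off_command("eth0")):
    print(shlex.join(argv))
```

The settings do not persist across a reboot or a driver reload, so they
need to be reapplied if the test runs for several days.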

Also note that if you are using mmap'ed files, it is _expected_ that the
checksum might not match, as the page can be modified between when
Lustre calculates the checksum and when the page is actually
transmitted.
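
The ordering problem can be shown with a small Python sketch. It is only
an illustration: an anonymous mapping stands in for one page of an
mmap'ed file, zlib.crc32 stands in for whatever checksum Lustre is
configured to use, and the three steps run one after another instead of
racing as they would in practice.

```python
import mmap
import zlib

PAGE_SIZE = 4096

# Anonymous mapping standing in for one page of an mmap'ed file.
page = mmap.mmap(-1, PAGE_SIZE)
page[:5] = b"hello"

# Step 1: the client computes the checksum over the page contents.
# zlib.crc32 is a stand-in; the algorithm Lustre uses is not assumed here.
csum_at_send = zlib.crc32(page[:])

# Step 2: the application stores to the mapping before the page goes out.
page[0:1] = b"H"

# Step 3: the page is transmitted and the receiver checksums what arrived.
csum_at_receive = zlib.crc32(page[:])

mismatch = csum_at_send != csum_at_receive
print("checksum mismatch:", mismatch)
page.close()
```

The mismatch here does not mean the data was damaged in transit; the
receiver simply got a newer version of the page than the one that was
checksummed.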

Kevin


Mag Gam wrote:
> I have done the tuning but still occasionally get a CSUM error. About
> 200 per day.  Considering we probably transfer close to 500G to 1TB
> of data a day, that is not that bad.
>
> I did the tuning on the e1000 card but I am not sure what else to do.
> The network guys found nothing wrong with their switch and the cables
> are fine (we even got them replaced).
>
> Since lustre has its own checksumming, I suppose I am in good shape...
>
> On Sat, Nov 15, 2008 at 10:59 AM, Mag Gam <magawake at gmail.com> wrote:
>   
>> Brian. Thanks for getting back to me.
>>
>> Yes. The contents matched, but I am getting RX drops, which is kind of
>> scary. I used the same machine for the test.
>>
>> I have already looked at the Lnet tests
>>
>> http://manual.lustre.org/manual/LustreManual16_HTML/LustreIOKit.html#50642990_pgfId-1290255
>>
>> For some reason, "lst add_group servers ipaddrs_of_OSS_and_MDS" gets
>> me an RPC error, but it seems my 5 servers get added. Weird. Is there
>> better documentation, or perhaps an example, for the LNET tests? I am
>> curious to try them.
>>
>> BTW, I am very happy to see this
>> http://manual.lustre.org/manual/LustreManual16_HTML/LustreTuning.html#50642992_24952
>> (Last section regarding CRC). Where can I read more about this??
>>
>>
>>
>> Keep in mind, I am using e1000 NICs, and I think there is some tuning
>> I should be doing (but I am not certain if I am doing the right
>> tuning)
>>
>> TIA
>>
>>
>> On Fri, Nov 14, 2008 at 7:11 AM, Brian J. Murrell <Brian.Murrell at sun.com> wrote:
>>     
>>> On Thu, 2008-11-13 at 21:32 -0500, Mag Gam wrote:
>>>       
>>>> OK.
>>>>
>>>> It seems Lustre FS is dropping the packets.
>>>>         
>>> No.  Nobody said anything about packets being dropped.  They are failing
>>> checksum.
>>>
>>>       
>>>>  I did multiple FTPs and
>>>> they were very large files (10GB each), and no packet drops
>>>>         
>>> Did you verify the contents of what you ftp'd matched the original?  Are
>>> you using the same machines in your ftp tests that are reporting
>>> checksum failures with Lustre?
>>>
>>> You might want to look in our test suite and see if there is a checksum
>>> unit test.  I'd be surprised if there is not.  Maybe run that and see
>>> what the results are.  I'm afraid I don't have a lustre source tree very
>>> handy at the moment to check for you.
>>>
>>> b.
>>>
>>>
>>> _______________________________________________
>>> Lustre-discuss mailing list
>>> Lustre-discuss at lists.lustre.org
>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>>
>>>       



