[Lustre-discuss] bad write checksums

Andreas Dilger adilger at sun.com
Fri Jul 24 13:06:39 PDT 2009


On Jul 24, 2009  10:33 -0400, Craig Prescott wrote:
> We've been testing some 1.8.0.1 patchless clients (RHEL5.3, x86_64, RPMs 
> from the Sun download page) with our 1.6.4.2 servers.
> 
> The OSS nodes started logging these LustreErrors from the 1.8.0.1 clients:
> 
> > LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) client csum 8448447f, original server csum 66fb7cff, server csum now 66fb7cff
> > LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) Skipped 1 previous similar message
> > LustreError: 7391:0:(ost_handler.c:1095:ost_brw_write()) client csum 9d8c7d6a, server csum 2cfdcb47
> > LustreError: 168-f: ufhpc-OST0004: BAD WRITE CHECKSUM: changed in transit before arrival at OST from 12345-10.13.28.55 at tcp inum 38470778/1485322248 object 67094039/0 extent [0-1023]
> 
> Is this a known issue with running 1.8.0.1 clients against 1.6.4.2 
> servers?  We aren't seeing these messages in relation to our 1.6 clients.

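This is a known issue if the clients are using mmap IO (which can change
the page contents w/o notifying the kernel).  The client checksums the
page, the application modifies it through the mapping before it is sent,
and the server then computes a different checksum on the data it receives.
It would be possible to fix this warning by adding a "file is mmapped"
flag to the RPC, suppressing the console error on the server for such
writes, and only printing an error message if the IO never makes it to
the server intact at least once in the next 5 retries.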

Unfortunately, since this is a non-fatal error, nobody has worked on
fixing it yet.
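
As a rough illustration of the race (not Lustre code), the sketch below
checksums an mmapped page, modifies it through the mapping, and checksums
it again.  It assumes zlib.crc32 as a stand-in for whatever checksum the
client and OST actually negotiate, and simulate_mmap_race() is a made-up
name for this example only:

```python
import mmap
import tempfile
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def simulate_mmap_race():
    """Return (client_csum, server_csum) for a page changed after checksumming."""
    with tempfile.TemporaryFile() as f:
        f.write(b"A" * PAGE_SIZE)
        f.flush()
        with mmap.mmap(f.fileno(), PAGE_SIZE) as page:
            # "Client" checksums the page while preparing the write.
            client_csum = zlib.crc32(page[:])
            # The application stores to the mapping before the data is sent.
            page[0:1] = b"B"
            # "Server" checksums the bytes that actually arrived.
            server_csum = zlib.crc32(page[:])
    return client_csum, server_csum


if __name__ == "__main__":
    client, server = simulate_mmap_race()
    print("client csum %08x, server csum %08x" % (client, server))
```

The two values differ even though nothing was corrupted on the wire, which
is why the message is harmless in the mmap case: the retried write carries
the current page contents.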

> Looking through the Lustre bugzilla, I see bug 18296, which discusses 
> these messages, but it was logged against Lustre version 1.6.6.

The 1.6 and 1.8 code is very similar, with only a handful of isolated
features added.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



