[Lustre-discuss] bad write checksums
Andreas Dilger
adilger at sun.com
Fri Jul 24 14:03:50 PDT 2009
On Jul 24, 2009 14:06 -0600, Andreas Dilger wrote:
> On Jul 24, 2009 10:33 -0400, Craig Prescott wrote:
> > We've been testing some 1.8.0.1 patchless clients (RHEL5.3, x86_64, RPMs
> > from the Sun download page) with our 1.6.4.2 servers.
> >
> > The OSS nodes started logging these LustreErrors from the 1.8.0.1 clients:
> >
> > > LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) client csum 8448447f, original server csum 66fb7cff, server csum now 66fb7cff
> > > LustreError: 7302:0:(ost_handler.c:1157:ost_brw_write()) Skipped 1 previous similar message
> > > LustreError: 7391:0:(ost_handler.c:1095:ost_brw_write()) client csum 9d8c7d6a, server csum 2cfdcb47
> > > LustreError: 168-f: ufhpc-OST0004: BAD WRITE CHECKSUM: changed in transit before arrival at OST from 12345-10.13.28.55 at tcp inum 38470778/1485322248 object 67094039/0 extent [0-1023]
> >
> > Is this a known issue with running 1.8.0.1 clients against 1.6.4.2
> > servers? We aren't seeing these messages in relation to our 1.6 clients.
>
> This is a known issue if the clients are using mmap IO (which can change
> the kernel pages w/o notifying the kernel). It would be possible to fix
> this warning by adding a "file is mmapped" flag to the RPC, so that the
> server suppresses the console error and only prints an error message if
> the IO never makes it to the server intact in the next 5 retries.
>
> Unfortunately, since this is a non-fatal error, nobody has worked on
> fixing it yet.
PS - of course, if mmap is not involved and the errors are isolated to
particular client/server nodes, it is entirely possible that the network
is corrupting the data in transit, as the message suggests.
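
To illustrate the mmap case, here is a minimal sketch in Python. It is not
Lustre code: simulate_write(), the 4096-byte page size, and the use of
zlib.crc32 as the checksum are all assumptions made for the example. It
only shows why a page modified after the client computes its checksum
produces a mismatch that looks, from the server side, like corruption
before arrival.

```python
import mmap
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def checksum(data: bytes) -> int:
    """CRC32 as a stand-in for the bulk write checksum."""
    return zlib.crc32(data) & 0xFFFFFFFF


def simulate_write(modify_after_checksum: bool):
    """Hypothetical model of one bulk write; returns (client_csum, server_csum)."""
    # Anonymous mapping standing in for an mmapped file page.
    page = mmap.mmap(-1, PAGE_SIZE)
    page[:] = b"A" * PAGE_SIZE

    # The client checksums the page before sending it.
    client_csum = checksum(page[:])

    # An application storing through its mapping can change the page
    # after the checksum was taken but before the data goes out.
    if modify_after_checksum:
        page[0:4] = b"BBBB"

    # The server checksums whatever bytes actually arrived.
    wire_data = page[:]
    server_csum = checksum(wire_data)

    page.close()
    return client_csum, server_csum


if __name__ == "__main__":
    for racing in (False, True):
        client, server = simulate_write(racing)
        status = "ok" if client == server else "BAD WRITE CHECKSUM"
        print(f"mmap store during write={racing}: "
              f"client csum {client:08x}, server csum {server:08x} -> {status}")
```

In this sketch the server's own checksum is stable across recomputation, so
a mismatch from this cause cannot be told apart from network corruption by
the checksums alone. That is why checking whether mmap is in use, and
whether the errors follow particular nodes, is the useful first step.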
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.