[Lustre-discuss] osc_brw_redo_request error on clients

James Robnett jrobnett at AOC.NRAO.EDU
Wed Feb 9 16:24:45 PST 2011


> Normally I've had no problems but recently I have multiple clients
> reporting the following error:
>
> LustreError: 3935:0:(osc_request.c:1629:osc_brw_redo_request()) @@@ redo
> for recoverable error  req at ffff8101ae084000 x1358858531428366/t60136289752
> o4->lustre-OST0004_UUID at 192.168.1.12@o2ib:6/4 lens 448/608 e 0 to 1 dl
> 1297285890 ref 2 fl Interpret:R/0/0 rc 0/0
>
> which in turn appears to generate a premature EOF on our user software.
>
> There are no corresponding errors on the servers.

   Correction: my statement above is not true.  The servers do log
corresponding errors, of the form:

Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
2964:0:(ost_handler.c:1038:ost_brw_write()) client csum f00001, server
csum 964d53e2
Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
2964:0:(ost_handler.c:1038:ost_brw_write()) Skipped 43 previous similar
messages
Feb  9 17:05:21 lustre-oss-1 kernel: LustreError: 168-f: lustre-OST0000:
BAD WRITE CHECKSUM: changed in transit before arrival at OST from
12345-10.64.1.212 at tcp inum 2981338/1802650709 object 8183950/0 extent
[2384461824-2385510399]
Feb  9 17:05:21 lustre-oss-1 kernel: LustreError: Skipped 43 previous
similar messages
Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
2964:0:(ost_handler.c:1100:ost_brw_write()) client csum f00001, original
server csum 964d53e2, server csum now 964d53e2
Feb  9 17:05:21 lustre-oss-1 kernel: LustreError:
2964:0:(ost_handler.c:1100:ost_brw_write()) Skipped 43 previous similar
messages
Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
3035:0:(ost_handler.c:1038:ost_brw_write()) client csum f00001, server
csum 180cd9bd
Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
3035:0:(ost_handler.c:1038:ost_brw_write()) Skipped 63 previous similar
messages
Feb  9 17:10:22 lustre-oss-1 kernel: LustreError: 168-f: lustre-OST0000:
BAD WRITE CHECKSUM: changed in transit before arrival at OST from
12345-10.64.1.212 at tcp inum 2981338/1802650709 object 8183950/0 extent
[4355784704-4356833279]
Feb  9 17:10:22 lustre-oss-1 kernel: LustreError: Skipped 63 previous
similar messages
Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
3035:0:(ost_handler.c:1100:ost_brw_write()) client csum f00001, original
server csum 180cd9bd, server csum now 180cd9bd
Feb  9 17:10:22 lustre-oss-1 kernel: LustreError:
3035:0:(ost_handler.c:1100:ost_brw_write()) Skipped 63 previous similar
messages
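
   For anyone who wants to tally these messages per client, here is a
minimal sketch in Python.  It assumes the syslog text has the same shape
as the excerpt above; the function name summarize() and the regular
expression are illustrative and not part of any Lustre tool.  It accepts
the client NID written with either "@" or " at " (the list archive
rewrites "@").  Both extents logged above come out to exactly 1048576
bytes (1 MiB).

```python
import re
from collections import Counter

# Matches the "BAD WRITE CHECKSUM" message once wrapped lines are joined.
# The NID may appear as "host@net" or, in the list archive, "host at net".
BAD_CSUM = re.compile(
    r"(?P<ost>\S+): BAD WRITE CHECKSUM: .*? from (?P<nid>\S+(?: at |@)\S+)"
    r".*? object (?P<obj>\d+)/\d+ extent \[(?P<start>\d+)-(?P<end>\d+)\]"
)


def summarize(log_text):
    """Return (message count per client NID, list of extent sizes in bytes)."""
    # Collapse all whitespace so messages wrapped across lines match.
    flat = " ".join(log_text.split())
    counts = Counter()
    sizes = []
    for m in BAD_CSUM.finditer(flat):
        counts[m["nid"].replace(" at ", "@")] += 1
        # Extents are inclusive byte ranges, hence the +1.
        sizes.append(int(m["end"]) - int(m["start"]) + 1)
    return counts, sizes
```

   Note that the "Skipped N previous similar messages" lines mean the
logged count understates the real number of failures, so the tally is a
lower bound.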

   The other OSS shows similar errors.  We are doing mmap I/O, and a
search suggests that these checksum errors are associated with mmap I/O.

   I'm open to suggestions.  In the meantime, the userspace code can be
switched from mmap to regular file I/O via an rc file, so we'll try that
and see if it at least makes the errors go away.
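
   To illustrate the kind of switch meant here, a minimal sketch in
Python of one write routine with both paths.  This is not our actual
software: write_block() and the use_mmap flag (standing in for the rc
file setting) are hypothetical names.  It assumes a Linux client, a
non-empty write, and a file that is already large enough, since mmap
cannot map past end of file.

```python
import mmap
import os


def write_block(path, offset, data, use_mmap):
    """Write data at offset through mmap, or through a plain pwrite()."""
    with open(path, "r+b") as f:
        if use_mmap:
            # The mmap offset must be a multiple of the allocation
            # granularity, so map from the aligned offset below the target.
            base = offset - offset % mmap.ALLOCATIONGRANULARITY
            length = offset - base + len(data)
            with mmap.mmap(f.fileno(), length, offset=base) as m:
                m[offset - base:] = data
                m.flush()  # push the dirty pages back to the file
        else:
            # Regular file I/O: one positional write, no page mapping.
            os.pwrite(f.fileno(), data, offset)
```

   Both paths leave the same bytes in the file; they differ only in
whether the data reaches the filesystem through dirty mapped pages or
through an explicit write call.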

James




