[lustre-discuss] Lustre checksum error

Gin Tan gin.tan at monash.edu
Tue Nov 22 21:56:38 PST 2016


Hi guys,

I'm seeing a lot of read and write checksum errors. The errors below were
logged while running md5sum on a single file with a stripe count of 1.


 LustreError: 133-1: fs01-OST0010-osc-ffff881decdd7800: BAD READ CHECKSUM:
from 192.168.1.9@o2ib inode [0x0:0x0:0x0] object 0x0:634095 extent
[71303168-72351743]

 LustreError: 3126:0:(osc_request.c:1492:osc_brw_fini_request()) client
7b4cff2f, server aa595f33, cksum_type 4

 LustreError: 3126:0:(osc_request.c:1492:osc_brw_fini_request()) Skipped
490 previous similar messages

LustreError: 3126:0:(osc_request.c:1528:osc_brw_redo_request()) @@@ redo
for recoverable error -11  req@ffff881463f36600 x1551586318841408/t0(0)
o3->fs01-OST0010-osc-ffff881decdd7800@192.168.1.9@o2ib:6/4 lens 488/400 e 0
to 0 dl 1479874333 ref 2 fl Interpret:RM/0/0 rc 1048576/1048576

LustreError: 3126:0:(osc_request.c:1528:osc_brw_redo_request()) Skipped 462
previous similar messages

LustreError: Skipped 492 previous similar messages

LustreError: 3107:0:(osc_request.c:1657:brw_interpret())
fs01-OST0007-osc-ffff881decdd7800: too many resent retries for object:
0:634208, rc = -11.

LustreError: 3107:0:(osc_request.c:1657:brw_interpret()) Skipped 27
previous similar messages

Lustre: 30841:0:(mdc_request.c:1484:mdc_read_page()) Page-wide hash
collision: 0x0

Lustre: 30841:0:(mdc_request.c:1484:mdc_read_page()) Page-wide hash
collision: 0x0

Lustre: 30841:0:(mdc_request.c:1484:mdc_read_page()) Page-wide hash
collision: 0x0

Lustre: 30841:0:(mdc_request.c:1484:mdc_read_page()) Page-wide hash
collision: 0x0

Lustre: 30841:0:(mdc_request.c:1484:mdc_read_page()) Skipped 1 previous
similar message

Lustre: 30841:0:(mdc_request.c:1484:mdc_read_page()) Page-wide hash
collision: 0x0

Lustre: 30841:0:(mdc_request.c:1484:mdc_read_page()) Page-wide hash
collision: 0x0


This is when I'm trying to unmount fs01:

Lustre: 21829:0:(client.c:2048:ptlrpc_expire_one_request()) @@@ Request
sent has timed out for slow reply: [sent 1479878225/real 1479878225]
req@ffff881656d2c800 x1551586341459476/t0(0)
o9->fs01-OST0019-osc-ffff881decdd7800@192.168.1.9@o2ib:28/4 lens 224/224 e
0 to 1 dl 1479878231 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1

Lustre: 21829:0:(client.c:2048:ptlrpc_expire_one_request()) @@@ Request
sent has timed out for slow reply: [sent 1479878245/real 1479878245]
req@ffff880778804800 x1551586345177840/t0(0)
o9->fs01-OST0016-osc-ffff881decdd7800@192.168.1.9@o2ib:28/4 lens 224/224 e
0 to 1 dl 1479878251 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1

Lustre: 21829:0:(client.c:2048:ptlrpc_expire_one_request()) @@@ Request
sent has timed out for slow reply: [sent 1479878258/real 1479878258]
req@ffff880778804800 x1551586347659620/t0(0)
o9->fs01-OST0014-osc-ffff881decdd7800@192.168.1.6@o2ib:28/4 lens 224/224 e
0 to 1 dl 1479878264 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1

Lustre: 21829:0:(client.c:2048:ptlrpc_expire_one_request()) @@@ Request
sent has timed out for slow reply: [sent 1479878265/real 1479878265]
req@ffff880778804800 x1551586348899860/t0(0)
o9->fs01-OST0013-osc-ffff881decdd7800@192.168.1.5@o2ib:28/4 lens 224/224 e
0 to 1 dl 1479878271 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1

Lustre: 21829:0:(client.c:2048:ptlrpc_expire_one_request()) @@@ Request
sent has timed out for slow reply: [sent 1479878272/real 1479878272]
req@ffff880778804800 x1551586350139276/t0(0)
o9->fs01-OST0012-osc-ffff881decdd7800@192.168.1.4@o2ib:28/4 lens 224/224 e
0 to 1 dl 1479878278 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1

Lustre: 21829:0:(client.c:2048:ptlrpc_expire_one_request()) @@@ Request
sent has timed out for slow reply: [sent 1479878286/real 1479878286]
req@ffff880778804800 x1551586352619892/t0(0)
o9->fs01-OST0010-osc-ffff881decdd7800@192.168.1.9@o2ib:28/4 lens 224/224 e
0 to 1 dl 1479878292 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1

Lustre: 21829:0:(client.c:2048:ptlrpc_expire_one_request()) Skipped 1
previous similar message

Lustre: 21829:0:(client.c:2048:ptlrpc_expire_one_request()) @@@ Request
sent has timed out for slow reply: [sent 1479878306/real 1479878306]
req@ffff880778804800 x1551586356341548/t0(0)
o9->fs01-OST000d-osc-ffff881decdd7800@192.168.1.5@o2ib:28/4 lens 224/224 e
0 to 1 dl 1479878312 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1

Lustre: 21829:0:(client.c:2048:ptlrpc_expire_one_request()) Skipped 1
previous similar message

Lustre: 21829:0:(client.c:2048:ptlrpc_expire_one_request()) @@@ Request
sent has timed out for slow reply: [sent 1479878347/real 1479878347]
req@ffff880778804800 x1551586363782424/t0(0)
o9->fs01-OST0007-osc-ffff881decdd7800@192.168.1.5@o2ib:28/4 lens 224/224 e
0 to 1 dl 1479878353 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1

Lustre: 21829:0:(client.c:2048:ptlrpc_expire_one_request()) Skipped 3
previous similar messages

Lustre: Unmounted fs01-client

The unmount finishes within 10 to 15 minutes. After I re-mounted fs01,
there were no more checksum errors.

This happens on multiple clients.

What I am seeing is that the client gets stuck during the RPC. There's no
network issue between client and server; I ran an LNet selftest to check.

I'm trying to troubleshoot the checksum errors. What else should I check?
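
One way to narrow this down is to tally the BAD READ/WRITE CHECKSUM
messages per OSC target and server NID, to see whether they cluster on one
server or are spread across all of them. Below is a minimal Python sketch,
assuming the client kernel messages have been saved as text. The function
name and the regular expression are illustrative only and are not part of
Lustre; the pattern is based solely on the message format shown above and
accepts the NID written either as "192.168.1.9@o2ib" or in the list
archive's "192.168.1.9 at o2ib" form.

```python
import re
from collections import Counter

# Pattern derived from the single message format quoted in this post
# (an assumption: other Lustre versions may word the message differently).
CHECKSUM_RE = re.compile(
    r"(?P<target>\S+-OST[0-9a-fA-F]{4}-osc-\S+): "
    r"BAD (?P<op>READ|WRITE) CHECKSUM:\s+from\s+"
    r"(?P<nid>\S+?)(?:@| at )(?P<net>\w+)"
    r".*?extent\s+\[(?P<start>\d+)-(?P<end>\d+)\]",
    re.DOTALL,  # messages may be wrapped across several lines
)


def summarize_checksum_errors(log_text):
    """Count checksum errors per (op, target, server NID) and list extents.

    Returns (counts, extents), where extents holds (start_offset, length)
    pairs in bytes. Hypothetical helper, for illustration only.
    """
    counts = Counter()
    extents = []
    for m in CHECKSUM_RE.finditer(log_text):
        server = f'{m["nid"]}@{m["net"]}'
        counts[(m["op"], m["target"], server)] += 1
        start, end = int(m["start"]), int(m["end"])
        extents.append((start, end - start + 1))  # extent end is inclusive
    return counts, extents


if __name__ == "__main__":
    import sys

    counts, extents = summarize_checksum_errors(sys.stdin.read())
    for (op, target, server), n in counts.most_common():
        print(f"{n:6d}  {op:5s}  {target}  {server}")
```

Run against the first message above, this reports one READ error for
fs01-OST0010 from 192.168.1.9@o2ib, covering a 1048576-byte extent. Note
that "Skipped N previous similar messages" lines are not counted, so the
totals are a lower bound.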

Thanks.

-- 
Regards,
Gin Tan