[Lustre-discuss] potential issue with data corruption

Oleg Drokin green at whamcloud.com
Thu Jul 14 12:47:54 PDT 2011


Hello!

On Jul 14, 2011, at 1:59 PM, Lisa Giacchetti wrote:

> Jul  7 07:10:08 cmsls6 kernel: Lustre: 15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
> Jul  7 07:59:42 cmsls6 kernel: Lustre: 3272:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905647245 sent from cmsprod1-OST002d to NID 131.225.191.35 at tcp 7s ago has timed out (7s prior to deadline).
> Jul  7 07:59:42 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.191.35 at tcp was evicted due to a lock completion callback to 131.225.191.35 at tcp timed out: rc -107
> Jul  7 09:26:58 cmsls6 kernel: Lustre: 15433:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 9235f65e-ff71-2b1f-60fb-c049cbad5728 reconnecting
> Jul  7 09:53:50 cmsls6 kernel: Lustre: 2663:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905668862 sent from cmsprod1-OST002d to NID 131.225.204.88 at tcp 7s ago has timed out (7s prior to deadline).
> Jul  7 09:53:50 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.204.88 at tcp was evicted due to a lock blocking callback to 131.225.204.88 at tcp timed out: rc -107
> Jul  7 10:18:57 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.207.176 at tcp was evicted due to a lock blocking callback to 131.225.207.176 at tcp timed out: rc -107
> Jul  7 10:23:01 cmsls6 kernel: Lustre: 15405:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905675944 sent from cmsprod1-OST002d to NID 131.225.204.118 at tcp 7s ago has timed out (7s prior to deadline).
> Jul  7 11:06:31 cmsls6 kernel: Lustre: 15341:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: e25b2761-680a-4d94-ed2c-10913403c0a3 reconnecting
> Jul  7 12:26:17 cmsls6 kernel: Lustre: 15352:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905703492 sent from cmsprod1-OST002d to NID 131.225.190.151 at tcp 7s ago has timed out (7s prior to deadline).
> Jul  7 12:26:17 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.190.151 at tcp was evicted due to a lock blocking callback to 131.225.190.151 at tcp timed out: rc -107
> Jul  7 12:26:17 cmsls6 kernel: LustreError: 15352:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed export ffff810c3926f400 ns: filter-cmsprod1-OST002d_UUID lock: ffff8109c7f21a00/0xf22d54118e04e04d lrc: 3/0,0 mode: --/PW res: 337742/0 rrc: 2 type: EXT [0->1048575] (req 0->1048575) flags: 0x0 remote: 0x6c03f21f59f6b4e6 expref: 19 pid: 15352 timeout 0
> Jul  7 12:26:17 cmsls6 kernel: Lustre: 2740:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk IO comm error with f81d3629-7e6a-1b5d-810e-ad73d7f5c90d at NET_0x2000083e1be97_UUID id 12345-131.225.190.151 at tcp - client will retry
> Jul  7 12:26:19 cmsls6 kernel: Lustre: 2742:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk IO comm error with f81d3629-7e6a-1b5d-810e-ad73d7f5c90d at NET_0x2000083e1be97_UUID id 12345-131.225.190.151 at tcp - client will retry
> 
> 
> Some of these errors seem really bad, like the bulk IO comm error or the eviction due to a lock callback.
> What should I be looking for here?  I have determined that some of the messages saying a client has been evicted because the
> OSS thinks it's dead are not due to the system being down. So what makes the OSS think the client is dead?

Well, the clients become unresponsive for some reason; you really need to look at the client-side logs, around the times of the evictions, for clues on that.
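
To know which clients' logs to check, and for which times, one option is to pull the evicted NIDs out of the OSS syslog first. Below is a minimal Python sketch, assuming the message format quoted above; the function name evictions_by_nid and the regular expression are illustrative and not part of any Lustre tool. It accepts both "@tcp" and the " at tcp" form that the list archive produces.

```python
import re
from collections import defaultdict

# Assumed format, taken from the syslog lines quoted in this thread:
#   <ts> <host> kernel: LustreError: 138-a: <target>: A client on nid <nid>@<net>
#   was evicted due to a lock <kind> callback ...
EVICT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+\S+\s+kernel:\s+LustreError:.*?"
    r"(?P<target>\S+): A client on nid (?P<nid>\S+?)(?:@| at )(?P<net>\w+) "
    r"was evicted due to a lock (?P<kind>\w+) callback"
)


def evictions_by_nid(lines):
    """Group eviction events by client NID.

    Returns {nid: [(timestamp, target, callback_kind), ...]}.
    Lines that are not eviction messages are skipped.
    """
    found = defaultdict(list)
    for line in lines:
        # Tolerate mail-style quoting ("> ") in front of the log line.
        m = EVICT_RE.search(line.lstrip("> "))
        if m:
            found[m["nid"]].append((m["ts"], m["target"], m["kind"]))
    return dict(found)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python evictions.py < oss-messages.log
    for nid, events in evictions_by_nid(sys.stdin).items():
        for ts, target, kind in events:
            print(f"{nid}\t{ts}\t{target}\t{kind}")
```

The timestamps it prints are the ones to look up in each client's own log.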

> Also is there any way to determine what files are involved in these errors?

Well, the messages around the lock blocking callback evictions will give you the OST name and the object index, which you might be able to map back to a file.
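
As an illustration, here is a small Python sketch that pulls those two pieces out of a lock dump line such as the ldlm_handle_enqueue() message quoted above. It assumes the first number after "res:" is the object ID on that OST and that the four characters after "OST" in the target name are the OST index in hexadecimal; the function name lock_resource and the dictionary keys are made up for this example.

```python
import re

# Assumed layout, based on the quoted line:
#   ... ns: filter-<fsname>-OST<hex index>_UUID ... res: <n>/<m> ... [<start>-><end>] ...
LOCK_RE = re.compile(
    r"ns: filter-(?P<target>\S+?-OST(?P<idx>[0-9a-fA-F]{4}))_UUID\b.*?"
    r"\bres: (?P<first>\d+)/(?P<second>\d+)\b.*?"
    r"\[(?P<start>\d+)->(?P<end>\d+)\]"
)


def lock_resource(line):
    """Return the OST and resource fields from a lock dump line, or None."""
    m = LOCK_RE.search(line)
    if m is None:
        return None
    return {
        "target": m["target"],
        # Target names carry the index in hex; convert it to decimal.
        "ost_index": int(m["idx"], 16),
        # Assumption: the first resource field is the object ID.
        "object_id": int(m["first"]),
        "res_second": int(m["second"]),
        # Byte range covered by the extent lock.
        "extent": (int(m["start"]), int(m["end"])),
    }
```

With the decimal OST index and the object ID in hand, one way to find the file is to compare them against the obdidx and objid columns that "lfs getstripe" prints for candidate files on a client.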

All that said, 1.8.3 is quite old and I think it would be a much better idea to try 1.8.6 and see if it improves things.

Bye,
    Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.



