[Lustre-discuss] potential issue with data corruption

Lisa Giacchetti lisa at fnal.gov
Thu Jul 14 12:55:44 PDT 2011


Oleg,
Thanks for your response.
My replies are inline below.

lisa

On 7/14/11 2:47 PM, Oleg Drokin wrote:
> Hello!
>
> On Jul 14, 2011, at 1:59 PM, Lisa Giacchetti wrote:
>
>> Jul  7 07:10:08 cmsls6 kernel: Lustre: 15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
>> Jul  7 07:59:42 cmsls6 kernel: Lustre: 3272:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905647245 sent from cmsprod1-OST002d to NID 131.225.191.35@tcp 7s ago has timed out (7s prior to deadline).
>> Jul  7 07:59:42 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.191.35@tcp was evicted due to a lock completion callback to 131.225.191.35@tcp timed out: rc -107
>> Jul  7 09:26:58 cmsls6 kernel: Lustre: 15433:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 9235f65e-ff71-2b1f-60fb-c049cbad5728 reconnecting
>> Jul  7 09:53:50 cmsls6 kernel: Lustre: 2663:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905668862 sent from cmsprod1-OST002d to NID 131.225.204.88@tcp 7s ago has timed out (7s prior to deadline).
>> Jul  7 09:53:50 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.204.88@tcp was evicted due to a lock blocking callback to 131.225.204.88@tcp timed out: rc -107
>> Jul  7 10:18:57 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.207.176@tcp was evicted due to a lock blocking callback to 131.225.207.176@tcp timed out: rc -107
>> Jul  7 10:23:01 cmsls6 kernel: Lustre: 15405:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905675944 sent from cmsprod1-OST002d to NID 131.225.204.118@tcp 7s ago has timed out (7s prior to deadline).
>> Jul  7 11:06:31 cmsls6 kernel: Lustre: 15341:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: e25b2761-680a-4d94-ed2c-10913403c0a3 reconnecting
>> Jul  7 12:26:17 cmsls6 kernel: Lustre: 15352:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905703492 sent from cmsprod1-OST002d to NID 131.225.190.151@tcp 7s ago has timed out (7s prior to deadline).
>> Jul  7 12:26:17 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.190.151@tcp was evicted due to a lock blocking callback to 131.225.190.151@tcp timed out: rc -107
>> Jul  7 12:26:17 cmsls6 kernel: LustreError: 15352:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed export ffff810c3926f400 ns: filter-cmsprod1-OST002d_UUID lock: ffff8109c7f21a00/0xf22d54118e04e04d lrc: 3/0,0 mode: --/PW res: 337742/0 rrc: 2 type: EXT [0->1048575] (req 0->1048575) flags: 0x0 remote: 0x6c03f21f59f6b4e6 expref: 19 pid: 15352 timeout 0
>> Jul  7 12:26:17 cmsls6 kernel: Lustre: 2740:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk IO comm error with f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x2000083e1be97_UUID id 12345-131.225.190.151@tcp - client will retry
>> Jul  7 12:26:19 cmsls6 kernel: Lustre: 2742:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk IO comm error with f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x2000083e1be97_UUID id 12345-131.225.190.151@tcp - client will retry
>>
>>
>> Some of these errors seem really bad, like the bulk IO comm error or the eviction due to a lock callback timing out.
>> What should I be looking for here?  I have determined that some of the messages saying a client was evicted because the
>> OSS thinks it's dead are not due to the system being down. So what makes the OSS think the client is dead?
> Well, the clients become unresponsive for some reason; you really need to look at the client-side logs for some clues on that.
I have been doing this while waiting for a reply, as well as going
through the manual and the lustre-discuss archives.
Here is an example from one client's log during the relevant time
frame:
Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred while communicating with 131.225.191.164@tcp. The obd_ping operation failed with -107
Jul  7 11:55:33 cmswn1526 kernel: Lustre: cmsprod1-OST0033-osc-ffff810617966400: Connection to service cmsprod1-OST0033 via nid 131.225.191.164@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jul  7 11:55:33 cmswn1526 kernel: LustreError: 11-0: an error occurred while communicating with 131.225.191.164@tcp. The ost_write operation failed with -107
Jul  7 11:55:35 cmswn1526 kernel: LustreError: 167-0: This client was evicted by cmsprod1-OST0033; in progress operations using this service will fail.
Jul  7 11:55:35 cmswn1526 kernel: LustreError: 3750:0:(client.c:858:ptlrpc_import_delay_req()) @@@ IMP_INVALID  req@ffff81031d414400 x1373265269802511/t0 o4->cmsprod1-OST0033_UUID@131.225.191.164@tcp:6/4 lens 448/608 e 0 to 1 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Jul  7 11:55:35 cmswn1526 kernel: Lustre: cmsprod1-OST0033-osc-ffff810617966400: Connection restored to service cmsprod1-OST0033 using nid 131.225.191.164@tcp.
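
To line up client-side logs like the one above with the OSS side, it may help to pull the eviction events out of the OSS syslog first. The following is only a minimal sketch, assuming the "138-a" message layout shown in the quoted OSS log; the names EVICTION and parse_evictions are hypothetical, and the pattern accepts the NID separator as either "@" or " at " because the list archive rewrites it.

```python
import re
from collections import Counter

# Pattern for the "LustreError: 138-a" eviction message as it appears in the
# OSS syslog quoted in this thread (an assumption: other Lustre versions may
# word the message differently).
EVICTION = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"LustreError: 138-a: (?P<target>\S+): A client on nid "
    r"(?P<addr>[\d.]+)(?:@| at )(?P<net>\w+) was evicted due to a lock "
    r"(?P<callback>\w+) callback"
)

def parse_evictions(lines):
    """Return one dict per eviction message found in an iterable of syslog lines."""
    events = []
    for line in lines:
        m = EVICTION.match(line.strip())
        if m:
            events.append(m.groupdict())
    return events

# Three lines taken from the OSS log in this thread; the last one is not an
# eviction and is skipped.
sample = [
    "Jul  7 07:59:42 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.191.35 at tcp was evicted due to a lock completion callback to 131.225.191.35 at tcp timed out: rc -107",
    "Jul  7 09:53:50 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.204.88@tcp was evicted due to a lock blocking callback to 131.225.204.88@tcp timed out: rc -107",
    "Jul  7 09:26:58 cmsls6 kernel: Lustre: 15433:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 9235f65e-ff71-2b1f-60fb-c049cbad5728 reconnecting",
]

events = parse_evictions(sample)
for e in events:
    print(e["stamp"], e["target"], e["addr"], e["callback"])

# Count evictions per client address to see which nodes are hit repeatedly.
per_client = Counter(e["addr"] for e in events)
```

The timestamp and client address from each event give a window in which to search that client's own /var/log/messages for the matching "was evicted by" entry.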

>> Also is there any way to determine what files are involved in these errors?
> Well, the lock blocking callback messages will provide you with the OST number and object index, which you might be able to back-reference to a file.
I know there is a way to do this from the /proc file system (at least I
think it's /proc), but I can't find any reference to it
in the book I got from the class or in the manual.
Can someone refresh my memory?
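
As a starting point for that back-reference, the OST and object index can at least be extracted from the "lock on destroyed export" line in the OSS log. This is a rough sketch, not a confirmed procedure: it assumes the first number after "res:" is the object id (as Oleg's description suggests) and that the four characters after "OST" in the target name are the OST index in hexadecimal. The names LOCK_RES and lock_object are hypothetical.

```python
import re

# Assumed layout of the lock dump: "ns: filter-<fsname>-OST<hex index>_UUID"
# followed later on the same line by "res: <object id>/<second field>".
LOCK_RES = re.compile(
    r"ns: filter-(?P<fsname>[^-\s]+)-OST(?P<index>[0-9a-fA-F]{4})_UUID\b"
    r".*?\bres: (?P<objid>\d+)/\d+"
)

def lock_object(line):
    """Return fsname, decimal OST index and object id from a lock dump line, or None."""
    m = LOCK_RES.search(line)
    if m is None:
        return None
    return {
        "fsname": m.group("fsname"),
        "ost_index": int(m.group("index"), 16),  # OST002d -> 45
        "objid": int(m.group("objid")),
    }

# The lock line from the OSS log in this thread, trimmed to the relevant fields.
sample_line = (
    "### lock on destroyed export ffff810c3926f400 "
    "ns: filter-cmsprod1-OST002d_UUID lock: ffff8109c7f21a00/0xf22d54118e04e04d "
    "lrc: 3/0,0 mode: --/PW res: 337742/0 rrc: 2 type: EXT [0->1048575]"
)

info = lock_object(sample_line)
print(info)
```

The decimal OST index and object id are the kind of values that `lfs getstripe` lists per stripe (as obdidx and objid), so they could be compared against its output for candidate files.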

> All that said, 1.8.3 is quite old and I think it would be a much better idea to try 1.8.6 and see if it improves things.
>
Downtimes are few and far between for us, so this may take a while to get
scheduled.
If there is anything that can be done in the meantime, I'd like to try it.

lisa

> Bye,
>      Oleg
> --
> Oleg Drokin
> Senior Software Engineer
> Whamcloud, Inc.
>
