[Lustre-discuss] potential issue with data corruption

Thu Jul 14 11:15:02 PDT 2011

I am running Lustre 1.8.3 on both the servers and the clients.
lisa

On 7/14/11 12:59 PM, Lisa Giacchetti wrote:
> Hi,
> We are seeing a problem where some running jobs attempted to copy a
> file from local disk on a worker node to a Lustre file system, and 14
> of those files ended up empty or truncated.
>
> We have 7 OSSs with either 6 or 12 OSTs on each. All 14 files ended
> up on an OST on one of the two systems that have 12 OSTs. There are
> 12 different OSTs involved.
>
> So if I look at the messages file on one of those OSSs, specifically
> for messages related to one of the OSTs that has a truncated or empty
> file, I see things like this:
>
> Jul  7 07:10:08 cmsls6 kernel: Lustre: 15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
> Jul  7 07:59:42 cmsls6 kernel: Lustre: 3272:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905647245 sent from cmsprod1-OST002d to NID 131.225.191.35@tcp 7s ago has timed out (7s prior to deadline).
> Jul  7 07:59:42 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.191.35@tcp was evicted due to a lock completion callback to 131.225.191.35@tcp timed out: rc -107
> Jul  7 09:26:58 cmsls6 kernel: Lustre: 15433:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 9235f65e-ff71-2b1f-60fb-c049cbad5728 reconnecting
> Jul  7 09:53:50 cmsls6 kernel: Lustre: 2663:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905668862 sent from cmsprod1-OST002d to NID 131.225.204.88@tcp 7s ago has timed out (7s prior to deadline).
> Jul  7 09:53:50 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.204.88@tcp was evicted due to a lock blocking callback to 131.225.204.88@tcp timed out: rc -107
> Jul  7 10:18:57 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.207.176@tcp was evicted due to a lock blocking callback to 131.225.207.176@tcp timed out: rc -107
> Jul  7 10:23:01 cmsls6 kernel: Lustre: 15405:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905675944 sent from cmsprod1-OST002d to NID 131.225.204.118@tcp 7s ago has timed out (7s prior to deadline).
> Jul  7 11:06:31 cmsls6 kernel: Lustre: 15341:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: e25b2761-680a-4d94-ed2c-10913403c0a3 reconnecting
> Jul  7 12:26:17 cmsls6 kernel: Lustre: 15352:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905703492 sent from cmsprod1-OST002d to NID 131.225.190.151@tcp 7s ago has timed out (7s prior to deadline).
> Jul  7 12:26:17 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.190.151@tcp was evicted due to a lock blocking callback to 131.225.190.151@tcp timed out: rc -107
> Jul  7 12:26:17 cmsls6 kernel: LustreError: 15352:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed export ffff810c3926f400 ns: filter-cmsprod1-OST002d_UUID lock: ffff8109c7f21a00/0xf22d54118e04e04d lrc: 3/0,0 mode: --/PW res: 337742/0 rrc: 2 type: EXT [0->1048575] (req 0->1048575) flags: 0x0 remote: 0x6c03f21f59f6b4e6 expref: 19 pid: 15352 timeout 0
> Jul  7 12:26:17 cmsls6 kernel: Lustre: 2740:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk IO comm error with f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x2000083e1be97_UUID id 12345-131.225.190.151@tcp - client will retry
> Jul  7 12:26:19 cmsls6 kernel: Lustre: 2742:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk IO comm error with f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x2000083e1be97_UUID id 12345-131.225.190.151@tcp - client will retry
>
>
> Some of these errors seem really bad, like the bulk IO comm error or
> the eviction due to a lock callback.
> What should I be looking for here? I have determined that some of the
> messages saying a client has been evicted because the OSS thinks it is
> dead are not due to the system being down. So what makes the OSS think
> the client is dead?
>
> Also, is there any way to determine which files are involved in these
> errors?
>
> lisa
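
To see which clients and OSTs the eviction messages quoted above involve, something like the following could be run over the messages file. It is only a rough sketch: it assumes Python 3 and that the eviction lines are worded exactly like the excerpts in this post, and the names `unwrap`, `evictions` and `evictions_per_client` are made up for illustration. It does not say anything about which files were affected.

```python
import re
from collections import Counter

# Pattern for the "was evicted due to a lock ... callback" messages quoted in
# this thread. The NID is accepted both as "1.2.3.4@tcp" and in the
# archive-obfuscated form "1.2.3.4 at tcp".
EVICT_RE = re.compile(
    r"(?P<ost>\S+-OST[0-9a-fA-F]+): A client on nid "
    r"(?P<nid>\S+?)(?:@| at )(?P<net>\w+) "
    r"was evicted due to a lock (?P<kind>\w+) callback"
)

def unwrap(text):
    """Drop mail quote prefixes ('> ') and join wrapped lines with spaces."""
    parts = []
    for line in text.splitlines():
        line = line.lstrip("> ").strip()
        if line:
            parts.append(line)
    return " ".join(parts)

def evictions(text):
    """Return (ost, nid, callback_kind) for every eviction message found."""
    flat = unwrap(text)
    return [
        (m["ost"], m["nid"] + "@" + m["net"], m["kind"])
        for m in EVICT_RE.finditer(flat)
    ]

def evictions_per_client(text):
    """Count how many times each client NID was evicted."""
    return Counter(nid for _ost, nid, _kind in evictions(text))
```

Calling `evictions(open("/var/log/messages").read())` on an OSS would give one tuple per eviction (the log path is an assumption about the local syslog setup). Comparing the NIDs and times against the worker nodes and job times that produced the truncated files may help narrow down which evictions matter.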
