[Lustre-discuss] potential issue with data corruption
Lisa Giacchetti
lisa at fnal.gov
Thu Jul 14 11:15:02 PDT 2011
I am running Lustre 1.8.3 on both servers and clients.
lisa
On 7/14/11 12:59 PM, Lisa Giacchetti wrote:
> Hi,
> We are seeing a problem where some running jobs attempted to copy a
> file from local disk on a worker node to a Lustre file system, and
> 14 of those files ended up empty or truncated.
>
> We have 7 OSSs with either 6 or 12 OSTs on each. All 14 files ended
> up on an OST on one of the two systems that have 12 OSTs. There are
> 12 different OSTs involved.
>
> If I look at the messages file on one of those OSSs, specifically for
> messages related to one of the OSTs that has a truncated or empty
> file, I see things like this:
>
> Jul 7 07:10:08 cmsls6 kernel: Lustre:
> 15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d:
> c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
> Jul 7 07:59:42 cmsls6 kernel: Lustre:
> 3272:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
> x1359120905647245 sent from cmsprod1-OST002d to NID 131.225.191.35@tcp
> 7s ago has timed out (7s prior to deadline).
> Jul 7 07:59:42 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A
> client on nid 131.225.191.35@tcp was evicted due to a lock completion
> callback to 131.225.191.35@tcp timed out: rc -107
> Jul 7 09:26:58 cmsls6 kernel: Lustre:
> 15433:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d:
> 9235f65e-ff71-2b1f-60fb-c049cbad5728 reconnecting
> Jul 7 09:53:50 cmsls6 kernel: Lustre:
> 2663:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
> x1359120905668862 sent from cmsprod1-OST002d to NID 131.225.204.88@tcp
> 7s ago has timed out (7s prior to deadline).
> Jul 7 09:53:50 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A
> client on nid 131.225.204.88@tcp was evicted due to a lock blocking
> callback to 131.225.204.88@tcp timed out: rc -107
> Jul 7 10:18:57 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A
> client on nid 131.225.207.176@tcp was evicted due to a lock blocking
> callback to 131.225.207.176@tcp timed out: rc -107
> Jul 7 10:23:01 cmsls6 kernel: Lustre:
> 15405:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
> x1359120905675944 sent from cmsprod1-OST002d to NID
> 131.225.204.118@tcp 7s ago has timed out (7s prior to deadline).
> Jul 7 11:06:31 cmsls6 kernel: Lustre:
> 15341:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d:
> e25b2761-680a-4d94-ed2c-10913403c0a3 reconnecting
> Jul 7 12:26:17 cmsls6 kernel: Lustre:
> 15352:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
> x1359120905703492 sent from cmsprod1-OST002d to NID
> 131.225.190.151@tcp 7s ago has timed out (7s prior to deadline).
> Jul 7 12:26:17 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A
> client on nid 131.225.190.151@tcp was evicted due to a lock blocking
> callback to 131.225.190.151@tcp timed out: rc -107
> Jul 7 12:26:17 cmsls6 kernel: LustreError:
> 15352:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on
> destroyed export ffff810c3926f400 ns: filter-cmsprod1-OST002d_UUID
> lock: ffff8109c7f21a00/0xf22d54118e04e04d lrc: 3/0,0 mode: --/PW res:
> 337742/0 rrc: 2 type: EXT [0->1048575] (req 0->1048575) flags: 0x0
> remote: 0x6c03f21f59f6b4e6 expref: 19 pid: 15352 timeout 0
> Jul 7 12:26:17 cmsls6 kernel: Lustre:
> 2740:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring
> bulk IO comm error with
> f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x2000083e1be97_UUID id
> 12345-131.225.190.151@tcp - client will retry
> Jul 7 12:26:19 cmsls6 kernel: Lustre:
> 2742:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring
> bulk IO comm error with
> f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x2000083e1be97_UUID id
> 12345-131.225.190.151@tcp - client will retry
>
>
> Some of these errors seem really bad, like the bulk IO comm error or
> the eviction due to a lock callback timing out.
> What should I be looking for here? I have determined that some of the
> messages saying a client was evicted because the OSS thinks it is dead
> are not due to the system being down. So what makes
> the OSS think the client is dead?
>
> Also is there any way to determine what files are involved in these
> errors?
>
> lisa
>
>
>
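One way to narrow down which clients and OSTs are involved is to pull the
eviction events out of the OSS messages file and group them. The following
is only a minimal sketch, not a Lustre tool: it assumes each syslog entry
sits on a single unwrapped line, that NIDs appear as address@network (for
example 131.225.191.35@tcp), and that the "138-a" eviction message has the
wording shown in the excerpt above. The names find_evictions and
count_by_target are hypothetical helpers introduced for illustration.

```python
import re
from collections import Counter

# Matches the "138-a" eviction message on one unwrapped syslog line.
# Assumption: the wording is the same as in the log excerpt quoted above.
EVICT_RE = re.compile(
    r"LustreError: 138-a: (?P<target>\S+): A client on nid (?P<nid>\S+) "
    r"was evicted due to a lock (?P<kind>\w+) callback"
)


def find_evictions(lines):
    """Return a (target, nid, callback kind) tuple for every eviction line."""
    events = []
    for line in lines:
        match = EVICT_RE.search(line)
        if match:
            events.append(
                (match.group("target"), match.group("nid"), match.group("kind"))
            )
    return events


def count_by_target(events):
    """Count evictions per OST name."""
    return Counter(target for target, _nid, _kind in events)


# Sample lines adapted from the excerpt above, joined onto single lines.
sample = [
    "Jul 7 07:59:42 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: "
    "A client on nid 131.225.191.35@tcp was evicted due to a lock completion "
    "callback to 131.225.191.35@tcp timed out: rc -107",
    "Jul 7 09:26:58 cmsls6 kernel: Lustre: "
    "15433:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: "
    "9235f65e-ff71-2b1f-60fb-c049cbad5728 reconnecting",
    "Jul 7 09:53:50 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: "
    "A client on nid 131.225.204.88@tcp was evicted due to a lock blocking "
    "callback to 131.225.204.88@tcp timed out: rc -107",
]

events = find_evictions(sample)
for target, nid, kind in events:
    print(target, nid, kind)
print(count_by_target(events))
```

To run it against a real log, the sample list could be replaced with an
open file handle on the OSS messages file. Comparing the evicted NIDs and
timestamps with the worker nodes and times of the failed copies would show
whether the truncated files line up with evictions; it does not by itself
name the files.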