[Lustre-discuss] potential issue with data corruption

Lisa Giacchetti lisa at fnal.gov
Thu Jul 14 10:59:12 PDT 2011


Hi,
  We are seeing a problem in which some running jobs attempted to copy a
file from local disk on a worker node to a Lustre file system, and 14 of
those files ended up empty or truncated.

We have 7 OSSs with either 6 or 12 OSTs each. All 14 files ended up on an
OST on one of the two systems that have 12 OSTs. There are 12 different
OSTs involved.

So if I look at the messages file on one of those OSSs and look
specifically for messages related to one of the OSTs that has a truncated
or empty file, I see things like this:

Jul  7 07:10:08 cmsls6 kernel: Lustre: 15431:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: c03badd9-c242-1507-6824-3a9648c8b21f reconnecting
Jul  7 07:59:42 cmsls6 kernel: Lustre: 3272:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905647245 sent from cmsprod1-OST002d to NID 131.225.191.35@tcp 7s ago has timed out (7s prior to deadline).
Jul  7 07:59:42 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.191.35@tcp was evicted due to a lock completion callback to 131.225.191.35@tcp timed out: rc -107
Jul  7 09:26:58 cmsls6 kernel: Lustre: 15433:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: 9235f65e-ff71-2b1f-60fb-c049cbad5728 reconnecting
Jul  7 09:53:50 cmsls6 kernel: Lustre: 2663:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905668862 sent from cmsprod1-OST002d to NID 131.225.204.88@tcp 7s ago has timed out (7s prior to deadline).
Jul  7 09:53:50 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.204.88@tcp was evicted due to a lock blocking callback to 131.225.204.88@tcp timed out: rc -107
Jul  7 10:18:57 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.207.176@tcp was evicted due to a lock blocking callback to 131.225.207.176@tcp timed out: rc -107
Jul  7 10:23:01 cmsls6 kernel: Lustre: 15405:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905675944 sent from cmsprod1-OST002d to NID 131.225.204.118@tcp 7s ago has timed out (7s prior to deadline).
Jul  7 11:06:31 cmsls6 kernel: Lustre: 15341:0:(ldlm_lib.c:575:target_handle_reconnect()) cmsprod1-OST002d: e25b2761-680a-4d94-ed2c-10913403c0a3 reconnecting
Jul  7 12:26:17 cmsls6 kernel: Lustre: 15352:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1359120905703492 sent from cmsprod1-OST002d to NID 131.225.190.151@tcp 7s ago has timed out (7s prior to deadline).
Jul  7 12:26:17 cmsls6 kernel: LustreError: 138-a: cmsprod1-OST002d: A client on nid 131.225.190.151@tcp was evicted due to a lock blocking callback to 131.225.190.151@tcp timed out: rc -107
Jul  7 12:26:17 cmsls6 kernel: LustreError: 15352:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on destroyed export ffff810c3926f400 ns: filter-cmsprod1-OST002d_UUID lock: ffff8109c7f21a00/0xf22d54118e04e04d lrc: 3/0,0 mode: --/PW res: 337742/0 rrc: 2 type: EXT [0->1048575] (req 0->1048575) flags: 0x0 remote: 0x6c03f21f59f6b4e6 expref: 19 pid: 15352 timeout 0
Jul  7 12:26:17 cmsls6 kernel: Lustre: 2740:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk IO comm error with f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x2000083e1be97_UUID id 12345-131.225.190.151@tcp - client will retry
Jul  7 12:26:19 cmsls6 kernel: Lustre: 2742:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d: ignoring bulk IO comm error with f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x2000083e1be97_UUID id 12345-131.225.190.151@tcp - client will retry
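
To see whether the evictions cluster on particular clients or callback
types, the 138-a messages can be tallied with a short script. This is only
a rough sketch, assuming the message layout matches the excerpt above; the
function names parse_evictions and evictions_per_nid are made up for
illustration. It collapses whitespace first, so it should cope with
entries that a mail client has wrapped, and it accepts the NID written
either as "addr@tcp" or as "addr at tcp".

```python
import re
from collections import Counter

# Pattern for the "138-a" eviction message as it appears in the excerpt.
# The NID separator may be "@" or the archive-style " at ".
EVICTION_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"LustreError:\s+138-a:\s+(?P<ost>\S+):\s+A client on nid\s+"
    r"(?P<addr>[\d.]+)(?:@|\s+at\s+)(?P<net>\w+)\s+"
    r"was evicted due to a lock\s+(?P<kind>\w+)\s+callback"
)


def parse_evictions(log_text):
    """Return one dict per eviction message found in log_text."""
    # Collapse all runs of whitespace so wrapped entries become one line.
    flat = " ".join(log_text.split())
    return [
        {
            "time": m["ts"],
            "ost": m["ost"],
            "nid": f'{m["addr"]}@{m["net"]}',
            "callback": m["kind"],  # "blocking" or "completion" in the excerpt
        }
        for m in EVICTION_RE.finditer(flat)
    ]


def evictions_per_nid(log_text):
    """Count eviction messages per client NID."""
    return Counter(e["nid"] for e in parse_evictions(log_text))
```

Feeding it the contents of the messages file from each OSS and comparing
the counts across OSSs might show whether the evicted clients are the same
worker nodes that wrote the bad files.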


Some of these errors seem really bad, like the bulk IO comm error or
the eviction due to a lock callback timing out.
What should I be looking for here?  I have determined that some of the
messages saying a client has been evicted because the OSS thinks it is
dead are not due to the client system actually being down. So what makes
the OSS think the client is dead?

Also, is there any way to determine which files are involved in these errors?
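
One starting point might be the "res:" field in the ldlm_handle_enqueue
message, which, as far as I understand it, names the OST object the lock
covers (337742/0 in the excerpt). The sketch below only pulls the OST name
and that pair of numbers out of the log text; reading them as object ID
and group is an assumption on my part, and the function name
parse_lock_resources is hypothetical.

```python
import re

# Matches the lock description in "### lock on destroyed export" messages,
# following the field order seen in the excerpt (ns, lock, lrc, mode, res).
LOCK_RES_RE = re.compile(
    r"ns:\s+filter-(?P<ost>\S+?)_UUID\s+lock:\s+\S+\s+lrc:\s+\S+\s+"
    r"mode:\s+\S+\s+res:\s+(?P<objid>\d+)/(?P<group>\d+)"
)


def parse_lock_resources(log_text):
    """Return (ost_name, object_id, group) for each lock message found."""
    # Collapse whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [
        (m["ost"], int(m["objid"]), int(m["group"]))
        for m in LOCK_RES_RE.finditer(flat)
    ]
```

If that reading of "res:" is right, the object IDs could be compared with
the objid values that "lfs getstripe" prints for the truncated files on a
client.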

lisa

