<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#ffffff">
I am running Lustre 1.8.3 on both servers and clients.<br>
lisa<br>
<br>
On 7/14/11 12:59 PM, Lisa Giacchetti wrote:
<blockquote cite="mid:4E1F2E70.2060909@fnal.gov" type="cite">Hi,
<br>
We are seeing a problem where some running jobs attempted to copy
a file from local disk
<br>
on a worker node to a Lustre file system. Fourteen of those files
ended up empty or truncated.
<br>
<br>
We have 7 OSSs with either 6 or 12 OSTs on each. All 14 files
ended up on an OST on
<br>
one of the two systems that have 12 OSTs. There are 12 different
OSTs involved.
<br>
<br>
So if I look at the messages file on one of those OSSs and
specifically look for messages
<br>
related to one of the OSTs that has a truncated or empty file, I
see things like this:
<br>
<br>
Jul 7 07:10:08 cmsls6 kernel: Lustre:
15431:0:(ldlm_lib.c:575:target_handle_reconnect())
cmsprod1-OST002d: c03badd9-c242-1507-6824-3a9648c8b21f
reconnecting
<br>
Jul 7 07:59:42 cmsls6 kernel: Lustre:
3272:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
x1359120905647245 sent from cmsprod1-OST002d to NID
131.225.191.35@tcp 7s ago has timed out (7s prior to deadline).
<br>
Jul 7 07:59:42 cmsls6 kernel: LustreError: 138-a:
cmsprod1-OST002d: A client on nid 131.225.191.35@tcp was evicted
due to a lock completion callback to 131.225.191.35@tcp timed out:
rc -107
<br>
Jul 7 09:26:58 cmsls6 kernel: Lustre:
15433:0:(ldlm_lib.c:575:target_handle_reconnect())
cmsprod1-OST002d: 9235f65e-ff71-2b1f-60fb-c049cbad5728
reconnecting
<br>
Jul 7 09:53:50 cmsls6 kernel: Lustre:
2663:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
x1359120905668862 sent from cmsprod1-OST002d to NID
131.225.204.88@tcp 7s ago has timed out (7s prior to deadline).
<br>
Jul 7 09:53:50 cmsls6 kernel: LustreError: 138-a:
cmsprod1-OST002d: A client on nid 131.225.204.88@tcp was evicted
due to a lock blocking callback to 131.225.204.88@tcp timed out:
rc -107
<br>
Jul 7 10:18:57 cmsls6 kernel: LustreError: 138-a:
cmsprod1-OST002d: A client on nid 131.225.207.176@tcp was evicted
due to a lock blocking callback to 131.225.207.176@tcp timed out:
rc -107
<br>
Jul 7 10:23:01 cmsls6 kernel: Lustre:
15405:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
x1359120905675944 sent from cmsprod1-OST002d to NID
131.225.204.118@tcp 7s ago has timed out (7s prior to deadline).
<br>
Jul 7 11:06:31 cmsls6 kernel: Lustre:
15341:0:(ldlm_lib.c:575:target_handle_reconnect())
cmsprod1-OST002d: e25b2761-680a-4d94-ed2c-10913403c0a3
reconnecting
<br>
Jul 7 12:26:17 cmsls6 kernel: Lustre:
15352:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
x1359120905703492 sent from cmsprod1-OST002d to NID
131.225.190.151@tcp 7s ago has timed out (7s prior to deadline).
<br>
Jul 7 12:26:17 cmsls6 kernel: LustreError: 138-a:
cmsprod1-OST002d: A client on nid 131.225.190.151@tcp was evicted
due to a lock blocking callback to 131.225.190.151@tcp timed out:
rc -107
<br>
Jul 7 12:26:17 cmsls6 kernel: LustreError:
15352:0:(ldlm_lockd.c:1167:ldlm_handle_enqueue()) ### lock on
destroyed export ffff810c3926f400 ns: filter-cmsprod1-OST002d_UUID
lock: ffff8109c7f21a00/0xf22d54118e04e04d lrc: 3/0,0 mode: --/PW
res: 337742/0 rrc: 2 type: EXT [0->1048575] (req 0->1048575)
flags: 0x0 remote: 0x6c03f21f59f6b4e6 expref: 19 pid: 15352
timeout 0
<br>
Jul 7 12:26:17 cmsls6 kernel: Lustre:
2740:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d:
ignoring bulk IO comm error with
f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x2000083e1be97_UUID id
12345-131.225.190.151@tcp - client will retry
<br>
Jul 7 12:26:19 cmsls6 kernel: Lustre:
2742:0:(ost_handler.c:1219:ost_brw_write()) cmsprod1-OST002d:
ignoring bulk IO comm error with
f81d3629-7e6a-1b5d-810e-ad73d7f5c90d@NET_0x2000083e1be97_UUID id
12345-131.225.190.151@tcp - client will retry
<br>
<br>
<br>
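To see at a glance which clients were evicted from which OST, the messages above can be tallied with a short script. This is a minimal sketch, not a Lustre tool: the pattern is an assumption based only on the "138-a" lines quoted above, and the function name is hypothetical.<br>
<pre wrap="">
```python
import re
from collections import Counter

# ASSUMPTION: this pattern is derived only from the "138-a" eviction
# messages quoted above; other Lustre versions may word them differently.
EVICT_RE = re.compile(
    r"(\S+): A client on nid (\S+) was evicted due to a lock (\w+) callback"
)

def summarize_evictions(text):
    """Count evictions per (target, client NID, callback type)."""
    # Collapse hard line wraps so a message split over lines still matches.
    flat = " ".join(text.split())
    return Counter(EVICT_RE.findall(flat))

sample = """
Jul 7 07:59:42 cmsls6 kernel: LustreError: 138-a:
cmsprod1-OST002d: A client on nid 131.225.191.35@tcp was evicted
due to a lock completion callback to 131.225.191.35@tcp timed out:
rc -107
Jul 7 09:53:50 cmsls6 kernel: LustreError: 138-a:
cmsprod1-OST002d: A client on nid 131.225.204.88@tcp was evicted
due to a lock blocking callback to 131.225.204.88@tcp timed out:
rc -107
"""

counts = summarize_evictions(sample)
for (target, nid, kind), n in sorted(counts.items()):
    print(target, nid, kind, n)
```
</pre>
Run against the full messages file, this would show whether the evictions cluster on a few client NIDs or are spread across many.<br>
<br>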
Some of these errors seem really bad, like the bulk IO comm error
or the eviction due to a lock callback timing out.
<br>
What should I be looking for here? I have determined that some of
the messages saying a client was evicted because the
<br>
OSS thinks it is dead are not due to the client system being down.
So what makes the OSS think the client is dead?
<br>
<br>
Also, is there any way to determine which files are involved in
these errors?
<br>
<br>
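One possible starting point is the "res:" field in the "lock on destroyed export" message. The sketch below pulls that field out of the log; it assumes (unconfirmed here) that the first number is the object ID on the OST, and that objects on an ldiskfs OST live under O/0/d(objid % 32)/objid. Both function names are hypothetical, and mapping the object back to a file name would still be a separate step on the servers.<br>
<pre wrap="">
```python
import re

# ASSUMPTION: pattern based only on the ldlm_handle_enqueue() line quoted
# above; the first number after "res:" is taken to be the OST object ID.
LOCK_RE = re.compile(r"ns: filter-(\S+?)_UUID\b.*?\bres: (\d+)/\d+")

def lock_resources(text):
    """Return (OST name, object ID) pairs found in lock messages."""
    flat = " ".join(text.split())  # undo hard line wraps
    return [(ost, int(objid)) for ost, objid in LOCK_RE.findall(flat)]

def object_path(objid, subdirs=32):
    """Guess the object's path on the OST backing file system.

    ASSUMPTION: layout is O/0/d(objid % 32)/objid (group 0).
    """
    return "O/0/d%d/%d" % (objid % subdirs, objid)

# Trimmed excerpt of the message quoted above.
sample = """
destroyed export ffff810c3926f400 ns: filter-cmsprod1-OST002d_UUID
lock: ffff8109c7f21a00/0xf22d54118e04e04d lrc: 3/0,0 mode: --/PW
res: 337742/0 rrc: 2 type: EXT
"""

for ost, objid in lock_resources(sample):
    print(ost, objid, object_path(objid))
```
</pre>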
lisa
<br>
<br>
<br>
</blockquote>
<br>
</body>
</html>