Eduardo,<div><br></div><div>Does the OSS have one or two E5506 CPUs? Which specific LSI controller is it, and how many of them are in the OSS? </div><div><br></div><div>I think the OSS is underprovisioned for 8 OSTs. I'm betting you see high iowait on those sd devices during your problematic run. The iowait probably grows until it deadlocks. Can you run the job while watching top in a shell on the OSS? You're likely hitting 99% iowait.</div>
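<div><br></div><div>As an alternative to watching top by hand, here is a minimal sketch of how the iowait figure could be logged on the OSS during the run. It assumes a Linux kernel whose /proc/stat starts with the aggregate "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal, ...). The function names and the 5-second interval are illustrative choices, not something from this thread.</div>

```python
import time

def read_cpu_times(path="/proc/stat"):
    """Return the aggregate CPU counters (jiffies) from the first line of /proc/stat."""
    with open(path) as f:
        fields = f.readline().split()
    # fields[0] is the literal "cpu"; the rest are user, nice, system, idle, iowait, ...
    return [int(x) for x in fields[1:]]

def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    # Only the first 8 counters (user..steal) are summed; guest time is
    # already included in user/nice and would otherwise be counted twice.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait

def watch_iowait(samples=12, interval=5.0):
    """Print the iowait percentage once per interval (hypothetical helper)."""
    prev = read_cpu_times()
    for _ in range(samples):
        time.sleep(interval)
        cur = read_cpu_times()
        print(f"iowait: {iowait_percent(prev, cur):.1f}%")
        prev = cur
```

<div>Calling watch_iowait() on the OSS while the problematic job runs should show whether the value climbs toward the 99% mentioned above. It reports the system-wide average, so it will not tell you which sd device is responsible.</div>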

<div><br></div><div>--Jeff<span></span></div><div><br>On Thursday, October 17, 2013, Eduardo Murrieta  wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">

<div><div>I have this on the debug_file from my OSS:<br><br><span style="font-family:courier new,monospace">00000010:02000400:0.0:1382055634.785734:0:3099:0:(ost_handler.c:940:ost_brw_read()) lustre-OST0000: Bulk IO read error with 0afb2e4c-d<br>


870-47ef-c16f-4d2bce6dabf9 (at 10.2.64.4@o2ib), client will retry: rc -107<br><br>00000400:02000400:0.0:1382055634.786061:0:3099:0:(watchdog.c:411:lcw_update_time()) Service thread pid 3099 completed after 227.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).</span><br>


<br></div>But I can read files stored on this OST from other clients without problems. For example:<br><br><span style="font-family:courier new,monospace">$ lfs find --obd lustre-OST0000 .<br>./src/BLAS/srot.f</span><br>

...<br>

<span style="font-family:courier new,monospace"><br></span></div><span style="font-family:courier new,monospace">$ more ./src/BLAS/srot.f<br>      SUBROUTINE SROT(N,SX,INCX,SY,INCY,C,S)<br>*     .. Scalar Arguments ..<br>


      REAL C,S<br>      INTEGER INCX,INCY,N<br>*     ..<br>*     .. Array Arguments ..<br>      REAL SX(*),SY(*)<br>...<br>...</span><br><br><div>This OSS has 8 OSTs of 14 TB each, with 12 GB of RAM and a quad-core Xeon E5506. Tomorrow I'll increase the memory, in case that is the missing resource.<br>


<br><br></div><div><br><br><div><br><br><br></div></div></div><div><br><br><div>2013/10/17 Joseph Landman <span dir="ltr"><<a>landman@scalableinformatics.com</a>></span><br>

<blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="auto"><div>Are there device- or filesystem-level error messages on the server?  This almost looks like a corrupted file system.<br>


<br>Please p<span>ardon brevity and typos ... Sent from my iPhone</span></div><div><div><div><br>On Oct 17, 2013, at 6:11 PM, Eduardo Murrieta <<a>emurrieta@nucleares.unam.mx</a>> wrote:<br>


<br></div><blockquote type="cite"><div><div dir="ltr"><div><div><div>Hello Jeff,<br><br></div>No, this is a Lustre filesystem for the Instituto de Ciencias Nucleares at UNAM. We are working on the installation for Alice at DGTIC too, but this problem is with our local filesystem.<br>


<br></div>The OST is connected using an LSI SAS controller, and we have 8 OSTs on the same server. Some nodes lose connection with all the OSTs that belong to this server, but the problem is not related to OST-OSS communication, since I can access this OST and read files stored there from other Lustre clients.<br>


<br></div>The problem is a deadlock condition in which the OSS and some clients refuse connections from each other, as I can see from dmesg:<br><br>On the client:<br><span style="font-family:courier new,monospace">LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating with 10.2.2.3@o2ib, operation ost_connect failed with -16.<br>


<br></span><div>On the server:<br><span style="font-family:courier new,monospace">Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at 10.2.64.4@o2ib) reconnecting<br>Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at 10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs</span><br>
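<br>For reading the numeric codes in these messages (the -16 on ost_connect here, and the rc -107 in the OSS debug log), a small sketch follows. It assumes, as is the Linux kernel convention, that the logged values are negated Linux errno numbers. The helper name describe_rc is hypothetical, and the numbering is Linux-specific, so it should be run on a Linux host.<br>

```python
import errno
import os

def describe_rc(rc):
    """Map a negative return code from a log line to an errno name and message."""
    code = abs(rc)
    # errno.errorcode maps numbers to symbolic names on the running platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

# The two codes that appear in this thread.
for rc in (-16, -107):
    name, message = describe_rc(rc)
    print(f"rc {rc}: {name} ({message})")
```

<br>On Linux, 16 is EBUSY, which matches the server's "still busy with 9 active RPCs" message, and 107 is ENOTCONN.<br>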


 <br></div><div>This only happens with clients that are reading a lot of small files (~100 MB each) on the same OST. <br></div><div><br></div><div>Thank you,<br><br></div><div>Eduardo<br></div><div><br></div></div><div>


<br><br><div>2013/10/17 Jeff Johnson <span dir="ltr"><<a>jeff.johnson@aeoncomputing.com</a>></span><br><blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Hola Eduardo,<br>

<br>

How are the OSTs connected to the OSS (SAS, FC, Infiniband SRP)?<br>

Are there any non-Lustre errors in the dmesg output of the OSS?<br>

Block device errors on the OSS (/dev/sd?)?<br>

<br>

If you are losing [scsi,sas,fc,srp] connectivity you may see this sort<br>

of thing. If the OSTs are connected to the OSS node via IB SRP and your<br>

IB fabric gets busy or you have subnet manager issues you might see a<br>

condition like this.<br>

<br>

Is this the AliceFS at DGTIC?<br>

<br>

--Jeff<br>

<div><div><br>

<br>

<br>

On 10/17/13 3:52 PM, Eduardo Murrieta wrote:<br>

> Hello,<br>

><br>

> this is my first post on this list, I hope someone can give me some<br>

> advice on how to resolve the following issue.<br>

><br>

> I'm using Lustre release 2.4.0 RC2 compiled from Whamcloud<br>

> sources; this is an upgrade from Lustre 2.2.22 from the same sources.<br>

><br>

> The situation is:<br>

><br>

> There are several clients reading files that belong mostly to the<br>

> same OST. After a period of time the clients start losing contact<br>

> with this OST and processes stop due to this fault. Here is the state<br>

> of that OST on one client:<br>

><br>

> client# lfs check servers<br>

> ...<br>

> ...<br>

> lustre-OST000a-osc-ffff8801bc548000: check error: Resource temporarily<br>

> unavailable<br>

> ...<br>

> ...<br>

><br>

> checking dmesg on client and OSS server we have:<br>

><br>

> client# dmesg<br>

> LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating<br>

> with 10.2.2.3@o2ib, operation ost_connect failed with -16.<br>

> LustreError: Skipped 24 previous similar messages<br>

><br>

> OSS-server# dmesg<br>

> ....<br>

> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9<br>

> (at 10.2.64.4@o2ib) reconnecting<br>

> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9<br>

> (at 10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs<br>

> ....<br>

><br>

> At this moment I can ping from client to server and vice versa, but<br>

> sometimes this call also hangs on server and client.<br>

><br>

> client# lctl ping OSS-server@o2ib<br>

> 12345-0@lo<br>

> 12345-OSS-server@o2ib<br>

><br>

> OSS-server# lctl ping 10.2.64.4@o2ib<br>

> 12345-0@lo<br>

> 1234</div></div></blockquote></div></div></div></blockquote></div></div></div></blockquote></div></div></blockquote></div><br><br>-- <br><div dir="ltr">------------------------------<br>Jeff Johnson<br>Co-Founder<br>

Aeon Computing<br><br><a href="mailto:jeff.johnson@aeoncomputing.com" target="_blank">jeff.johnson@aeoncomputing.com</a><br><a href="http://www.aeoncomputing.com" target="_blank">www.aeoncomputing.com</a><br>t: 858-412-3810 x1001   f: 858-412-3845<br>

m: 619-204-9061<br><br>4170 Morena Boulevard, Suite D - San Diego, CA 92117<div><br></div><div>High-Performance Computing / Lustre Filesystems / Scale-out Storage</div></div><br>