[Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

Jeff Johnson jeff.johnson at aeoncomputing.com
Thu Oct 17 19:58:08 PDT 2013


Eduardo,

One or two E5506 CPUs in the OSS? What is the specific LSI controller, and
how many of them are in the OSS?

I think the OSS is under-provisioned for 8 OSTs. I'm betting you run a high
iowait on those sd devices during your problematic run. The iowait probably
grows until deadlock. Can you run the job while watching top in a shell on
the OSS? You're likely hitting 99% iowait.
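
As a rough sketch (the 5-second interval is arbitrary, and iostat assumes the
sysstat package is installed), something like this on the OSS while the job
runs should show it:

  OSS# top -b -d 5 | grep -i 'cpu(s)'   # watch the %wa (iowait) column
  OSS# iostat -x 5                      # per-device utilization and await times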

--Jeff

On Thursday, October 17, 2013, Eduardo Murrieta wrote:

> I have this on the debug_file from my OSS:
>
> 00000010:02000400:0.0:1382055634.785734:0:3099:0:(ost_handler.c:940:ost_brw_read())
> lustre-OST0000: Bulk IO read error with 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
> (at 10.2.64.4 at o2ib), client will retry: rc -107
>
> 00000400:02000400:0.0:1382055634.786061:0:3099:0:(watchdog.c:411:lcw_update_time())
> Service thread pid 3099 completed after 227.00s. This indicates the system
> was overloaded (too many service threads, or there were not enough hardware
> resources).
>
> But I can read files stored on this OST from other clients without
> problems. For example:
>
> $ lfs find --obd lustre-OST0000 .
> ./src/BLAS/srot.f
> ...
>
> $ more ./src/BLAS/srot.f
>       SUBROUTINE SROT(N,SX,INCX,SY,INCY,C,S)
> *     .. Scalar Arguments ..
>       REAL C,S
>       INTEGER INCX,INCY,N
> *     ..
> *     .. Array Arguments ..
>       REAL SX(*),SY(*)
> ...
> ...
>
> This OSS has 8 OSTs of 14 TB each, with 12 GB of RAM and a quad-core Xeon
> E5506. Tomorrow I'll increase the memory, in case this is the missing resource.
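>
> Since the watchdog message mentions service threads, I may also check how
> many OST I/O threads are running versus the configured maximum (a rough
> sketch; I am not sure what values are right for this hardware):
>
> OSS# lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max
> OSS# lctl set_param ost.OSS.ost_io.threads_max=128   # example cap only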
>
> 2013/10/17 Joseph Landman <landman at scalableinformatics.com>
>
> Are there device- or filesystem-level error messages on the server? This
> almost looks like a corrupted filesystem.
>
> Please pardon brevity and typos ... Sent from my iPhone
>
> On Oct 17, 2013, at 6:11 PM, Eduardo Murrieta <emurrieta at nucleares.unam.mx>
> wrote:
>
> Hello Jeff,
>
> No, this is a Lustre filesystem for the Instituto de Ciencias Nucleares at
> UNAM. We are working on the installation for Alice at DGTIC too, but this
> problem is with our local filesystem.
>
> The OSTs are connected using an LSI SAS controller, and we have 8 OSTs on
> the same server. There are nodes that lose connection with all of the OSTs
> that belong to this server, but the problem does not seem to be with the
> OST-OSS communication, since I can access this OST and read files stored
> there from other Lustre clients.
>
> The problem is a deadlock condition in which the OSS and some clients
> refuse connections from each other, as I can see from dmesg:
>
> in the client
> LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating with
> 10.2.2.3 at o2ib, operation ost_connect failed with -16.
>
> in the server
> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
> 10.2.64.4 at o2ib) reconnecting
> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
> 10.2.64.4 at o2ib) refused reconnection, still busy with 9 active RPCs
>
> This only happens with clients that are reading a lot of small files
> (~100 MB each) on the same OST.
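>
> I suppose I can confirm that the load is concentrated on one OST by looking
> at the layout of the files the job reads, for example (using the file from
> above; the obdidx column shows which OST holds each stripe):
>
> $ lfs getstripe ./src/BLAS/srot.f
>
> If I understand correctly, a wider default stripe on the directory
> ("lfs setstripe -c -1 <dir>") would spread newly created files across all
> OSTs, but it would not move the existing ones.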
>
> thank you,
>
> Eduardo
>
>
>
> 2013/10/17 Jeff Johnson <jeff.johnson at aeoncomputing.com>
>
> Hola Eduardo,
>
> How are the OSTs connected to the OSS (SAS, FC, Infiniband SRP)?
> Are there any non-Lustre errors in the dmesg output of the OSS?
> Block device errors on the OSS (/dev/sd?)?
>
> If you are losing [scsi,sas,fc,srp] connectivity, you may see this sort
> of thing. If the OSTs are connected to the OSS node via IB SRP and your
> IB fabric gets busy, or you have subnet manager issues, you might see a
> condition like this.
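>
> A quick first pass on the OSS could be something like this (just a sketch,
> the grep patterns are only suggestions):
>
> OSS# dmesg | grep -iE 'sd[a-z]|scsi|srp|i/o error'
> OSS# ibstat    # port state and rate, if infiniband-diags is installed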
>
> Is this the AliceFS at DGTIC?
>
> --Jeff
>
>
>
> On 10/17/13 3:52 PM, Eduardo Murrieta wrote:
> > Hello,
> >
> > this is my first post on this list. I hope someone can give me some
> > advice on how to resolve the following issue.
> >
> > I'm using the Lustre 2.4.0 RC2 release compiled from Whamcloud
> > sources; this is an upgrade from Lustre 2.2.22 built from the same sources.
> >
> > The situation is:
> >
> > There are several clients reading files that belong mostly to the same
> > OST. After a period of time the clients start losing contact with this
> > OST and processes stop due to this fault. Here is the state of that OST
> > on one client:
> >
> > client# lfs check servers
> > ...
> > ...
> > lustre-OST000a-osc-ffff8801bc548000: check error: Resource temporarily
> > unavailable
> > ...
> > ...
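> >
> > (To get more detail I can also dump the import state for this target on
> > the client, e.g. something like "lctl get_param osc.lustre-OST000a-osc-*.import",
> > if that parameter name is right for 2.4.)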
> >
> > Checking dmesg on the client and on the OSS server, we have:
> >
> > client# dmesg
> > LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating
> > with 10.2.2.3 at o2ib, operation ost_connect failed with -16.
> > LustreError: Skipped 24 previous similar messages
> >
> > OSS-server# dmesg
> > ....
> > Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
> > (at 10.2.64.4 at o2ib) reconnecting
> > Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
> > (at 10.2.64.4 at o2ib) refused reconnection, still busy with 9 active RPCs
> > ....
> >
> > At this moment I can ping from client to server and vice versa, but
> > sometimes this call also hangs on both server and client.
> >
> > client# lctl ping OSS-server at o2ib
> > 12345-0 at lo
> > 12345-OSS-server at o2ib
> >
> > OSS-server# lctl ping 10.2.64.4 at o2ib
> > 12345-0 at lo
> > 1234
>
>

-- 
------------------------------
Jeff Johnson
Co-Founder
Aeon Computing

jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage