[Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

Eduardo Murrieta emurrieta at nucleares.unam.mx
Thu Oct 17 19:36:36 PDT 2013


I see the following in the debug file on my OSS:

00000010:02000400:0.0:1382055634.785734:0:3099:0:(ost_handler.c:940:ost_brw_read())
lustre-OST0000: Bulk IO read error with 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
(at 10.2.64.4 at o2ib), client will retry: rc -107

00000400:02000400:0.0:1382055634.786061:0:3099:0:(watchdog.c:411:lcw_update_time())
Service thread pid 3099 completed after 227.00s. This indicates the system
was overloaded (too many service threads, or there were not enough hardware
resources).
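
The watchdog message hints that the ost_io service may simply be running more
threads than this hardware can keep up with. As a sketch of what I plan to try
(parameter names are what I believe apply to 2.4, and 128 is only an
illustrative value, not a recommendation), I would check and, if necessary,
cap the I/O service threads on the OSS:

OSS-server# lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max
OSS-server# lctl set_param ost.OSS.ost_io.threads_max=128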

But I can read files stored on this OST from other clients without any
problem. For example:

$ lfs find --obd lustre-OST0000 .
./src/BLAS/srot.f
...

$ more ./src/BLAS/srot.f
      SUBROUTINE SROT(N,SX,INCX,SY,INCY,C,S)
*     .. Scalar Arguments ..
      REAL C,S
      INTEGER INCX,INCY,N
*     ..
*     .. Array Arguments ..
      REAL SX(*),SY(*)
...
...

This OSS has 8 OSTs of 14 TB each, with 12 GB of RAM and a quad-core Xeon
E5506. Tomorrow I'll increase the memory, in case that is the missing resource.
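
About the reconnection deadlock in the thread quoted below ("refused
reconnection, still busy with 9 active RPCs"), I wonder whether I could evict
the stuck export from the OSS instead of rebooting both machines. A sketch of
what I have in mind, using the client UUID and NID from the messages below (I
have not tried this yet, and any in-flight I/O on that client would presumably
fail):

OSS-server# lctl set_param obdfilter.lustre-OST000a.evict_client=0afb2e4c-d870-47ef-c16f-4d2bce6dabf9

or, by NID:

OSS-server# lctl set_param obdfilter.lustre-OST000a.evict_client=nid:10.2.64.4@o2ib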


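On the clients that read many ~100 MB files from this one OST (see the quoted
thread below), I am also wondering whether lowering the per-target RPC
concurrency would take some pressure off the OSS while I sort out the
hardware. A sketch, assuming the default of 8 RPCs in flight; 4 is only an
illustrative value:

client# lctl get_param osc.lustre-OST000a-*.max_rpcs_in_flight
client# lctl set_param osc.lustre-OST000a-*.max_rpcs_in_flight=4
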
2013/10/17 Joseph Landman <landman at scalableinformatics.com>

> Are there device- or filesystem-level error messages on the server? This
> almost looks like a corrupted filesystem.
>
> Please pardon brevity and typos ... Sent from my iPhone
>
> On Oct 17, 2013, at 6:11 PM, Eduardo Murrieta <emurrieta at nucleares.unam.mx>
> wrote:
>
> Hello Jeff,
>
> No, this is a Lustre filesystem for the Instituto de Ciencias Nucleares at
> UNAM. We are working on the installation for Alice at DGTIC too, but this
> problem is with our local filesystem.
>
> The OSTs are connected using an LSI SAS controller; we have 8 OSTs on the
> same server. There are nodes that lose the connection with all the OSTs
> that belong to this server, but the problem is not related to the OST-OSS
> communication, since I can access this OST and read files stored there
> from other Lustre clients.
>
> The problem is a deadlock condition in which the OSS and some clients
> refuse connections from each other, as I can see from dmesg:
>
> On the client:
> LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating with
> 10.2.2.3 at o2ib, operation ost_connect failed with -16.
>
> On the server:
> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
> 10.2.64.4 at o2ib) reconnecting
> Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
> 10.2.64.4 at o2ib) refused reconnection, still busy with 9 active RPCs
>
> This only happens with clients that are reading a lot of small files
> (~100 MB each) on the same OST.
>
> thank you,
>
> Eduardo
>
>
>
> 2013/10/17 Jeff Johnson <jeff.johnson at aeoncomputing.com>
>
>> Hola Eduardo,
>>
>> How are the OSTs connected to the OSS (SAS, FC, InfiniBand SRP)?
>> Are there any non-Lustre errors in the dmesg output of the OSS?
>> Block device errors on the OSS (/dev/sd?)?
>>
>> If you are losing [scsi,sas,fc,srp] connectivity, you may see this sort
>> of thing. If the OSTs are connected to the OSS node via IB SRP and your
>> IB fabric gets busy, or you have subnet manager issues, you might see a
>> condition like this.
>>
>> Is this the AliceFS at DGTIC?
>>
>> --Jeff
>>
>>
>>
>> On 10/17/13 3:52 PM, Eduardo Murrieta wrote:
>> > Hello,
>> >
>> > This is my first post on this list; I hope someone can give me some
>> > advice on how to resolve the following issue.
>> >
>> > I'm using the Lustre 2.4.0 RC2 release compiled from Whamcloud
>> > sources; this is an upgrade from Lustre 2.2.22 built from the same sources.
>> >
>> > The situation is:
>> >
>> > There are several clients reading files that belong mostly to the
>> > same OST. After a period of time the clients start losing contact
>> > with this OST, and processes stop due to this fault. Here is the state
>> > of that OST on one client:
>> >
>> > client# lfs check servers
>> > ...
>> > ...
>> > lustre-OST000a-osc-ffff8801bc548000: check error: Resource temporarily
>> > unavailable
>> > ...
>> > ...
>> >
>> > Checking dmesg on the client and on the OSS server, we have:
>> >
>> > client# dmesg
>> > LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating
>> > with 10.2.2.3 at o2ib, operation ost_connect failed with -16.
>> > LustreError: Skipped 24 previous similar messages
>> >
>> > OSS-server# dmesg
>> > ....
>> > Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
>> > (at 10.2.64.4 at o2ib) reconnecting
>> > Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
>> > (at 10.2.64.4 at o2ib) refused reconnection, still busy with 9 active RPCs
>> > ....
>> >
>> > At this moment I can ping from the client to the server and vice versa,
>> > but sometimes this call also hangs on both the server and the client.
>> >
>> > client# # lctl ping OSS-server at o2ib
>> > 12345-0 at lo
>> > 12345-OSS-server at o2ib
>> >
>> > OSS-server# lctl ping 10.2.64.4 at o2ib
>> > 12345-0 at lo
>> > 12345-10.2.64.4 at o2ib
>> >
>> > This situation happens very frequently, especially with jobs that
>> > process a lot of files with an average size of 100 MB.
>> >
>> > The only solution I have found to reestablish communication
>> > between the server and the client is restarting both machines.
>> >
>> > I hope someone has an idea of what the reason for the problem is and
>> > how I can reset the communication with the clients without restarting
>> > the machines.
>> >
>> > thank you,
>> >
>> > Eduardo
>> > UNAM at Mexico
>> >
>> > --
>> > Eduardo Murrieta
>> > Unidad de Cómputo
>> > Instituto de Ciencias Nucleares, UNAM
>> > Ph. +52-55-5622-4739 ext. 5103
>> >
>> >
>> >
>>
>>
>> --
>> ------------------------------
>> Jeff Johnson
>> Co-Founder
>> Aeon Computing
>>
>> jeff.johnson at aeoncomputing.com
>> www.aeoncomputing.com
>> t: 858-412-3810 x1001   f: 858-412-3845
>> m: 619-204-9061
>>
>> 4170 Morena Boulevard, Suite D - San Diego, CA 92117
>>
>> High-performance Computing / Lustre Filesystems / Scale-out Storage
>>
>>
>
>
>
> --
> Eduardo Murrieta
> Unidad de Cómputo
> Instituto de Ciencias Nucleares, UNAM
> Ph. +52-55-5622-4739 ext. 5103
>
>
>


-- 
Eduardo Murrieta
Unidad de Cómputo
Instituto de Ciencias Nucleares, UNAM
Ph. +52-55-5622-4739 ext. 5103