[Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

Eduardo Murrieta emurrieta at nucleares.unam.mx
Thu Oct 17 18:10:50 PDT 2013


Hello Jeff,

No, this is a Lustre filesystem for the Instituto de Ciencias Nucleares at
UNAM. We are also working on the installation for Alice at DGTIC, but this
problem is with our local filesystem.

The OSTs are connected through an LSI SAS controller; we have 8 OSTs on the
same server. There are nodes that lose connection with all the OSTs that
belong to this server, but the problem does not seem to be related to the
OST-OSS communication, since I can access this OST and read files stored
there from other Lustre clients.

The problem is a deadlock condition in which the OSS and some clients
refuse connections from each other, as I can see in dmesg:

in the client
LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating with
10.2.2.3@o2ib, operation ost_connect failed with -16.
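
If I read the error code correctly, -16 is -EBUSY, which matches the
server-side messages below: the OSS rejects the reconnection while it
still holds requests from that client. On the client, the state of the
import for this OST can be inspected with something like the following
(assuming the osc name matches what "lctl dl" reports):

client# lctl get_param osc.lustre-OST000a-osc-*.import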

in the server
Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
10.2.64.4@o2ib) reconnecting
Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs
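
From what I have read in the manual, it might be possible to break this
cycle without rebooting by evicting the stuck client from the OSS, so
that the old export and its pending RPCs are dropped and the client can
reconnect cleanly. I have not tried it here yet, so this is only a
sketch, using the client UUID from the messages above:

OSS-server# lctl set_param obdfilter.lustre-OST000a.evict_client=0afb2e4c-d870-47ef-c16f-4d2bce6dabf9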

This only happens with clients that are reading a lot of small files
(~100MB each) that live on the same OST.
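
To confirm that the files really live on this OST, the striping can be
checked from a client, for example (the path below is only an
illustration):

client# lfs getstripe /lustre/data/somefile

where an obdidx of 10 (0x0a) in the output corresponds to lustre-OST000a.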

thank you,

Eduardo



2013/10/17 Jeff Johnson <jeff.johnson at aeoncomputing.com>

> Hello Eduardo,
>
> How are the OSTs connected to the OSS (SAS, FC, Infiniband SRP)?
> Are there any non-Lustre errors in the dmesg output of the OSS?
> Any block device errors on the OSS (/dev/sd?)?
>
> If you are losing [scsi,sas,fc,srp] connectivity you may see this sort
> of thing. If the OSTs are connected to the OSS node via IB SRP, and your
> IB fabric gets busy or you have subnet manager issues, you might see a
> condition like this.
>
> Is this the AliceFS at DGTIC?
>
> --Jeff
>
>
>
> On 10/17/13 3:52 PM, Eduardo Murrieta wrote:
> > Hello,
> >
> > this is my first post on this list; I hope someone can give me some
> > advice on how to resolve the following issue.
> >
> > I'm using Lustre release 2.4.0 RC2 compiled from the Whamcloud
> > sources; this is an upgrade from Lustre 2.2.22 built from the same sources.
> >
> > The situation is:
> >
> > There are several clients reading files that belong mostly to the
> > same OST. After a period of time the clients start losing contact
> > with this OST and processes stop due to this fault. Here is the state
> > of that OST on one client:
> >
> > client# lfs check servers
> > ...
> > ...
> > lustre-OST000a-osc-ffff8801bc548000: check error: Resource temporarily
> > unavailable
> > ...
> > ...
> >
> > Checking dmesg on the client and on the OSS server we have:
> >
> > client# dmesg
> > LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating
> > with 10.2.2.3@o2ib, operation ost_connect failed with -16.
> > LustreError: Skipped 24 previous similar messages
> >
> > OSS-server# dmesg
> > ....
> > Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
> > (at 10.2.64.4@o2ib) reconnecting
> > Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9
> > (at 10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs
> > ....
> >
> > At this moment I can ping from client to server and vice versa, but
> > sometimes this call also hangs on both server and client.
> >
> > client# lctl ping OSS-server@o2ib
> > 12345-0@lo
> > 12345-OSS-server@o2ib
> >
> > OSS-server# lctl ping 10.2.64.4@o2ib
> > 12345-0@lo
> > 12345-10.2.64.4@o2ib
> >
> > This situation happens very frequently, especially with jobs that
> > process a lot of files with an average size of 100MB.
> >
> > The only solution I have found to reestablish the communication
> > between the server and the client is restarting both machines.
> >
> > I hope someone has an idea of what the reason for this problem is and
> > how I can reset the communication with the clients without restarting
> > the machines.
> >
> > thank you,
> >
> > Eduardo
> > UNAM at Mexico
> >
> > --
> > Eduardo Murrieta
> > Unidad de Cómputo
> > Instituto de Ciencias Nucleares, UNAM
> > Ph. +52-55-5622-4739 ext. 5103
> >
> >
> >
>
>
> --
> ------------------------------
> Jeff Johnson
> Co-Founder
> Aeon Computing
>
> jeff.johnson at aeoncomputing.com
> www.aeoncomputing.com
> t: 858-412-3810 x1001   f: 858-412-3845
> m: 619-204-9061
>
> 4170 Morena Boulevard, Suite D - San Diego, CA 92117
>
> High-performance Computing / Lustre Filesystems / Scale-out Storage
>



-- 
Eduardo Murrieta
Unidad de Cómputo
Instituto de Ciencias Nucleares, UNAM
Ph. +52-55-5622-4739 ext. 5103