[Lustre-discuss] Broken communication between OSS and Client on Lustre 2.4

Eduardo Murrieta emurrieta at nucleares.unam.mx
Thu Oct 17 15:52:53 PDT 2013


Hello,

This is my first post on this list. I hope someone can give me some advice
on how to resolve the following issue.

I'm using Lustre release 2.4.0 RC2, compiled from the Whamcloud sources.
This is an upgrade from Lustre 2.2.22, built from the same sources.

The situation is:

There are several clients reading files that belong mostly to the same
OST. After a period of time the clients start losing contact with this
OST and their processes stop because of this fault. Here is the state of
that OST on one client:

client# lfs check servers
...
...
lustre-OST000a-osc-ffff8801bc548000: check error: Resource temporarily
unavailable
...
...
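
When many targets are listed, the failing ones can be picked out of this
output automatically. Below is a minimal Python sketch, assuming that
failing targets are always reported on one line in the
"<target>: check error: <reason>" form shown above. The function name
failed_targets is hypothetical and the sample string is the line from this
client:

```python
import re

# Matches lines such as:
#   lustre-OST000a-osc-ffff8801bc548000: check error: Resource temporarily unavailable
CHECK_ERROR = re.compile(r"^(?P<target>\S+): check error: (?P<reason>.+)$")


def failed_targets(output):
    """Return (target, reason) pairs for every 'check error' line in output."""
    failures = []
    for line in output.splitlines():
        match = CHECK_ERROR.match(line.strip())
        if match:
            failures.append((match.group("target"), match.group("reason")))
    return failures


if __name__ == "__main__":
    # Sample taken from the client output above, joined onto one line.
    sample = (
        "lustre-OST000a-osc-ffff8801bc548000: check error: "
        "Resource temporarily unavailable\n"
    )
    for target, reason in failed_targets(sample):
        print(target, "->", reason)
```

The text passed to failed_targets would be the captured output of
"lfs check servers"; lines that do not contain "check error" are ignored.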

Checking dmesg on the client and the OSS server, we have:

client# dmesg
LustreError: 11-0: lustre-OST000a-osc-ffff8801bc548000: Communicating with
10.2.2.3@o2ib, operation ost_connect failed with -16.
LustreError: Skipped 24 previous similar messages

OSS-server# dmesg
....
Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
10.2.64.4@o2ib) reconnecting
Lustre: lustre-OST000a: Client 0afb2e4c-d870-47ef-c16f-4d2bce6dabf9 (at
10.2.64.4@o2ib) refused reconnection, still busy with 9 active RPCs
....
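
For reference, the -16 in the client message is a negated Linux errno
value. Below is a minimal Python sketch for decoding such codes, assuming
it is run on a Linux host (errno numbering and message text are
platform-specific). The helper name decode_kernel_rc is hypothetical:

```python
import errno
import os


def decode_kernel_rc(rc):
    """Map a kernel-style return code (e.g. -16) to (symbol, message)."""
    code = abs(rc)
    # errno.errorcode maps numeric values to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # -16 is the code reported for ost_connect in the client log above.
    print(decode_kernel_rc(-16))
```

On Linux, 16 is EBUSY, which is consistent with the server refusing the
reconnection because the client is "still busy" with active RPCs.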

At this point I can ping from the client to the server and vice versa, but
sometimes this call also hangs on both the server and the client.

client# lctl ping OSS-server@o2ib
12345-0@lo
12345-OSS-server@o2ib

OSS-server# lctl ping 10.2.64.4@o2ib
12345-0@lo
12345-10.2.64.4@o2ib

This situation happens very frequently, especially with jobs that process
a lot of files with an average size of 100 MB.

The only way I have found to reestablish communication between the server
and the client is to restart both machines.

I hope someone has an idea of what is causing the problem and how I can
reset the communication with the clients without restarting the machines.

Thank you,

Eduardo
UNAM at Mexico

-- 
Eduardo Murrieta
Unidad de Cómputo
Instituto de Ciencias Nucleares, UNAM
Ph. +52-55-5622-4739 ext. 5103