[Lustre-discuss] [HPDD-discuss] Failure to connect to some OST from a client machine

Kris Howard khoward at eng.utah.edu
Thu Sep 5 13:22:26 PDT 2013


If you run lctl ping 10.10.X.XX@tcp from both sides, it should bring the
route up.
And with all of those routes down, everything is still happy?
Hmm.
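
To pick the downed routes out of a listing like the one quoted below without reading it by eye, something along these lines might help. It is only a sketch: the function names are made up for illustration, it assumes each host's output is preceded by an "On <hostname>" line as in that listing, and it accepts the gateway NID written either as 10.10.1.52@tcp or in the archive's "10.10.1.52 at tcp" form.

```python
import re

# One route line of `lctl show_route` output. The gateway NID is accepted
# either as "10.10.1.52@tcp" or in the list archive's "10.10.1.52 at tcp" form.
ROUTE_RE = re.compile(
    r"net\s+(?P<net>\S+)\s+hops\s+(?P<hops>\d+)\s+gw\s+"
    r"(?P<addr>\S+?)(?:@|\s+at\s+)(?P<lnet>\S+)\s+(?P<state>up|down)$"
)
# Per-host header line such as "On umdist05.local". This format is an
# assumption taken from the quoted listing, not something lctl prints.
HOST_RE = re.compile(r"On\s+(?P<host>\S+)$")


def down_routes(text):
    """Map each host to the gateway NIDs its routes report as down."""
    result = {}
    host = "(unknown host)"
    for raw in text.splitlines():
        line = raw.lstrip("> ").rstrip()  # drop email quote markers and padding
        m = HOST_RE.match(line)
        if m:
            host = m.group("host")
            result.setdefault(host, [])
            continue
        m = ROUTE_RE.match(line)
        if m and m.group("state") == "down":
            nid = f"{m.group('addr')}@{m.group('lnet')}"
            result.setdefault(host, []).append(nid)
    return result


def ping_commands(down):
    """Build one `lctl ping <nid>` command per distinct down gateway."""
    nids = sorted({nid for host_nids in down.values() for nid in host_nids})
    return [f"lctl ping {nid}" for nid in nids]
```

Feed down_routes() the saved output and it returns a dict keyed by host. ping_commands() only builds the command strings, so you still decide where to run them (on the server and on the router, per the note above).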


On Thu, Sep 5, 2013 at 1:14 PM, Bob Ball <ball at umich.edu> wrote:

>  That is an interesting mix.  Nothing shows up at all on the clients, even
> on those 3 that route to a second NIC.  On the OSS, it is quite the mix of
> up/down on the 3 routers, with no obvious pattern.
>
> Most of our traffic is on the 10.10 network, with the 3 machines shown
> below routing to a small number of clients on a more public network.
>
> FYI, the current situation is one in which all machines are happy, as far
> as I can tell.
>
> bob
>
> Running lctl show_route on all machines in lustre_fss.txt
> On umdist05.local
> net               tcp2 hops 1 gw                   10.10.1.52@tcp down
> net               tcp2 hops 1 gw                   10.10.1.51@tcp down
> net               tcp2 hops 1 gw                   10.10.1.50@tcp up
>          Succeeded
> On umfs06.local
> net               tcp2 hops 1 gw                   10.10.1.51@tcp down
> net               tcp2 hops 1 gw                   10.10.1.50@tcp up
> net               tcp2 hops 1 gw                   10.10.1.52@tcp up
>          Succeeded
> On umdist01.local
> net               tcp2 hops 1 gw                   10.10.1.52@tcp down
> net               tcp2 hops 1 gw                   10.10.1.51@tcp down
> net               tcp2 hops 1 gw                   10.10.1.50@tcp up
>          Succeeded
> On umdist02.local
> net               tcp2 hops 1 gw                   10.10.1.52@tcp down
> net               tcp2 hops 1 gw                   10.10.1.51@tcp down
> net               tcp2 hops 1 gw                   10.10.1.50@tcp up
>          Succeeded
> On umdist03.local
> net               tcp2 hops 1 gw                   10.10.1.51@tcp down
> net               tcp2 hops 1 gw                   10.10.1.52@tcp up
> net               tcp2 hops 1 gw                   10.10.1.50@tcp up
>          Succeeded
> On umdist04.local
> net               tcp2 hops 1 gw                   10.10.1.52@tcp down
> net               tcp2 hops 1 gw                   10.10.1.51@tcp down
> net               tcp2 hops 1 gw                   10.10.1.50@tcp up
>          Succeeded
> On umdist07.local
> net               tcp2 hops 1 gw                   10.10.1.50@tcp down
> net               tcp2 hops 1 gw                   10.10.1.52@tcp down
> net               tcp2 hops 1 gw                   10.10.1.51@tcp down
>          Succeeded
> On umdist08.local
> net               tcp2 hops 1 gw                   10.10.1.50@tcp down
> net               tcp2 hops 1 gw                   10.10.1.52@tcp down
> net               tcp2 hops 1 gw                   10.10.1.51@tcp down
>          Succeeded
>
>
> On 9/5/2013 4:01 PM, Kris Howard wrote:
>
> Might check lctl show_route and look for downed routes.
>
>
> On Thu, Sep 5, 2013 at 12:56 PM, Bob Ball <ball at umich.edu> wrote:
>
>> We are running Lustre 2.1.6 on Scientific Linux 6.4, kernel
>> 2.6.32-358.11.1.el6.x86_64.  This was an upgrade from Lustre 1.8.4 on SL5.
>>
>> We have had a few situations lately where a client stops talking to some
>> subset of the OSTs (about 58 in total on 8 OSSes, nearly 500 TB
>> altogether).  I have a couple of questions.
>>
>> 1. "lctl dl" on the OSS shows a smaller count on the affected servers;
>> on the client, all OSS showed UP in "lctl dl".  Today, I first tried
>> rebooting this OSS, but that did not change the situation.  I ended up
>> rebooting the client before I could get full connectivity.  Is there any
>> way to force the reconnect from the client, short of rebooting it?
>>
>> 2. Under Lustre 1.8.4, I could run "lfs df -h" on the client and see
>> all OSTs, even those where the connection was not working, for whatever
>> reason.  That is no longer the case; now the lfs command stops at the
>> first non-talking OST.  This seems more like a bug than a feature.  Is
>> there some other way to see a list of non-communicating OSTs on a
>> client?
>>
>> Thanks in advance for any help offered.
>>
>> bob