[Lustre-discuss] [HPDD-discuss] Failure to connect to some OST from a client machine
Bob Ball
ball at umich.edu
Thu Sep 5 13:14:31 PDT 2013
That is an interesting mix. Nothing shows up at all on the clients,
even on those 3 that route to a second NIC. On the OSS, it is quite the
mix of up/down on the 3 routers, with no obvious pattern.
Most of our traffic is on the 10.10 network, with the 3 machines shown
below routing to a small number of clients on a more public network.
FYI, the current situation is one in which all machines are happy, as
far as I can tell.
bob
Running lctl show_route on all machines in lustre_fss.txt
On umdist05.local
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umfs06.local
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
net tcp2 hops 1 gw 10.10.1.52@tcp up
Succeeded
On umdist01.local
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umdist02.local
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umdist03.local
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.52@tcp up
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umdist04.local
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umdist07.local
net tcp2 hops 1 gw 10.10.1.50@tcp down
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
Succeeded
On umdist08.local
net tcp2 hops 1 gw 10.10.1.50@tcp down
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
Succeeded
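For what it's worth, per-host output like the above can be scanned for downed
routes with a short awk one-liner. A minimal sketch, parsing captured output
in the format shown (the here-doc sample below is illustrative, not a full
copy of the listing), rather than running lctl live:

```shell
#!/bin/sh
# Flag downed LNET routes in collected `lctl show_route` output.
# On a single live node you would just run:  lctl show_route | grep down
# Here we attribute each route line to the host it was collected on.
awk '/^On /   { host = $2 }            # remember which host this block is from
     / down$/ { print host ": " $0 }' <<'EOF'
On umdist05.local
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
On umdist07.local
net tcp2 hops 1 gw 10.10.1.50@tcp down
EOF
```

Feeding it the full listing would show at a glance that umdist07 and umdist08
see all three gateways as down while the others disagree host by host.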
On 9/5/2013 4:01 PM, Kris Howard wrote:
> Might check lctl show_route and look for downed routes.
>
>
> On Thu, Sep 5, 2013 at 12:56 PM, Bob Ball <ball at umich.edu
> <mailto:ball at umich.edu>> wrote:
>
> We are running Lustre 2.1.6 on Scientific Linux 6.4, kernel
> 2.6.32-358.11.1.el6.x86_64. This was an upgrade from Lustre 1.8.4
> on SL5.
>
> We have had a few situations lately where a client stops talking
> to some subset of the OSTs (we have about 58 of them across 8 OSS,
> nearly 500TB in all). I have a couple of questions.
>
> 1. "lctl dl" on the OSS shows a smaller count on the affected
> servers; on the client, all OSS showed UP in "lctl dl". Today, I
> first tried rebooting this OSS, but that did not change the
> situation. I ended up rebooting the client before I could get
> full connectivity back. Is there any way to force a reconnect from
> the client, short of rebooting it?
>
> 2. It used to be the case under Lustre 1.8.4 that I could run "lfs
> df -h" on the client and see all OSTs, even those where the
> connection was not working, for whatever reason. That is no
> longer the case: now the lfs command stops at the first
> non-responding OST. This seems more like a bug than a feature. Is
> there some other way to see a list of non-communicating OSTs on a
> client?
>
> Thanks in advance for any help offered.
>
> bob
>
>
>
> _______________________________________________
> HPDD-discuss mailing list
> HPDD-discuss at lists.01.org <mailto:HPDD-discuss at lists.01.org>
> https://lists.01.org/mailman/listinfo/hpdd-discuss
>
>
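On the two questions quoted above, a hedged sketch of one approach: the
client keeps a per-OST import, and its state can be read with
`lctl get_param osc.*.state` (or checked with `lfs check osts`) without
touching `lfs df`. For a stuck import, toggling
`lctl set_param osc.<fsname>-OST00xx-*.active=0` then `...active=1` on the
client is sometimes enough to avoid a reboot; whether that suffices on 2.1.6
would need testing. The snippet below parses captured get_param-style output
(the exact format is an assumption, not verified on 2.1.6; the fsname
"testfs" and the hex instance suffix are hypothetical), flagging any import
not in FULL state:

```shell
#!/bin/sh
# Flag OSC imports that are not in FULL state, from captured output of:
#   lctl get_param osc.*.state
# Output format below is assumed, not verified against Lustre 2.1.6.
awk -F': ' '/^osc\./        { osc = $0 }   # remember which OSC device this is
            /current_state/ { if ($2 != "FULL") print osc " -> " $2 }' <<'EOF'
osc.testfs-OST0000-osc-ffff880074d38000.state=
current_state: FULL
osc.testfs-OST0001-osc-ffff880074d38000.state=
current_state: DISCONN
EOF
```

Run against a live client, anything printed would be an OST the client is
not currently talking to, which may answer question 2 more directly than
`lfs df -h`.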