[Lustre-discuss] [HPDD-discuss] Failure to connect to some OST from a client machine
Bob Ball
ball at umich.edu
Thu Sep 5 13:14:31 PDT 2013
That is an interesting mix. Nothing shows up at all on the clients,
even on those 3 that route to a second NIC. On the OSS, it is quite the
mix of up/down on the 3 routers, with no obvious pattern.
Most of our traffic is on the 10.10 network, with the 3 machines shown
below routing to a small number of clients on a more public network.
FYI, the current situation is one in which all machines are happy, as
far as I can tell.
bob
Running lctl show_route on all machines in lustre_fss.txt
On umdist05.local
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umfs06.local
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
net tcp2 hops 1 gw 10.10.1.52@tcp up
Succeeded
On umdist01.local
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umdist02.local
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umdist03.local
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.52@tcp up
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umdist04.local
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
Succeeded
On umdist07.local
net tcp2 hops 1 gw 10.10.1.50@tcp down
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
Succeeded
On umdist08.local
net tcp2 hops 1 gw 10.10.1.50@tcp down
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.51@tcp down
Succeeded
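For what it's worth, per-host output like the above can be scanned for downed
routes with a short awk one-liner. A minimal sketch, parsing captured output
in the format shown (the here-doc sample below is illustrative, not a full
copy of the listing), rather than running lctl live:

```shell
#!/bin/sh
# Flag downed LNET routes in collected `lctl show_route` output.
# On a single live node you would just run:  lctl show_route | grep down
# Here we attribute each route line to the host it was collected on.
awk '/^On /   { host = $2 }            # remember which host this block is from
     / down$/ { print host ": " $0 }' <<'EOF'
On umdist05.local
net tcp2 hops 1 gw 10.10.1.52@tcp down
net tcp2 hops 1 gw 10.10.1.50@tcp up
On umdist07.local
net tcp2 hops 1 gw 10.10.1.50@tcp down
EOF
```

Feeding it the full listing would show at a glance that umdist07 and umdist08
see all three gateways as down while the others disagree host by host.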
On 9/5/2013 4:01 PM, Kris Howard wrote:
> Might check lctl show_route and look for downed routes.
>
>
> On Thu, Sep 5, 2013 at 12:56 PM, Bob Ball <ball at umich.edu
> <mailto:ball at umich.edu>> wrote:
>
> We are running Lustre 2.1.6 on Scientific Linux 6.4, kernel
> 2.6.32-358.11.1.el6.x86_64. This was an upgrade from Lustre 1.8.4
> on SL5.
>
> We have had a few situations lately where a client stops talking
> to some subset of the OSTs (we have about 58 of them across 8 OSS,
> nearly 500TB in all). I have a couple of questions.
>
> 1. "lctl dl" on the OSS shows a smaller count on the affected
> servers; on the client, all OSS showed UP in "lctl dl". Today, I
> first tried rebooting this OSS, but that did not change the
> situation. I ended up rebooting the client before I could get
> full connectivity back. Is there any way to force a reconnect from
> the client, short of rebooting it?
>
> 2. It used to be the case under Lustre 1.8.4 that I could run "lfs
> df -h" on the client and see all OSTs, even those where the
> connection was not working, for whatever reason. That is no
> longer the case: now the lfs command stops at the first
> non-responding OST. This seems more like a bug than a feature. Is
> there some other way to see a list of non-communicating OSTs on a
> client?
>
> Thanks in advance for any help offered.
>
> bob
>
>
>
> _______________________________________________
> HPDD-discuss mailing list
> HPDD-discuss at lists.01.org <mailto:HPDD-discuss at lists.01.org>
> https://lists.01.org/mailman/listinfo/hpdd-discuss
>
>
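On the two questions quoted above, a hedged sketch of one approach: the
client keeps a per-OST import, and its state can be read with
`lctl get_param osc.*.state` (or checked with `lfs check osts`) without
touching `lfs df`. For a stuck import, toggling
`lctl set_param osc.<fsname>-OST00xx-*.active=0` then `...active=1` on the
client is sometimes enough to avoid a reboot; whether that suffices on 2.1.6
would need testing. The snippet below parses captured get_param-style output
(the exact format is an assumption, not verified on 2.1.6; the fsname
"testfs" and the hex instance suffix are hypothetical), flagging any import
not in FULL state:

```shell
#!/bin/sh
# Flag OSC imports that are not in FULL state, from captured output of:
#   lctl get_param osc.*.state
# Output format below is assumed, not verified against Lustre 2.1.6.
awk -F': ' '/^osc\./        { osc = $0 }   # remember which OSC device this is
            /current_state/ { if ($2 != "FULL") print osc " -> " $2 }' <<'EOF'
osc.testfs-OST0000-osc-ffff880074d38000.state=
current_state: FULL
osc.testfs-OST0001-osc-ffff880074d38000.state=
current_state: DISCONN
EOF
```

Run against a live client, anything printed would be an OST the client is
not currently talking to, which may answer question 2 more directly than
`lfs df -h`.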