[lustre-discuss] client fails to mount

E.S. Rosenberg esr+lustre at mail.hebrew.edu
Mon May 1 06:15:49 PDT 2017


On Mon, May 1, 2017 at 3:45 PM, Strikwerda, Ger <g.j.c.strikwerda at rug.nl>
wrote:

> Hi Eli,
>
> We have a 180+ node compute cluster, IB/10Gb connected, with Lustre storage
> that is also IB/10Gb connected. We have multiple IB switches, with the
> master/core switch manageable via its web interface. This switch is a
> Mellanox SX6036 FDR switch. One subnet manager is supposed to be running on
> this switch, and running 'sminfo' on the clients reported that the subnet
> manager was alive. But when we looked via the web management interface, the
> subnet manager was unstable. The reason is unknown; it could be faulty
> firmware. During the weekend the system was running fine.
>
Did anything specific make you look at the switch, or did you only check
there after everything else had been ruled out?
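For anyone hitting this later, the checks mentioned in this thread can be sketched as a small client-side helper. This is only an illustration: `sminfo` and `ibv_devinfo` come from the standard IB diagnostics packages, and the grep-based function below is a hypothetical wrapper, not an official tool. As the thread shows, `sminfo` reporting SMINFO_MASTER is not proof the SM is healthy, so the switch's own management interface should be checked as well.

```shell
# Sketch: detect the "SM down/unstable" symptom from a rebooted client.
# A port that stays in PORT_INIT after boot usually means no subnet
# manager has (re)configured it.
check_ib_port_state() {
  # reads ibv_devinfo output on stdin
  if grep -q "PORT_INIT"; then
    echo "PORT_INIT: subnet manager problem likely"
  else
    echo "port state looks configured"
  fi
}

# Typical use on a rebooted client (requires IB hardware):
#   sminfo
#   ibv_devinfo | check_ib_port_state
```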

>
>
>
>
>
>
> On Mon, May 1, 2017 at 2:18 PM, E.S. Rosenberg <esr+lustre at mail.hebrew.edu
> > wrote:
>
>>
>>
>> On Mon, May 1, 2017 at 11:46 AM, Strikwerda, Ger <g.j.c.strikwerda at rug.nl
>> > wrote:
>>
>>> Hi all,
>>>
>>> Our clients-failed-to-mount/lctl-ping horror turned out to be a failing
>>> subnet manager. We did not see an issue running 'sminfo', but on the IB
>>> switch we could see that the subnet manager was unstable. This caused
>>> mayhem on the IB/Lustre setup.
>>>
>> Can you describe a bit more how you found this?
>> You are running an SM on the switches?
>> That way, if someone else runs into this, they will be able to check it
>> too...
>>
>>>
>>> Thanks everybody for their help/advice/hints. Good to see how this
>>> active community works!
>>>
>> Indeed.
>> Eli
>>
>>>
>>>
>>>
>>>
>>> On Tue, Apr 25, 2017 at 8:17 PM, E.S. Rosenberg <
>>> esr+lustre at mail.hebrew.edu> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Apr 25, 2017 at 7:41 PM, Oucharek, Doug S <
>>>> doug.s.oucharek at intel.com> wrote:
>>>>
>>>>> That specific message happens when the “magic” u32 field at the start
>>>>> of a message does not match what we are expecting.  We do check whether
>>>>> the message was transmitted with a different endianness from ours, so
>>>>> when you see this error, we assume the message has been corrupted or
>>>>> the sender is using an invalid magic value.  I don’t believe this value
>>>>> has changed in the history of the LND, so this is more likely
>>>>> corruption of some sort.
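The check Doug describes can be sketched as follows. The magic values here are made up for illustration; they are not the real o2iblnd constants, and this is only a model of the logic, not the kernel code.

```shell
# Illustration of the magic/endianness check described above: the receiver
# accepts the expected magic either as-is or byte-swapped (an other-endian
# peer); anything else is treated as corruption or an invalid sender.
EXPECTED_MAGIC="0x0be91b91"   # hypothetical magic value
SWAPPED_MAGIC="0x911be90b"    # the same four bytes in reverse order

check_magic() {
  case "$1" in
    "$EXPECTED_MAGIC") echo "ok: same endianness" ;;
    "$SWAPPED_MAGIC")  echo "ok: peer is other-endian, will byte-swap" ;;
    *)                 echo "reject: corrupted message or invalid magic" ;;
  esac
}
```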
>>>>>
>>>>
>>>> OT: this information should probably be added to LU-2977 which
>>>> specifically includes the question: What does "consumer defined fatal
>>>> error" mean and why is this connection rejected?
>>>>
>>>>
>>>>
>>>>> Doug
>>>>>
>>>>> > On Apr 25, 2017, at 2:29 AM, Dilger, Andreas <
>>>>> andreas.dilger at intel.com> wrote:
>>>>> >
>>>>> > I'm not an LNet expert, but I think the critical issue to focus on
>>>>> is:
>>>>> >
>>>>> >  Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>>>>> >  LNet: Added LNI 172.23.54.51 at o2ib [8/256/0/180]
>>>>> >  LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>>>>> 172.23.55.211 at o2ib rejected: consumer defined fatal error
>>>>> >
>>>>> > This means that the LND didn't connect at startup time, but I don't
>>>>> know what the cause is.
>>>>> > The error that generates this message is IB_CM_REJ_CONSUMER_DEFINED,
>>>>> but I don't know enough about IB to tell you what that means.  Some of the
>>>>> later code is checking for mismatched Lustre versions, but it doesn't even
>>>>> get that far.
>>>>> >
>>>>> > Cheers, Andreas
>>>>> >
>>>>> >> On Apr 25, 2017, at 02:21, Strikwerda, Ger <g.j.c.strikwerda at rug.nl>
>>>>> wrote:
>>>>> >>
>>>>> >> Hi Raj,
>>>>> >>
>>>>> >> [root at pg-gpu01 ~]# lustre_rmmod
>>>>> >>
>>>>> >> [root at pg-gpu01 ~]# modprobe -v lustre
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/libcfs.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lvfs.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/net/lustre/lnet.ko networks=o2ib(ib0)
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/obdclass.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/ptlrpc.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/fid.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/mdc.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/osc.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lov.ko
>>>>> >> insmod /lib/modules/2.6.32-642.6.2.el6.x86_64/weak-updates/kernel/fs/lustre/lustre.ko
>>>>> >>
>>>>> >> dmesg:
>>>>> >>
>>>>> >> LNet: HW CPU cores: 24, npartitions: 4
>>>>> >> alg: No test for crc32 (crc32-table)
>>>>> >> alg: No test for adler32 (adler32-zlib)
>>>>> >> alg: No test for crc32 (crc32-pclmul)
>>>>> >> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>>>>> >> LNet: Added LNI 172.23.54.51 at o2ib [8/256/0/180]
>>>>> >>
>>>>> >> But no luck,
>>>>> >>
>>>>> >> [root at pg-gpu01 ~]# lctl ping 172.23.55.211 at o2ib
>>>>> >> failed to ping 172.23.55.211 at o2ib: Input/output error
>>>>> >>
>>>>> >> [root at pg-gpu01 ~]# mount /home
>>>>> >> mount.lustre: mount 172.23.55.211 at o2ib:172.23.55.212 at o2ib:/pghome01 at /home failed: Input/output error
>>>>> >> Is the MGS running?
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 7:53 PM, Raj <rajgautam at gmail.com> wrote:
>>>>> >> Yes, this is strange. Normally I have seen a credits mismatch result
>>>>> in this scenario, but it doesn't look like that is the case here.
>>>>> >>
>>>>> >> You wouldn't want to put the MGS into debug-message capture, as
>>>>> there will be a lot of data.
>>>>> >>
>>>>> >> I guess you already tried removing the lustre drivers and adding
>>>>> them again?
>>>>> >> lustre_rmmod
>>>>> >> modprobe -v lustre
>>>>> >>
>>>>> >> And check dmesg for any errors...
>>>>> >>
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 12:43 PM Strikwerda, Ger <
>>>>> g.j.c.strikwerda at rug.nl> wrote:
>>>>> >> Hi Raj,
>>>>> >>
>>>>> >> When I do an lctl ping on an MGS server, I do not see any logs at
>>>>> all, also not when I do a successful ping from a working node. Is there
>>>>> a way to make the Lustre logging more verbose, to see more detail at
>>>>> the LNET level?
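On the verbosity question: the Lustre debug mask can be widened at runtime with lctl. A minimal sketch, assuming root on a node with the Lustre modules loaded; only the function definition is shown, since running it requires a live Lustre client.

```shell
# Sketch: capture LNet-level debug output around a failing lctl ping.
lnet_debug_capture() {
  lctl set_param debug=+neterror   # log network errors
  lctl set_param debug=+net        # log general LNet activity
  lctl ping "$1"                   # reproduce the failing ping
  lctl dk /tmp/lustre-debug.log    # dump the kernel debug buffer to a file
}
# Usage: lnet_debug_capture 172.23.55.211@o2ib
```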
>>>>> >>
>>>>> >> It is very strange that a rebooted node is able to lctl ping the
>>>>> compute nodes, but fails to lctl ping the metadata and storage nodes.
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 7:35 PM, Raj <rajgautam at gmail.com> wrote:
>>>>> >> Ger,
>>>>> >> It looks like default configuration of lustre.
>>>>> >>
>>>>> >> Do you see any error message on the MGS side while you are doing
>>>>> lctl ping from the rebooted clients?
>>>>> >> On Mon, Apr 24, 2017 at 12:27 PM Strikwerda, Ger <
>>>>> g.j.c.strikwerda at rug.nl> wrote:
>>>>> >> Hi Eli,
>>>>> >>
>>>>> >> Nothing can be mounted on the Lustre filesystems, so the output is:
>>>>> >>
>>>>> >> [root at pg-gpu01 ~]# lfs df /home/ger/
>>>>> >> [root at pg-gpu01 ~]#
>>>>> >>
>>>>> >> Empty..
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 7:24 PM, E.S. Rosenberg <esr at cs.huji.ac.il>
>>>>> wrote:
>>>>> >>
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 8:19 PM, Strikwerda, Ger <
>>>>> g.j.c.strikwerda at rug.nl> wrote:
>>>>> >> Hallo Eli,
>>>>> >>
>>>>> >> Logfile/syslog on the client-side:
>>>>> >>
>>>>> >> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>>>>> >> LNet: Added LNI 172.23.54.51 at o2ib [8/256/0/180]
>>>>> >> LNetError: 2878:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>>>>> 172.23.55.211 at o2ib rejected: consumer defined fatal error
>>>>> >>
>>>>> >> lctl df /path/to/some/file
>>>>> >>
>>>>> >> gives nothing useful? (the second one will dump *a lot*)
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 7:16 PM, E.S. Rosenberg <
>>>>> esr+lustre at mail.hebrew.edu> wrote:
>>>>> >>
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 8:13 PM, Strikwerda, Ger <
>>>>> g.j.c.strikwerda at rug.nl> wrote:
>>>>> >> Hi Raj (and others),
>>>>> >>
>>>>> >> In which file should I put the credits/peer_credits settings?
>>>>> >>
>>>>> >> Perhaps relevant config-files:
>>>>> >>
>>>>> >> [root at pg-gpu01 ~]# cd /etc/modprobe.d/
>>>>> >>
>>>>> >> [root at pg-gpu01 modprobe.d]# ls
>>>>> >> anaconda.conf   blacklist-kvm.conf      dist-alsa.conf
>>>>> dist-oss.conf           ib_ipoib.conf  lustre.conf  openfwwf.conf
>>>>> >> blacklist.conf  blacklist-nouveau.conf  dist.conf
>>>>>  freeipmi-modalias.conf  ib_sdp.conf    mlnx.conf    truescale.conf
>>>>> >>
>>>>> >> [root at pg-gpu01 modprobe.d]# cat ./ib_ipoib.conf
>>>>> >> alias netdev-ib* ib_ipoib
>>>>> >>
>>>>> >> [root at pg-gpu01 modprobe.d]# cat ./mlnx.conf
>>>>> >> # Module parameters for MLNX_OFED kernel modules
>>>>> >>
>>>>> >> [root at pg-gpu01 modprobe.d]# cat ./lustre.conf
>>>>> >> options lnet networks=o2ib(ib0)
>>>>> >>
>>>>> >> Are there more Lustre/LNET options that could help in this
>>>>> situation?
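On the question of where the credits settings go: credits and peer_credits are ko2iblnd module parameters and can be set in the same modprobe.d file as the lnet options. A hypothetical example follows; the values shown are illustrative only, and credits settings must match between clients and servers.

```conf
# /etc/modprobe.d/lustre.conf -- example values only
options lnet networks=o2ib(ib0)
options ko2iblnd peer_credits=16 credits=256
```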
>>>>> >>
>>>>> >> What about the logfiles?
>>>>> >> Any error messages in syslog? lctl debug options?
>>>>> >> Veel geluk,
>>>>> >> Eli
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 7:02 PM, Raj <rajgautam at gmail.com> wrote:
>>>>> >> May be worth checking your lnet credits and peer_credits in
>>>>> /etc/modprobe.d ?
>>>>> >> You can compare between working hosts and non working hosts.
>>>>> >> Thanks
>>>>> >> _Raj
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 10:10 AM Strikwerda, Ger <
>>>>> g.j.c.strikwerda at rug.nl> wrote:
>>>>> >> Hi Rick,
>>>>> >>
>>>>> >> Even without iptables rules, and after loading the correct modules
>>>>> again, we get the same results:
>>>>> >>
>>>>> >> [root at pg-gpu01 sysconfig]# iptables --list
>>>>> >> Chain INPUT (policy ACCEPT)
>>>>> >> target     prot opt source               destination
>>>>> >>
>>>>> >> Chain FORWARD (policy ACCEPT)
>>>>> >> target     prot opt source               destination
>>>>> >>
>>>>> >> Chain OUTPUT (policy ACCEPT)
>>>>> >> target     prot opt source               destination
>>>>> >>
>>>>> >> Chain LOGDROP (0 references)
>>>>> >> target     prot opt source               destination
>>>>> >> LOG        all  --  anywhere             anywhere            LOG
>>>>> level warning
>>>>> >> DROP       all  --  anywhere             anywhere
>>>>> >>
>>>>> >> [root at pg-gpu01 sysconfig]# modprobe lnet
>>>>> >>
>>>>> >> [root at pg-gpu01 sysconfig]# modprobe lustre
>>>>> >>
>>>>> >> [root at pg-gpu01 sysconfig]# lctl ping 172.23.55.211 at o2ib
>>>>> >>
>>>>> >> failed to ping 172.23.55.211 at o2ib: Input/output error
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Mon, Apr 24, 2017 at 4:59 PM, Mohr Jr, Richard Frank (Rick Mohr)
>>>>> <rmohr at utk.edu> wrote:
>>>>> >> This might be a long shot, but have you checked for possible
>>>>> firewall rules that might be causing the issue?  I’m wondering if there is
>>>>> a chance that some rules were added after the nodes were up to allow Lustre
>>>>> access, and when a node got rebooted, it lost the rules.
>>>>> >>
>>>>> >> --
>>>>> >> Rick Mohr
>>>>> >> Senior HPC System Administrator
>>>>> >> National Institute for Computational Sciences
>>>>> >> http://www.nics.tennessee.edu
>>>>> >>
>>>>> >>
>>>>> >>> On Apr 24, 2017, at 10:19 AM, Strikwerda, Ger <
>>>>> g.j.c.strikwerda at rug.nl> wrote:
>>>>> >>>
>>>>> >>> Hi Russell,
>>>>> >>>
>>>>> >>> Thanks for the IB subnet clues:
>>>>> >>>
>>>>> >>> [root at pg-gpu01 ~]# ibv_devinfo
>>>>> >>> hca_id: mlx4_0
>>>>> >>>        transport:                      InfiniBand (0)
>>>>> >>>        fw_ver:                         2.32.5100
>>>>> >>>        node_guid:                      f452:1403:00f5:4620
>>>>> >>>        sys_image_guid:                 f452:1403:00f5:4623
>>>>> >>>        vendor_id:                      0x02c9
>>>>> >>>        vendor_part_id:                 4099
>>>>> >>>        hw_ver:                         0x1
>>>>> >>>        board_id:                       MT_1100120019
>>>>> >>>        phys_port_cnt:                  1
>>>>> >>>                port:   1
>>>>> >>>                        state:                  PORT_ACTIVE (4)
>>>>> >>>                        max_mtu:                4096 (5)
>>>>> >>>                        active_mtu:             4096 (5)
>>>>> >>>                        sm_lid:                 1
>>>>> >>>                        port_lid:               185
>>>>> >>>                        port_lmc:               0x00
>>>>> >>>                        link_layer:             InfiniBand
>>>>> >>>
>>>>> >>> [root at pg-gpu01 ~]# sminfo
>>>>> >>> sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count
>>>>> 80878098 priority 0 state 3 SMINFO_MASTER
>>>>> >>>
>>>>> >>> Looks like the rebooted node is able to contact the IB subnet
>>>>> manager
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> On Mon, Apr 24, 2017 at 4:14 PM, Russell Dekema <dekemar at umich.edu>
>>>>> wrote:
>>>>> >>> At first glance, this sounds like your Infiniband subnet manager
>>>>> may
>>>>> >>> be down or malfunctioning. In this case, nodes which were already
>>>>> up
>>>>> >>> when the subnet manager was working will continue to be able to
>>>>> >>> communicate over IB, but nodes which reboot after the SM goes down
>>>>> >>> will not.
>>>>> >>>
>>>>> >>> You can test this theory by running the 'ibv_devinfo' command on
>>>>> one
>>>>> >>> of your rebooted nodes. If the relevant IB port is in state
>>>>> PORT_INIT,
>>>>> >>> this confirms there is a problem with your subnet manager.
>>>>> >>>
>>>>> >>> Sincerely,
>>>>> >>> Rusty Dekema
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> On Mon, Apr 24, 2017 at 9:57 AM, Strikwerda, Ger
>>>>> >>> <g.j.c.strikwerda at rug.nl> wrote:
>>>>> >>>> Hi everybody,
>>>>> >>>>
>>>>> >>>> Here at the University of Groningen we are now experiencing a
>>>>> strange Lustre error. If a client reboots, it fails to mount the Lustre
>>>>> storage. The client is not able to reach the MGS service. The storage
>>>>> and the nodes communicate over IB, until now without any problems. It
>>>>> looks like an issue inside LNET. Clients cannot LNET ping/connect to
>>>>> the metadata and/or storage nodes, but the clients are able to LNET
>>>>> ping each other. Clients which have not been rebooted are working fine
>>>>> and have their mounts on our Lustre filesystem.
>>>>> >>>>
>>>>> >>>> Lustre client log:
>>>>> >>>>
>>>>> >>>> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>>>>> >>>> LNet: Added LNI 172.23.54.51 at o2ib [8/256/0/180]
>>>>> >>>>
>>>>> >>>> LustreError: 15c-8: MGC172.23.55.211 at o2ib: The configuration
>>>>> from log
>>>>> >>>> 'pgdata01-client' failed (-5). This may be the result of
>>>>> communication
>>>>> >>>> errors between this node and the MGS, a bad configuration, or
>>>>> other errors.
>>>>> >>>> See the syslog for more information.
>>>>> >>>> LustreError: 3812:0:(llite_lib.c:1046:ll_fill_super()) Unable to
>>>>> process
>>>>> >>>> log: -5
>>>>> >>>> Lustre: Unmounted pgdata01-client
>>>>> >>>> LustreError: 3812:0:(obd_mount.c:1325:lustre_fill_super())
>>>>> Unable to mount
>>>>> >>>> (-5)
>>>>> >>>> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>>>>> 172.23.55.212 at o2ib
>>>>> >>>> rejected: consumer defined fatal error
>>>>> >>>> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped
>>>>> 1 previous
>>>>> >>>> similar message
>>>>> >>>> Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) @@@
>>>>> Request sent
>>>>> >>>> has failed due to network error: [sent 1492789626/real 1492789626]
>>>>> >>>> req at ffff88105af2cc00 x1565303228072004/t0(0)
>>>>> >>>> o250->MGC172.23.55.211 at o2ib@172.23.55.212 at o2ib:26/25 lens
>>>>> 400/544 e 0 to 1
>>>>> >>>> dl 1492789631 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>>>>> >>>> Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request())
>>>>> Skipped 1
>>>>> >>>> previous similar message
>>>>> >>>> LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req())
>>>>> @@@ send limit
>>>>> >>>> expired   req at ffff882041ffc000 x1565303228071996/t0(0)
>>>>> >>>> o101->MGC172.23.55.211 at o2ib@172.23.55.211 at o2ib:26/25 lens
>>>>> 328/344 e 0 to 0
>>>>> >>>> dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
>>>>> >>>> LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req())
>>>>> Skipped 2
>>>>> >>>> previous similar messages
>>>>> >>>> LustreError: 15c-8: MGC172.23.55.211 at o2ib: The configuration
>>>>> from log
>>>>> >>>> 'pghome01-client' failed (-5). This may be the result of
>>>>> communication
>>>>> >>>> errors between this node and the MGS, a bad configuration, or
>>>>> other errors.
>>>>> >>>> See the syslog for more information.
>>>>> >>>> LustreError: 3826:0:(llite_lib.c:1046:ll_fill_super()) Unable to
>>>>> process
>>>>> >>>> log: -5
>>>>> >>>>
>>>>> >>>> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>>>>> 172.23.55.212 at o2ib
>>>>> >>>> rejected: consumer defined fatal error
>>>>> >>>> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped
>>>>> 1 previous
>>>>> >>>> similar message
>>>>> >>>> LNet: 3755:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
>>>>> >>>> 172.23.55.211 at o2ib failed: 5
>>>>> >>>> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>>>>> 172.23.55.211 at o2ib
>>>>> >>>> rejected: consumer defined fatal error
>>>>> >>>> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped
>>>>> 1 previous
>>>>> >>>> similar message
>>>>> >>>> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed())
>>>>> Deleting
>>>>> >>>> messages for 172.23.55.211 at o2ib: connection failed
>>>>> >>>> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed())
>>>>> Deleting
>>>>> >>>> messages for 172.23.55.212 at o2ib: connection failed
>>>>> >>>> LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
>>>>> >>>> 172.23.55.212 at o2ib failed: 5
>>>>> >>>> LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Skipped 17
>>>>> previous
>>>>> >>>> similar messages
>>>>> >>>> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed())
>>>>> Deleting
>>>>> >>>> messages for 172.23.55.211 at o2ib: connection failed
>>>>> >>>> LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
>>>>> >>>> 172.23.55.212 at o2ib failed: 5
>>>>> >>>> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed())
>>>>> Deleting
>>>>> >>>> messages for 172.23.55.212 at o2ib: connection failed
>>>>> >>>>
>>>>> >>>> LNET ping of a metadata-node:
>>>>> >>>>
>>>>> >>>> [root at pg-gpu01 ~]# lctl ping 172.23.55.211 at o2ib
>>>>> >>>> failed to ping 172.23.55.211 at o2ib: Input/output error
>>>>> >>>>
>>>>> >>>> LNET ping of the number 2 metadata-node:
>>>>> >>>>
>>>>> >>>> [root at pg-gpu01 ~]# lctl ping 172.23.55.212 at o2ib
>>>>> >>>> failed to ping 172.23.55.212 at o2ib: Input/output error
>>>>> >>>>
>>>>> >>>> LNET ping of a random compute-node:
>>>>> >>>>
>>>>> >>>> [root at pg-gpu01 ~]# lctl ping 172.23.52.5 at o2ib
>>>>> >>>> 12345-0 at lo
>>>>> >>>> 12345-172.23.52.5 at o2ib
>>>>> >>>>
>>>>> >>>> LNET to OST01:
>>>>> >>>>
>>>>> >>>> [root at pg-gpu01 ~]# lctl ping 172.23.55.201 at o2ib
>>>>> >>>> failed to ping 172.23.55.201 at o2ib: Input/output error
>>>>> >>>>
>>>>> >>>> LNET to OST02:
>>>>> >>>>
>>>>> >>>> [root at pg-gpu01 ~]# lctl ping 172.23.55.202 at o2ib
>>>>> >>>> failed to ping 172.23.55.202 at o2ib: Input/output error
>>>>> >>>>
>>>>> >>>> 'normal' pings (on ip level) works fine:
>>>>> >>>>
>>>>> >>>> [root at pg-gpu01 ~]# ping 172.23.55.201
>>>>> >>>> PING 172.23.55.201 (172.23.55.201) 56(84) bytes of data.
>>>>> >>>> 64 bytes from 172.23.55.201: icmp_seq=1 ttl=64 time=0.741 ms
>>>>> >>>>
>>>>> >>>> [root at pg-gpu01 ~]# ping 172.23.55.202
>>>>> >>>> PING 172.23.55.202 (172.23.55.202) 56(84) bytes of data.
>>>>> >>>> 64 bytes from 172.23.55.202: icmp_seq=1 ttl=64 time=0.704 ms
>>>>> >>>>
>>>>> >>>> lctl on a rebooted node:
>>>>> >>>>
>>>>> >>>> [root at pg-gpu01 ~]# lctl dl
>>>>> >>>>
>>>>> >>>> lctl on a not rebooted node:
>>>>> >>>>
>>>>> >>>> [root at pg-node005 ~]# lctl dl
>>>>> >>>>  0 UP mgc MGC172.23.55.211 at o2ib 94bd1c8a-512f-b920-9a4e-a6aced3d386d
>>>>> 5
>>>>> >>>>  1 UP lov pgtemp01-clilov-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 4
>>>>> >>>>  2 UP lmv pgtemp01-clilmv-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 4
>>>>> >>>>  3 UP mdc pgtemp01-MDT0000-mdc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>>  4 UP osc pgtemp01-OST0001-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>>  5 UP osc pgtemp01-OST0003-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>>  6 UP osc pgtemp01-OST0005-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>>  7 UP osc pgtemp01-OST0007-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>>  8 UP osc pgtemp01-OST0009-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>>  9 UP osc pgtemp01-OST000b-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 10 UP osc pgtemp01-OST000d-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 11 UP osc pgtemp01-OST000f-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 12 UP osc pgtemp01-OST0011-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 13 UP osc pgtemp01-OST0002-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 14 UP osc pgtemp01-OST0004-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 15 UP osc pgtemp01-OST0006-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 16 UP osc pgtemp01-OST0008-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 17 UP osc pgtemp01-OST000a-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 18 UP osc pgtemp01-OST000c-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 19 UP osc pgtemp01-OST000e-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 20 UP osc pgtemp01-OST0010-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 21 UP osc pgtemp01-OST0012-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 22 UP osc pgtemp01-OST0013-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 23 UP osc pgtemp01-OST0015-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 24 UP osc pgtemp01-OST0017-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 25 UP osc pgtemp01-OST0014-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 26 UP osc pgtemp01-OST0016-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 27 UP osc pgtemp01-OST0018-osc-ffff88206906d400
>>>>> >>>> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>>>>> >>>> 28 UP lov pgdata01-clilov-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 4
>>>>> >>>> 29 UP lmv pgdata01-clilmv-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 4
>>>>> >>>> 30 UP mdc pgdata01-MDT0000-mdc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 31 UP osc pgdata01-OST0001-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 32 UP osc pgdata01-OST0003-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 33 UP osc pgdata01-OST0005-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 34 UP osc pgdata01-OST0007-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 35 UP osc pgdata01-OST0009-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 36 UP osc pgdata01-OST000b-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 37 UP osc pgdata01-OST000d-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 38 UP osc pgdata01-OST000f-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 39 UP osc pgdata01-OST0002-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 40 UP osc pgdata01-OST0004-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 41 UP osc pgdata01-OST0006-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 42 UP osc pgdata01-OST0008-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 43 UP osc pgdata01-OST000a-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 44 UP osc pgdata01-OST000c-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 45 UP osc pgdata01-OST000e-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 46 UP osc pgdata01-OST0010-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 47 UP osc pgdata01-OST0013-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 48 UP osc pgdata01-OST0015-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 49 UP osc pgdata01-OST0017-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 50 UP osc pgdata01-OST0014-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 51 UP osc pgdata01-OST0016-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 52 UP osc pgdata01-OST0018-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 53 UP osc pgdata01-OST0019-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 54 UP osc pgdata01-OST001a-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 55 UP osc pgdata01-OST001b-osc-ffff88204bab6400
>>>>> >>>> 996b1742-82eb-281c-c322-e244672d5225 5
>>>>> >>>> 56 UP lov pghome01-clilov-ffff88204bb50000
>>>>> >>>> 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
>>>>> >>>> 57 UP lmv pghome01-clilmv-ffff88204bb50000
>>>>> >>>> 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
>>>>> >>>> 58 UP mdc pghome01-MDT0000-mdc-ffff88204bb50000
>>>>> >>>> 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>>>>> >>>> 59 UP osc pghome01-OST0011-osc-ffff88204bb50000
>>>>> >>>> 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>>>>> >>>> 60 UP osc pghome01-OST0012-osc-ffff88204bb50000
>>>>> >>>> 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>>>>> >>>>
>>>>> >>>> Please help, any clues/advice/hints/tips are appreciated
>>>>> >>>>
>>>>> >>>> --
>>>>> >>>>
>>>>> >>>> Vriendelijke groet,
>>>>> >>>>
>>>>> >>>> Ger Strikwerda
>>>>> >>>> Chef Special
>>>>> >>>> Rijksuniversiteit Groningen
>>>>> >>>> Centrum voor Informatie Technologie
>>>>> >>>> Unit Pragmatisch Systeembeheer
>>>>> >>>>
>>>>> >>>> Smitsborg
>>>>> >>>> Nettelbosje 1
>>>>> >>>> 9747 AJ Groningen
>>>>> >>>> Tel. 050 363 9276
>>>>> >>>>
>>>>> >>>> "God is hard, God is fair
>>>>> >>>> some men he gave brains, others he gave hair"
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> _______________________________________________
>>>>> >>>> lustre-discuss mailing list
>>>>> >>>> lustre-discuss at lists.lustre.org
>>>>> >>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>>> >>>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >
>>>>> > Cheers, Andreas
>>>>> > --
>>>>> > Andreas Dilger
>>>>> > Lustre Principal Architect
>>>>> > Intel Corporation
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Vriendelijke groet,
>>>
>>> Ger Strikwerda
>>> Chef Special
>>> Rijksuniversiteit Groningen
>>> Centrum voor Informatie Technologie
>>> Unit Pragmatisch Systeembeheer
>>>
>>> Smitsborg
>>> Nettelbosje 1
>>> 9747 AJ Groningen
>>> Tel. 050 363 9276
>>> "God is hard, God is fair
>>>  some men he gave brains, others he gave hair"
>>>
>>>
>>
>
>
> --
>
> Vriendelijke groet,
>
> Ger Strikwerda
> Chef Special
> Rijksuniversiteit Groningen
> Centrum voor Informatie Technologie
> Unit Pragmatisch Systeembeheer
>
> Smitsborg
> Nettelbosje 1
> 9747 AJ Groningen
> Tel. 050 363 9276
> "God is hard, God is fair
>  some men he gave brains, others he gave hair"
>
>

