[lustre-discuss] client fails to mount

Strikwerda, Ger g.j.c.strikwerda at rug.nl
Mon Apr 24 08:10:25 PDT 2017


Hi Rick,

Even with no iptables rules active, and with the correct modules loaded
afterwards, we get the same results:

[root@pg-gpu01 sysconfig]# iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain LOGDROP (0 references)
target     prot opt source               destination
LOG        all  --  anywhere             anywhere            LOG level warning
DROP       all  --  anywhere             anywhere

[root@pg-gpu01 sysconfig]# modprobe lnet

[root@pg-gpu01 sysconfig]# modprobe lustre

[root@pg-gpu01 sysconfig]# lctl ping 172.23.55.211@o2ib
failed to ping 172.23.55.211@o2ib: Input/output error
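
The comparison used throughout this thread (LNet ping versus plain IP ping
against the same servers) could be wrapped in a small shell sketch. This is
only an illustration, not a tested tool: the function names nid_to_ip and
compare_reachability are made up here, the NIDs are the ones quoted in this
thread, and it assumes lctl and a Linux ping are in PATH. Note that the list
archive renders the "@" in NIDs such as 172.23.55.211@o2ib as " at ".

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of Lustre.

# Strip the LNet network suffix from a NID:
# 172.23.55.211@o2ib -> 172.23.55.211
nid_to_ip() {
    printf '%s\n' "${1%%@*}"
}

# For each NID given as an argument, report whether an ICMP ping and an
# LNet ping succeed. Needs lctl and loaded lnet modules on the node.
compare_reachability() {
    for nid in "$@"; do
        ip=$(nid_to_ip "$nid")
        if ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then icmp=ok; else icmp=FAIL; fi
        if lctl ping "$nid" >/dev/null 2>&1; then lnet=ok; else lnet=FAIL; fi
        printf '%-24s icmp=%s lnet=%s\n' "$nid" "$icmp" "$lnet"
    done
}
```

Usage would be something like
"compare_reachability 172.23.55.211@o2ib 172.23.55.201@o2ib". For
172.23.55.201@o2ib, the outputs quoted in this thread correspond to
icmp=ok lnet=FAIL on the rebooted node.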


On Mon, Apr 24, 2017 at 4:59 PM, Mohr Jr, Richard Frank (Rick Mohr) <
rmohr at utk.edu> wrote:

> This might be a long shot, but have you checked for possible firewall
> rules that might be causing the issue?  I’m wondering if there is a chance
> that some rules were added after the nodes were up to allow Lustre access,
> and when a node got rebooted, it lost the rules.
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>
>
> > On Apr 24, 2017, at 10:19 AM, Strikwerda, Ger <g.j.c.strikwerda at rug.nl>
> wrote:
> >
> > Hi Russell,
> >
> > Thanks for the IB subnet clues:
> >
> > [root@pg-gpu01 ~]# ibv_devinfo
> > hca_id: mlx4_0
> >         transport:                      InfiniBand (0)
> >         fw_ver:                         2.32.5100
> >         node_guid:                      f452:1403:00f5:4620
> >         sys_image_guid:                 f452:1403:00f5:4623
> >         vendor_id:                      0x02c9
> >         vendor_part_id:                 4099
> >         hw_ver:                         0x1
> >         board_id:                       MT_1100120019
> >         phys_port_cnt:                  1
> >                 port:   1
> >                         state:                  PORT_ACTIVE (4)
> >                         max_mtu:                4096 (5)
> >                         active_mtu:             4096 (5)
> >                         sm_lid:                 1
> >                         port_lid:               185
> >                         port_lmc:               0x00
> >                         link_layer:             InfiniBand
> >
> > [root@pg-gpu01 ~]# sminfo
> > sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098 priority 0 state 3 SMINFO_MASTER
> >
> > It looks like the rebooted node is able to contact the IB subnet manager.
> >
> >
> >
> >
> > On Mon, Apr 24, 2017 at 4:14 PM, Russell Dekema <dekemar at umich.edu>
> wrote:
> > At first glance, this sounds like your InfiniBand subnet manager may
> > be down or malfunctioning. In this case, nodes which were already up
> > when the subnet manager was working will continue to be able to
> > communicate over IB, but nodes which reboot after the SM goes down
> > will not.
> >
> > You can test this theory by running the 'ibv_devinfo' command on one
> > of your rebooted nodes. If the relevant IB port is in state PORT_INIT,
> > this confirms there is a problem with your subnet manager.
> >
> > Sincerely,
> > Rusty Dekema
> >
> >
> >
> >
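The port-state test described above could be scripted roughly as follows.
This is a sketch under assumptions: the function names ib_port_states and
sm_verdict are invented for illustration, and the parsing relies on the
"state:" lines having the layout shown in the ibv_devinfo output earlier
in this thread.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative.

# Read ibv_devinfo output on stdin and print each port state, one per line
# (for example PORT_ACTIVE or PORT_INIT).
ib_port_states() {
    awk '$1 == "state:" { print $2 }'
}

# Turn a port state into a short verdict, following the reasoning above.
sm_verdict() {
    case "$1" in
        PORT_ACTIVE) echo "port active: subnet manager has configured this port" ;;
        PORT_INIT)   echo "port stuck in INIT: suspect the subnet manager" ;;
        *)           echo "state $1: not covered by this check" ;;
    esac
}
```

On a node with the InfiniBand tools installed, this could be run as
"ibv_devinfo | ib_port_states | while read -r s; do sm_verdict "$s"; done".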
> > On Mon, Apr 24, 2017 at 9:57 AM, Strikwerda, Ger
> > <g.j.c.strikwerda at rug.nl> wrote:
> > > Hi everybody,
> > >
> > > Here at the University of Groningen we are now experiencing a strange
> > > Lustre error. If a client reboots, it fails to mount the Lustre storage.
> > > The client is not able to reach the MGS. The storage and nodes
> > > communicate over IB, and until now that worked without any problems. It
> > > looks like an issue inside LNET: clients cannot LNET ping or connect to
> > > the metadata and/or storage servers, but they can LNET ping each other.
> > > Clients that have not been rebooted are working fine and still have
> > > their mounts on our Lustre filesystem.
> > >
> > > Lustre client log:
> > >
> > > Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> > > LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
> > >
> > > LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log 'pgdata01-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> > > LustreError: 3812:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
> > > Lustre: Unmounted pgdata01-client
> > > LustreError: 3812:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount (-5)
> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib rejected: consumer defined fatal error
> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> > > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1492789626/real 1492789626] req@ffff88105af2cc00 x1565303228072004/t0(0) o250->MGC172.23.55.211@o2ib@172.23.55.212@o2ib:26/25 lens 400/544 e 0 to 1 dl 1492789631 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> > > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1 previous similar message
> > > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff882041ffc000 x1565303228071996/t0(0) o101->MGC172.23.55.211@o2ib@172.23.55.211@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
> > > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
> > > LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log 'pghome01-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> > > LustreError: 3826:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
> > >
> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib rejected: consumer defined fatal error
> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> > > LNet: 3755:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.211@o2ib failed: 5
> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> > > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.211@o2ib: connection failed
> > > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.212@o2ib: connection failed
> > > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.212@o2ib failed: 5
> > > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Skipped 17 previous similar messages
> > > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.211@o2ib: connection failed
> > > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.212@o2ib failed: 5
> > > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.212@o2ib: connection failed
> > >
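To see at a glance which peers are rejecting connections in a log like the
one above, the kiblnd_rejected() messages could be filtered with a short
sketch like this. The function name rejecting_nids is hypothetical, and the
pattern assumes each message sits on a single line with NIDs written with
"@" (the list archive shows " at " instead), in the format quoted above.

```shell
#!/bin/sh
# Hypothetical helper; the name is illustrative.

# Read kernel log lines on stdin and print the unique NIDs that rejected an
# o2iblnd connection. Assumes one message per line and NIDs containing "@".
rejecting_nids() {
    sed -n 's/.*kiblnd_rejected()) \([^ ]*@[^ ]*\) rejected:.*/\1/p' | sort -u
}
```

A possible invocation is "dmesg | rejecting_nids". It only lists the
rejecting peers; it does not explain why they reject.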
> > > LNET ping of the first metadata node:
> > >
> > > [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
> > > failed to ping 172.23.55.211@o2ib: Input/output error
> > >
> > > LNET ping of the second metadata node:
> > >
> > > [root@pg-gpu01 ~]# lctl ping 172.23.55.212@o2ib
> > > failed to ping 172.23.55.212@o2ib: Input/output error
> > >
> > > LNET ping of a random compute node:
> > >
> > > [root@pg-gpu01 ~]# lctl ping 172.23.52.5@o2ib
> > > 12345-0@lo
> > > 12345-172.23.52.5@o2ib
> > >
> > > LNET ping to OST01:
> > >
> > > [root@pg-gpu01 ~]# lctl ping 172.23.55.201@o2ib
> > > failed to ping 172.23.55.201@o2ib: Input/output error
> > >
> > > LNET ping to OST02:
> > >
> > > [root@pg-gpu01 ~]# lctl ping 172.23.55.202@o2ib
> > > failed to ping 172.23.55.202@o2ib: Input/output error
> > >
> > > 'Normal' pings (at the IP level) work fine:
> > >
> > > [root@pg-gpu01 ~]# ping 172.23.55.201
> > > PING 172.23.55.201 (172.23.55.201) 56(84) bytes of data.
> > > 64 bytes from 172.23.55.201: icmp_seq=1 ttl=64 time=0.741 ms
> > >
> > > [root@pg-gpu01 ~]# ping 172.23.55.202
> > > PING 172.23.55.202 (172.23.55.202) 56(84) bytes of data.
> > > 64 bytes from 172.23.55.202: icmp_seq=1 ttl=64 time=0.704 ms
> > >
> > > lctl dl on a rebooted node (no devices listed):
> > >
> > > [root@pg-gpu01 ~]# lctl dl
> > >
> > > lctl dl on a node that has not been rebooted:
> > >
> > > [root@pg-node005 ~]# lctl dl
> > >   0 UP mgc MGC172.23.55.211@o2ib
> > > 94bd1c8a-512f-b920-9a4e-a6aced3d386d 5
> > >   1 UP lov pgtemp01-clilov-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 4
> > >   2 UP lmv pgtemp01-clilmv-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 4
> > >   3 UP mdc pgtemp01-MDT0000-mdc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >   4 UP osc pgtemp01-OST0001-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >   5 UP osc pgtemp01-OST0003-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >   6 UP osc pgtemp01-OST0005-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >   7 UP osc pgtemp01-OST0007-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >   8 UP osc pgtemp01-OST0009-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >   9 UP osc pgtemp01-OST000b-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  10 UP osc pgtemp01-OST000d-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  11 UP osc pgtemp01-OST000f-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  12 UP osc pgtemp01-OST0011-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  13 UP osc pgtemp01-OST0002-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  14 UP osc pgtemp01-OST0004-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  15 UP osc pgtemp01-OST0006-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  16 UP osc pgtemp01-OST0008-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  17 UP osc pgtemp01-OST000a-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  18 UP osc pgtemp01-OST000c-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  19 UP osc pgtemp01-OST000e-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  20 UP osc pgtemp01-OST0010-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  21 UP osc pgtemp01-OST0012-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  22 UP osc pgtemp01-OST0013-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  23 UP osc pgtemp01-OST0015-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  24 UP osc pgtemp01-OST0017-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  25 UP osc pgtemp01-OST0014-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  26 UP osc pgtemp01-OST0016-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  27 UP osc pgtemp01-OST0018-osc-ffff88206906d400
> > > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > >  28 UP lov pgdata01-clilov-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 4
> > >  29 UP lmv pgdata01-clilmv-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 4
> > >  30 UP mdc pgdata01-MDT0000-mdc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  31 UP osc pgdata01-OST0001-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  32 UP osc pgdata01-OST0003-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  33 UP osc pgdata01-OST0005-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  34 UP osc pgdata01-OST0007-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  35 UP osc pgdata01-OST0009-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  36 UP osc pgdata01-OST000b-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  37 UP osc pgdata01-OST000d-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  38 UP osc pgdata01-OST000f-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  39 UP osc pgdata01-OST0002-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  40 UP osc pgdata01-OST0004-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  41 UP osc pgdata01-OST0006-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  42 UP osc pgdata01-OST0008-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  43 UP osc pgdata01-OST000a-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  44 UP osc pgdata01-OST000c-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  45 UP osc pgdata01-OST000e-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  46 UP osc pgdata01-OST0010-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  47 UP osc pgdata01-OST0013-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  48 UP osc pgdata01-OST0015-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  49 UP osc pgdata01-OST0017-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  50 UP osc pgdata01-OST0014-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  51 UP osc pgdata01-OST0016-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  52 UP osc pgdata01-OST0018-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  53 UP osc pgdata01-OST0019-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  54 UP osc pgdata01-OST001a-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  55 UP osc pgdata01-OST001b-osc-ffff88204bab6400
> > > 996b1742-82eb-281c-c322-e244672d5225 5
> > >  56 UP lov pghome01-clilov-ffff88204bb50000
> > > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
> > >  57 UP lmv pghome01-clilmv-ffff88204bb50000
> > > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
> > >  58 UP mdc pghome01-MDT0000-mdc-ffff88204bb50000
> > > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> > >  59 UP osc pghome01-OST0011-osc-ffff88204bb50000
> > > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> > >  60 UP osc pghome01-OST0012-osc-ffff88204bb50000
> > > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> > >
> > > Please help; any clues, advice, hints, or tips are appreciated.
> > >
> > > --
> > >
> > > Vriendelijke groet,
> > >
> > > Ger Strikwerda
> > > Chef Special
> > > Rijksuniversiteit Groningen
> > > Centrum voor Informatie Technologie
> > > Unit Pragmatisch Systeembeheer
> > >
> > > Smitsborg
> > > Nettelbosje 1
> > > 9747 AJ Groningen
> > > Tel. 050 363 9276
> > >
> > > "God is hard, God is fair
> > >  some men he gave brains, others he gave hair"
> > >
> > >
> > > _______________________________________________
> > > lustre-discuss mailing list
> > > lustre-discuss at lists.lustre.org
> > > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> > >
> >
>
>
>


-- 

Vriendelijke groet,

Ger Strikwerda
Chef Special
Rijksuniversiteit Groningen
Centrum voor Informatie Technologie
Unit Pragmatisch Systeembeheer

Smitsborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9276
"God is hard, God is fair
 some men he gave brains, others he gave hair"