[lustre-discuss] client fails to mount

Raj rajgautam at gmail.com
Mon Apr 24 10:02:26 PDT 2017


It may be worth checking your LNet credits and peer_credits settings in
/etc/modprobe.d. You can compare them between working and non-working hosts.
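
A minimal sketch of that comparison, assuming the o2ib settings are given on
an "options ko2iblnd ..." line in a *.conf file under /etc/modprobe.d. The
function names o2ib_opts and o2ib_live are made up for illustration, and the
sysfs path assumes the ko2iblnd module is loaded and exposes these parameters.

```shell
#!/bin/sh
# Print the ko2iblnd module options found in a modprobe.d-style directory,
# one key=value per line, sorted so that two hosts can be diffed.
# Assumption: options are set via "options ko2iblnd key=value ..." lines.
o2ib_opts() {
    dir=${1:-/etc/modprobe.d}
    cat "$dir"/*.conf 2>/dev/null |
        awk '$1 == "options" && $2 == "ko2iblnd" {
                 for (i = 3; i <= NF; i++) print $i
             }' |
        sort
}

# Print the values currently in effect for the loaded module, if readable.
o2ib_live() {
    for p in credits peer_credits; do
        f=/sys/module/ko2iblnd/parameters/$p
        [ -r "$f" ] && echo "$p=$(cat "$f")"
    done
    return 0
}
```

Running o2ib_opts (and o2ib_live) on a working host such as pg-node005 and on
a failing one such as pg-gpu01, saving the output to files and running diff on
them, should show whether the two differ.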
Thanks
_Raj
On Mon, Apr 24, 2017 at 10:10 AM Strikwerda, Ger <g.j.c.strikwerda at rug.nl>
wrote:

> Hi Rick,
>
> Even without iptables rules and loading the correct modules afterwards, we
> get the same results:
>
> [root@pg-gpu01 sysconfig]# iptables --list
> Chain INPUT (policy ACCEPT)
> target     prot opt source               destination
>
> Chain FORWARD (policy ACCEPT)
> target     prot opt source               destination
>
> Chain OUTPUT (policy ACCEPT)
> target     prot opt source               destination
>
> Chain LOGDROP (0 references)
> target     prot opt source               destination
> LOG        all  --  anywhere             anywhere            LOG level warning
> DROP       all  --  anywhere             anywhere
>
> [root@pg-gpu01 sysconfig]# modprobe lnet
>
> [root@pg-gpu01 sysconfig]# modprobe lustre
>
> [root@pg-gpu01 sysconfig]# lctl ping 172.23.55.211@o2ib
>
> failed to ping 172.23.55.211@o2ib: Input/output error
>
> On Mon, Apr 24, 2017 at 4:59 PM, Mohr Jr, Richard Frank (Rick Mohr) <
> rmohr at utk.edu> wrote:
>
>> This might be a long shot, but have you checked for possible firewall
>> rules that might be causing the issue?  I’m wondering if there is a chance
>> that some rules were added after the nodes were up to allow Lustre access,
>> and when a node got rebooted, it lost the rules.
>>
>> --
>> Rick Mohr
>> Senior HPC System Administrator
>> National Institute for Computational Sciences
>> http://www.nics.tennessee.edu
>>
>>
>> > On Apr 24, 2017, at 10:19 AM, Strikwerda, Ger <g.j.c.strikwerda at rug.nl> wrote:
>> >
>> > Hi Russell,
>> >
>> > Thanks for the IB subnet clues:
>> >
>> > [root@pg-gpu01 ~]# ibv_devinfo
>> > hca_id: mlx4_0
>> >         transport:                      InfiniBand (0)
>> >         fw_ver:                         2.32.5100
>> >         node_guid:                      f452:1403:00f5:4620
>> >         sys_image_guid:                 f452:1403:00f5:4623
>> >         vendor_id:                      0x02c9
>> >         vendor_part_id:                 4099
>> >         hw_ver:                         0x1
>> >         board_id:                       MT_1100120019
>> >         phys_port_cnt:                  1
>> >                 port:   1
>> >                         state:                  PORT_ACTIVE (4)
>> >                         max_mtu:                4096 (5)
>> >                         active_mtu:             4096 (5)
>> >                         sm_lid:                 1
>> >                         port_lid:               185
>> >                         port_lmc:               0x00
>> >                         link_layer:             InfiniBand
>> >
>> > [root@pg-gpu01 ~]# sminfo
>> > sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098 priority 0 state 3 SMINFO_MASTER
>> >
>> > It looks like the rebooted node is able to contact the IB subnet manager.
>> >
>> >
>> >
>> >
>> > On Mon, Apr 24, 2017 at 4:14 PM, Russell Dekema <dekemar at umich.edu> wrote:
>> > At first glance, this sounds like your Infiniband subnet manager may
>> > be down or malfunctioning. In this case, nodes which were already up
>> > when the subnet manager was working will continue to be able to
>> > communicate over IB, but nodes which reboot after the SM goes down
>> > will not.
>> >
>> > You can test this theory by running the 'ibv_devinfo' command on one
>> > of your rebooted nodes. If the relevant IB port is in state PORT_INIT,
>> > this confirms there is a problem with your subnet manager.
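
The port-state check can be scripted roughly as follows. This is a sketch
that assumes the ibv_devinfo output layout shown elsewhere in this thread;
ib_port_states is a hypothetical helper name, not part of any IB tool.

```shell
#!/bin/sh
# Read ibv_devinfo output on stdin and print "<hca> <port> <state>" for
# every port, e.g. "mlx4_0 1 PORT_ACTIVE".
ib_port_states() {
    awk '$1 == "hca_id:" { hca = $2 }
         $1 == "port:"   { port = $2 }
         $1 == "state:"  { print hca, port, $2 }'
}
```

For example, "ibv_devinfo | ib_port_states" lists every port, and piping that
through awk '$3 != "PORT_ACTIVE"' leaves only the ports that are not active.
A port reported as PORT_INIT would point at the subnet manager, as described
above.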
>> >
>> > Sincerely,
>> > Rusty Dekema
>> >
>> >
>> >
>> >
>> > On Mon, Apr 24, 2017 at 9:57 AM, Strikwerda, Ger
>> > <g.j.c.strikwerda at rug.nl> wrote:
>> > > Hi everybody,
>> > >
>> > > Here at the University of Groningen we are experiencing a strange Lustre
>> > > error. If a client reboots, it fails to mount the Lustre storage. The
>> > > client is not able to reach the MGS service. The storage and nodes
>> > > communicate over IB, and until now this worked without any problems. It
>> > > looks like an issue inside LNET. The clients cannot LNET ping/connect to
>> > > the metadata and/or storage servers, but they are able to LNET ping each
>> > > other. Clients which have not been rebooted are working fine and still
>> > > have their mounts on our Lustre filesystem.
>> > >
>> > > Lustre client log:
>> > >
>> > > Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>> > > LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> > >
>> > > LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log 'pgdata01-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
>> > > LustreError: 3812:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
>> > > Lustre: Unmounted pgdata01-client
>> > > LustreError: 3812:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount (-5)
>> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib rejected: consumer defined fatal error
>> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
>> > > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1492789626/real 1492789626] req@ffff88105af2cc00 x1565303228072004/t0(0) o250->MGC172.23.55.211@o2ib@172.23.55.212@o2ib:26/25 lens 400/544 e 0 to 1 dl 1492789631 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>> > > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1 previous similar message
>> > > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff882041ffc000 x1565303228071996/t0(0) o101->MGC172.23.55.211@o2ib@172.23.55.211@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
>> > > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
>> > > LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log 'pghome01-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
>> > > LustreError: 3826:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
>> > >
>> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib rejected: consumer defined fatal error
>> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
>> > > LNet: 3755:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.211@o2ib failed: 5
>> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
>> > > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
>> > > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.211@o2ib: connection failed
>> > > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.212@o2ib: connection failed
>> > > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.212@o2ib failed: 5
>> > > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Skipped 17 previous similar messages
>> > > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.211@o2ib: connection failed
>> > > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.212@o2ib failed: 5
>> > > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.212@o2ib: connection failed
>> > >
>> > > LNET ping of a metadata-node:
>> > >
>> > > [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
>> > > failed to ping 172.23.55.211@o2ib: Input/output error
>> > >
>> > > LNET ping of the second metadata node:
>> > >
>> > > [root@pg-gpu01 ~]# lctl ping 172.23.55.212@o2ib
>> > > failed to ping 172.23.55.212@o2ib: Input/output error
>> > >
>> > > LNET ping of a random compute node:
>> > >
>> > > [root@pg-gpu01 ~]# lctl ping 172.23.52.5@o2ib
>> > > 12345-0@lo
>> > > 12345-172.23.52.5@o2ib
>> > >
>> > > LNET ping to OST01:
>> > >
>> > > [root@pg-gpu01 ~]# lctl ping 172.23.55.201@o2ib
>> > > failed to ping 172.23.55.201@o2ib: Input/output error
>> > >
>> > > LNET ping to OST02:
>> > >
>> > > [root@pg-gpu01 ~]# lctl ping 172.23.55.202@o2ib
>> > > failed to ping 172.23.55.202@o2ib: Input/output error
>> > >
>> > > 'Normal' pings (at the IP level) work fine:
>> > >
>> > > [root@pg-gpu01 ~]# ping 172.23.55.201
>> > > PING 172.23.55.201 (172.23.55.201) 56(84) bytes of data.
>> > > 64 bytes from 172.23.55.201: icmp_seq=1 ttl=64 time=0.741 ms
>> > >
>> > > [root@pg-gpu01 ~]# ping 172.23.55.202
>> > > PING 172.23.55.202 (172.23.55.202) 56(84) bytes of data.
>> > > 64 bytes from 172.23.55.202: icmp_seq=1 ttl=64 time=0.704 ms
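
The pattern seen here (IP ping succeeds, LNET ping fails) can be checked for
every server in one pass. A rough sketch: check_peer and the PING/LCTL
override variables are hypothetical names introduced so the function can be
exercised without a Lustre client, and the default network name o2ib is taken
from the NIDs in this thread.

```shell
#!/bin/sh
# Report, for one address, whether an IP-level ping and an LNET ping succeed.
# Usage: check_peer <ip> [lnet-network]   (network defaults to o2ib)
# PING and LCTL may be set to substitute commands; they default to the
# real ping and lctl binaries.
check_peer() {
    ip=$1
    net=${2:-o2ib}
    if ${PING:-ping} -c 1 -W 2 "$ip" >/dev/null 2>&1; then
        ip_res=ok
    else
        ip_res=FAIL
    fi
    if ${LCTL:-lctl} ping "$ip@$net" >/dev/null 2>&1; then
        lnet_res=ok
    else
        lnet_res=FAIL
    fi
    echo "$ip ip=$ip_res lnet=$lnet_res"
}
```

Calling check_peer in a loop over 172.23.55.211, 172.23.55.212, 172.23.55.201
and 172.23.55.202 on both a rebooted and a non-rebooted client gives a compact
table of which layer fails where.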
>> > >
>> > > lctl on a rebooted node:
>> > >
>> > > [root@pg-gpu01 ~]# lctl dl
>> > >
>> > > lctl on a node that has not been rebooted:
>> > >
>> > > [root@pg-node005 ~]# lctl dl
>> > >   0 UP mgc MGC172.23.55.211@o2ib 94bd1c8a-512f-b920-9a4e-a6aced3d386d 5
>> > >   1 UP lov pgtemp01-clilov-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 4
>> > >   2 UP lmv pgtemp01-clilmv-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 4
>> > >   3 UP mdc pgtemp01-MDT0000-mdc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >   4 UP osc pgtemp01-OST0001-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >   5 UP osc pgtemp01-OST0003-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >   6 UP osc pgtemp01-OST0005-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >   7 UP osc pgtemp01-OST0007-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >   8 UP osc pgtemp01-OST0009-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >   9 UP osc pgtemp01-OST000b-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  10 UP osc pgtemp01-OST000d-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  11 UP osc pgtemp01-OST000f-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  12 UP osc pgtemp01-OST0011-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  13 UP osc pgtemp01-OST0002-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  14 UP osc pgtemp01-OST0004-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  15 UP osc pgtemp01-OST0006-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  16 UP osc pgtemp01-OST0008-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  17 UP osc pgtemp01-OST000a-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  18 UP osc pgtemp01-OST000c-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  19 UP osc pgtemp01-OST000e-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  20 UP osc pgtemp01-OST0010-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  21 UP osc pgtemp01-OST0012-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  22 UP osc pgtemp01-OST0013-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  23 UP osc pgtemp01-OST0015-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  24 UP osc pgtemp01-OST0017-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  25 UP osc pgtemp01-OST0014-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  26 UP osc pgtemp01-OST0016-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  27 UP osc pgtemp01-OST0018-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> > >  28 UP lov pgdata01-clilov-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 4
>> > >  29 UP lmv pgdata01-clilmv-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 4
>> > >  30 UP mdc pgdata01-MDT0000-mdc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  31 UP osc pgdata01-OST0001-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  32 UP osc pgdata01-OST0003-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  33 UP osc pgdata01-OST0005-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  34 UP osc pgdata01-OST0007-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  35 UP osc pgdata01-OST0009-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  36 UP osc pgdata01-OST000b-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  37 UP osc pgdata01-OST000d-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  38 UP osc pgdata01-OST000f-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  39 UP osc pgdata01-OST0002-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  40 UP osc pgdata01-OST0004-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  41 UP osc pgdata01-OST0006-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  42 UP osc pgdata01-OST0008-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  43 UP osc pgdata01-OST000a-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  44 UP osc pgdata01-OST000c-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  45 UP osc pgdata01-OST000e-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  46 UP osc pgdata01-OST0010-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  47 UP osc pgdata01-OST0013-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  48 UP osc pgdata01-OST0015-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  49 UP osc pgdata01-OST0017-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  50 UP osc pgdata01-OST0014-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  51 UP osc pgdata01-OST0016-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  52 UP osc pgdata01-OST0018-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  53 UP osc pgdata01-OST0019-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  54 UP osc pgdata01-OST001a-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  55 UP osc pgdata01-OST001b-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
>> > >  56 UP lov pghome01-clilov-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
>> > >  57 UP lmv pghome01-clilmv-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
>> > >  58 UP mdc pghome01-MDT0000-mdc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>> > >  59 UP osc pghome01-OST0011-osc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>> > >  60 UP osc pghome01-OST0012-osc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>> > >
>> > > Please help; any clues/advice/hints/tips are appreciated.
>> > >
>> > > --
>> > >
>> > > Vriendelijke groet,
>> > >
>> > > Ger Strikwerda
>> > > Chef Special
>> > > Rijksuniversiteit Groningen
>> > > Centrum voor Informatie Technologie
>> > > Unit Pragmatisch Systeembeheer
>> > >
>> > > Smitsborg
>> > > Nettelbosje 1
>> > > 9747 AJ Groningen
>> > > Tel. 050 363 9276
>> > >
>> > > "God is hard, God is fair
>> > >  some men he gave brains, others he gave hair"
>> > >
>> > >
>> > > _______________________________________________
>> > > lustre-discuss mailing list
>> > > lustre-discuss at lists.lustre.org
>> > > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> > >
>> >
>> >
>> >
>>
>>
>>
>
>
>