[lustre-discuss] client fails to mount

Russell Dekema dekemar at umich.edu
Mon Apr 24 07:27:07 PDT 2017


Oh, ok, that seems to rule the subnet manager out.
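
For reference, the check suggested earlier in the thread (a port stuck
in PORT_INIT would point at the subnet manager, PORT_ACTIVE does not)
can be sketched as a small Python helper that scans `ibv_devinfo`
output. This is only a rough sketch: the names `port_states` and
`sm_suspect` are made up for illustration, and the sample text is a
trimmed copy of the output quoted below.

```python
import re

# Trimmed sample of `ibv_devinfo` output, copied from the quoted message.
SAMPLE = """\
hca_id: mlx4_0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        sm_lid:                 1
                        port_lid:               185
"""


def port_states(devinfo_text):
    """Return {(hca_id, port_number): state_name} parsed from ibv_devinfo text."""
    states = {}
    hca = port = None
    for raw in devinfo_text.splitlines():
        line = raw.strip()
        if m := re.match(r"hca_id:\s*(\S+)", line):
            hca = m.group(1)
        elif m := re.match(r"port:\s*(\d+)", line):
            port = int(m.group(1))
        elif m := re.match(r"state:\s*(\w+)", line):
            states[(hca, port)] = m.group(1)
    return states


def sm_suspect(states):
    """List the ports stuck in PORT_INIT (hypothetical helper for this check)."""
    return [key for key, state in states.items() if state == "PORT_INIT"]


print(port_states(SAMPLE))
print(sm_suspect(port_states(SAMPLE)))
```

An empty list from `sm_suspect` matches the conclusion above: the port
is active, so the subnet manager is not the obvious culprit.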

I misread your IP network numbers earlier and thought you had not
tried a regular IP ping across your IPoIB interfaces. On re-reading
your initial message, I see that you have tried this and it does work,
even between a client with non-working Lustre and your MGS/MDS.
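
To keep track of which peers answer at the LNet level, the `lctl ping`
results from the initial message can be tabulated roughly as follows.
This is a sketch under assumptions: the helper name `lnet_reachable`
is invented for illustration, NIDs are written in their usual
`address@network` form, and only the two output shapes seen in that
message are handled (a `failed to ping ...` line, or a list of
`12345-<nid>` lines).

```python
import re

# NID pinged -> output, copied from the lctl ping runs in the initial message.
RESULTS = {
    "172.23.55.211@o2ib": "failed to ping 172.23.55.211@o2ib: Input/output error",
    "172.23.55.212@o2ib": "failed to ping 172.23.55.212@o2ib: Input/output error",
    "172.23.52.5@o2ib": "12345-0@lo\n12345-172.23.52.5@o2ib",
    "172.23.55.201@o2ib": "failed to ping 172.23.55.201@o2ib: Input/output error",
    "172.23.55.202@o2ib": "failed to ping 172.23.55.202@o2ib: Input/output error",
}


def lnet_reachable(output):
    """True if the output lists at least one peer NID and has no failure line."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if any(line.startswith("failed to ping") for line in lines):
        return False
    return any(re.fullmatch(r"\d+-\S+@\S+", line) for line in lines)


unreachable = sorted(nid for nid, out in RESULTS.items() if not lnet_reachable(out))
print(unreachable)
```

With the data above, only the compute node (172.23.52.5) answers over
LNet; all four servers in 172.23.55.x fail, even though the IP-level
pings to two of them succeed.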

In this case, I have no further suggestions.

Best of luck,
Rusty D.

On Mon, Apr 24, 2017 at 10:19 AM, Strikwerda, Ger
<g.j.c.strikwerda at rug.nl> wrote:
> Hi Russell,
>
> Thanks for the IB subnet clues:
>
> [root@pg-gpu01 ~]# ibv_devinfo
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.32.5100
>         node_guid:                      f452:1403:00f5:4620
>         sys_image_guid:                 f452:1403:00f5:4623
>         vendor_id:                      0x02c9
>         vendor_part_id:                 4099
>         hw_ver:                         0x1
>         board_id:                       MT_1100120019
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 1
>                         port_lid:               185
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
>
> [root@pg-gpu01 ~]# sminfo
> sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098
> priority 0 state 3 SMINFO_MASTER
>
> It looks like the rebooted node is able to contact the IB subnet manager.
>
>
>
>
> On Mon, Apr 24, 2017 at 4:14 PM, Russell Dekema <dekemar at umich.edu> wrote:
>>
>> At first glance, this sounds like your InfiniBand subnet manager may
>> be down or malfunctioning. In this case, nodes which were already up
>> when the subnet manager was working will continue to be able to
>> communicate over IB, but nodes which reboot after the SM goes down
>> will not.
>>
>> You can test this theory by running the 'ibv_devinfo' command on one
>> of your rebooted nodes. If the relevant IB port is in state PORT_INIT,
>> this confirms there is a problem with your subnet manager.
>>
>> Sincerely,
>> Rusty Dekema
>>
>>
>>
>>
>> On Mon, Apr 24, 2017 at 9:57 AM, Strikwerda, Ger
>> <g.j.c.strikwerda at rug.nl> wrote:
>> > Hi everybody,
>> >
>> > Here at the University of Groningen we are experiencing a strange
>> > Lustre error. If a client reboots, it fails to mount the Lustre
>> > storage. The client is not able to reach the MGS service. The storage
>> > and nodes communicate over IB, and until now did so without any
>> > problems. It looks like an issue inside LNET. Rebooted clients cannot
>> > LNET ping/connect to the metadata and/or storage servers, but they are
>> > able to LNET ping each other. Clients which have not been rebooted are
>> > working fine and still have their mounts on our Lustre filesystem.
>> >
>> > Lustre client log:
>> >
>> > Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
>> > LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>> >
>> > LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log
>> > 'pgdata01-client' failed (-5). This may be the result of communication
>> > errors between this node and the MGS, a bad configuration, or other
>> > errors.
>> > See the syslog for more information.
>> > LustreError: 3812:0:(llite_lib.c:1046:ll_fill_super()) Unable to process
>> > log: -5
>> > Lustre: Unmounted pgdata01-client
>> > LustreError: 3812:0:(obd_mount.c:1325:lustre_fill_super()) Unable to
>> > mount
>> > (-5)
>> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>> > 172.23.55.212@o2ib
>> > rejected: consumer defined fatal error
>> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1
>> > previous
>> > similar message
>> > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request
>> > sent
>> > has failed due to network error: [sent 1492789626/real 1492789626]
>> > req@ffff88105af2cc00 x1565303228072004/t0(0)
>> > o250->MGC172.23.55.211@o2ib@172.23.55.212@o2ib:26/25 lens 400/544 e 0 to
>> > 1
>> > dl 1492789631 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
>> > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1
>> > previous similar message
>> > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send
>> > limit
>> > expired   req@ffff882041ffc000 x1565303228071996/t0(0)
>> > o101->MGC172.23.55.211@o2ib@172.23.55.211@o2ib:26/25 lens 328/344 e 0 to
>> > 0
>> > dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
>> > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 2
>> > previous similar messages
>> > LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log
>> > 'pghome01-client' failed (-5). This may be the result of communication
>> > errors between this node and the MGS, a bad configuration, or other
>> > errors.
>> > See the syslog for more information.
>> > LustreError: 3826:0:(llite_lib.c:1046:ll_fill_super()) Unable to process
>> > log: -5
>> >
>> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>> > 172.23.55.212@o2ib
>> > rejected: consumer defined fatal error
>> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1
>> > previous
>> > similar message
>> > LNet: 3755:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
>> > 172.23.55.211@o2ib failed: 5
>> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected())
>> > 172.23.55.211@o2ib
>> > rejected: consumer defined fatal error
>> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1
>> > previous
>> > similar message
>> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
>> > messages for 172.23.55.211@o2ib: connection failed
>> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
>> > messages for 172.23.55.212@o2ib: connection failed
>> > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
>> > 172.23.55.212@o2ib failed: 5
>> > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Skipped 17 previous
>> > similar messages
>> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
>> > messages for 172.23.55.211@o2ib: connection failed
>> > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
>> > 172.23.55.212@o2ib failed: 5
>> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
>> > messages for 172.23.55.212@o2ib: connection failed
>> >
>> > LNET ping of a metadata node:
>> >
>> > [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
>> > failed to ping 172.23.55.211@o2ib: Input/output error
>> >
>> > LNET ping of the second metadata node:
>> >
>> > [root@pg-gpu01 ~]# lctl ping 172.23.55.212@o2ib
>> > failed to ping 172.23.55.212@o2ib: Input/output error
>> >
>> > LNET ping of a random compute node:
>> >
>> > [root@pg-gpu01 ~]# lctl ping 172.23.52.5@o2ib
>> > 12345-0@lo
>> > 12345-172.23.52.5@o2ib
>> >
>> > LNET ping to OST01:
>> >
>> > [root@pg-gpu01 ~]# lctl ping 172.23.55.201@o2ib
>> > failed to ping 172.23.55.201@o2ib: Input/output error
>> >
>> > LNET ping to OST02:
>> >
>> > [root@pg-gpu01 ~]# lctl ping 172.23.55.202@o2ib
>> > failed to ping 172.23.55.202@o2ib: Input/output error
>> >
>> > 'Normal' pings (at the IP level) work fine:
>> >
>> > [root@pg-gpu01 ~]# ping 172.23.55.201
>> > PING 172.23.55.201 (172.23.55.201) 56(84) bytes of data.
>> > 64 bytes from 172.23.55.201: icmp_seq=1 ttl=64 time=0.741 ms
>> >
>> > [root@pg-gpu01 ~]# ping 172.23.55.202
>> > PING 172.23.55.202 (172.23.55.202) 56(84) bytes of data.
>> > 64 bytes from 172.23.55.202: icmp_seq=1 ttl=64 time=0.704 ms
>> >
>> > lctl on a rebooted node:
>> >
>> > [root@pg-gpu01 ~]# lctl dl
>> >
>> > lctl on a node that has not been rebooted:
>> >
>> > [root@pg-node005 ~]# lctl dl
>> >   0 UP mgc MGC172.23.55.211@o2ib 94bd1c8a-512f-b920-9a4e-a6aced3d386d 5
>> >   1 UP lov pgtemp01-clilov-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 4
>> >   2 UP lmv pgtemp01-clilmv-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 4
>> >   3 UP mdc pgtemp01-MDT0000-mdc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >   4 UP osc pgtemp01-OST0001-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >   5 UP osc pgtemp01-OST0003-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >   6 UP osc pgtemp01-OST0005-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >   7 UP osc pgtemp01-OST0007-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >   8 UP osc pgtemp01-OST0009-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >   9 UP osc pgtemp01-OST000b-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  10 UP osc pgtemp01-OST000d-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  11 UP osc pgtemp01-OST000f-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  12 UP osc pgtemp01-OST0011-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  13 UP osc pgtemp01-OST0002-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  14 UP osc pgtemp01-OST0004-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  15 UP osc pgtemp01-OST0006-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  16 UP osc pgtemp01-OST0008-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  17 UP osc pgtemp01-OST000a-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  18 UP osc pgtemp01-OST000c-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  19 UP osc pgtemp01-OST000e-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  20 UP osc pgtemp01-OST0010-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  21 UP osc pgtemp01-OST0012-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  22 UP osc pgtemp01-OST0013-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  23 UP osc pgtemp01-OST0015-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  24 UP osc pgtemp01-OST0017-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  25 UP osc pgtemp01-OST0014-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  26 UP osc pgtemp01-OST0016-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  27 UP osc pgtemp01-OST0018-osc-ffff88206906d400
>> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
>> >  28 UP lov pgdata01-clilov-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 4
>> >  29 UP lmv pgdata01-clilmv-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 4
>> >  30 UP mdc pgdata01-MDT0000-mdc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  31 UP osc pgdata01-OST0001-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  32 UP osc pgdata01-OST0003-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  33 UP osc pgdata01-OST0005-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  34 UP osc pgdata01-OST0007-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  35 UP osc pgdata01-OST0009-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  36 UP osc pgdata01-OST000b-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  37 UP osc pgdata01-OST000d-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  38 UP osc pgdata01-OST000f-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  39 UP osc pgdata01-OST0002-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  40 UP osc pgdata01-OST0004-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  41 UP osc pgdata01-OST0006-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  42 UP osc pgdata01-OST0008-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  43 UP osc pgdata01-OST000a-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  44 UP osc pgdata01-OST000c-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  45 UP osc pgdata01-OST000e-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  46 UP osc pgdata01-OST0010-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  47 UP osc pgdata01-OST0013-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  48 UP osc pgdata01-OST0015-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  49 UP osc pgdata01-OST0017-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  50 UP osc pgdata01-OST0014-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  51 UP osc pgdata01-OST0016-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  52 UP osc pgdata01-OST0018-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  53 UP osc pgdata01-OST0019-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  54 UP osc pgdata01-OST001a-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  55 UP osc pgdata01-OST001b-osc-ffff88204bab6400
>> > 996b1742-82eb-281c-c322-e244672d5225 5
>> >  56 UP lov pghome01-clilov-ffff88204bb50000
>> > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
>> >  57 UP lmv pghome01-clilmv-ffff88204bb50000
>> > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
>> >  58 UP mdc pghome01-MDT0000-mdc-ffff88204bb50000
>> > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>> >  59 UP osc pghome01-OST0011-osc-ffff88204bb50000
>> > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>> >  60 UP osc pghome01-OST0012-osc-ffff88204bb50000
>> > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>> >
>> > Please help; any clues/advice/hints/tips are appreciated.
>> >
>> > --
>> >
>> > Vriendelijke groet,
>> >
>> > Ger Strikwerda
>> > Chef Special
>> > Rijksuniversiteit Groningen
>> > Centrum voor Informatie Technologie
>> > Unit Pragmatisch Systeembeheer
>> >
>> > Smitsborg
>> > Nettelbosje 1
>> > 9747 AJ Groningen
>> > Tel. 050 363 9276
>> >
>> > "God is hard, God is fair
>> >  some men he gave brains, others he gave hair"
>> >
>> >
>
>
>
>

