[lustre-discuss] client fails to mount

Strikwerda, Ger g.j.c.strikwerda at rug.nl
Mon Apr 24 07:19:21 PDT 2017


Hi Russell,

Thanks for the IB subnet clues:

[root at pg-gpu01 ~]# ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.32.5100
        node_guid:                      f452:1403:00f5:4620
        sys_image_guid:                 f452:1403:00f5:4623
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x1
        board_id:                       MT_1100120019
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               185
                        port_lmc:               0x00
                        link_layer:             InfiniBand

[root at pg-gpu01 ~]# sminfo
sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098
priority 0 state 3 SMINFO_MASTER

It looks like the rebooted node can reach the IB subnet manager: the port
is in state PORT_ACTIVE, not PORT_INIT, and sminfo reports a master SM at
LID 1.
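
To run the same check on several rebooted nodes, the port state can be
pulled out of the ibv_devinfo output with a short script. A minimal sketch
in Python, assuming the output layout shown above; the function name
port_states and the trimmed SAMPLE text are only illustrative and not part
of any IB tool:

```python
import re


def port_states(ibv_devinfo_output):
    """Return {(hca_id, port_number): state_name} parsed from ibv_devinfo text."""
    states = {}
    hca = None
    port = None
    for raw in ibv_devinfo_output.splitlines():
        line = raw.strip()
        m = re.match(r"hca_id:\s*(\S+)", line)
        if m:
            # A new HCA block starts; forget the previous port number.
            hca, port = m.group(1), None
            continue
        m = re.match(r"port:\s*(\d+)", line)
        if m:
            port = int(m.group(1))
            continue
        m = re.match(r"state:\s*(\S+)", line)
        if m and hca is not None and port is not None:
            states[(hca, port)] = m.group(1)
    return states


# Trimmed copy of the output above, used here only as example input.
SAMPLE = """\
hca_id: mlx4_0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        sm_lid:                 1
                        port_lid:               185
"""

if __name__ == "__main__":
    for (hca, port), state in sorted(port_states(SAMPLE).items()):
        # PORT_INIT would point at a subnet manager problem.
        flag = "ok" if state == "PORT_ACTIVE" else "CHECK SUBNET MANAGER"
        print(hca, port, state, flag)
```

On a real node the text would come from running ibv_devinfo, for example
through subprocess.run(["ibv_devinfo"], capture_output=True, text=True).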




On Mon, Apr 24, 2017 at 4:14 PM, Russell Dekema <dekemar at umich.edu> wrote:

> At first glance, this sounds like your InfiniBand subnet manager may
> be down or malfunctioning. In this case, nodes which were already up
> when the subnet manager was working will continue to be able to
> communicate over IB, but nodes which reboot after the SM goes down
> will not.
>
> You can test this theory by running the 'ibv_devinfo' command on one
> of your rebooted nodes. If the relevant IB port is in state PORT_INIT,
> this confirms there is a problem with your subnet manager.
>
> Sincerely,
> Rusty Dekema
>
>
>
>
> On Mon, Apr 24, 2017 at 9:57 AM, Strikwerda, Ger
> <g.j.c.strikwerda at rug.nl> wrote:
> > Hi everybody,
> >
> > Here at the University of Groningen we are experiencing a strange Lustre
> > error. If a client reboots, it fails to mount the Lustre storage. The
> > client is not able to reach the MGS service. The storage and nodes
> > communicate over IB, until now without any problems. It looks like an
> > issue inside LNET. Clients cannot LNET ping/connect to the metadata
> > and/or storage servers, but the clients are able to LNET ping each other.
> > Clients which have not been rebooted are working fine and still have
> > their mounts on our Lustre filesystem.
> >
> > Lustre client log:
> >
> > Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> > LNet: Added LNI 172.23.54.51 at o2ib [8/256/0/180]
> >
> > LustreError: 15c-8: MGC172.23.55.211 at o2ib: The configuration from log 'pgdata01-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> > LustreError: 3812:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
> > Lustre: Unmounted pgdata01-client
> > LustreError: 3812:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount (-5)
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212 at o2ib rejected: consumer defined fatal error
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1492789626/real 1492789626] req at ffff88105af2cc00 x1565303228072004/t0(0) o250->MGC172.23.55.211 at o2ib@172.23.55.212 at o2ib:26/25 lens 400/544 e 0 to 1 dl 1492789631 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1 previous similar message
> > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req at ffff882041ffc000 x1565303228071996/t0(0) o101->MGC172.23.55.211 at o2ib@172.23.55.211 at o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
> > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
> > LustreError: 15c-8: MGC172.23.55.211 at o2ib: The configuration from log 'pghome01-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> > LustreError: 3826:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
> >
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212 at o2ib rejected: consumer defined fatal error
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> > LNet: 3755:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.211 at o2ib failed: 5
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211 at o2ib rejected: consumer defined fatal error
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.211 at o2ib: connection failed
> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.212 at o2ib: connection failed
> > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.212 at o2ib failed: 5
> > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Skipped 17 previous similar messages
> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.211 at o2ib: connection failed
> > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.212 at o2ib failed: 5
> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.212 at o2ib: connection failed
> >
> > LNET ping of a metadata-node:
> >
> > [root at pg-gpu01 ~]# lctl ping 172.23.55.211 at o2ib
> > failed to ping 172.23.55.211 at o2ib: Input/output error
> >
> > LNET ping of the second metadata node:
> >
> > [root at pg-gpu01 ~]# lctl ping 172.23.55.212 at o2ib
> > failed to ping 172.23.55.212 at o2ib: Input/output error
> >
> > LNET ping of a random compute-node:
> >
> > [root at pg-gpu01 ~]# lctl ping 172.23.52.5 at o2ib
> > 12345-0 at lo
> > 12345-172.23.52.5 at o2ib
> >
> > LNET to OST01:
> >
> > [root at pg-gpu01 ~]# lctl ping 172.23.55.201 at o2ib
> > failed to ping 172.23.55.201 at o2ib: Input/output error
> >
> > LNET to OST02:
> >
> > [root at pg-gpu01 ~]# lctl ping 172.23.55.202 at o2ib
> > failed to ping 172.23.55.202 at o2ib: Input/output error
> >
> > 'Normal' pings (at the IP level) work fine:
> >
> > [root at pg-gpu01 ~]# ping 172.23.55.201
> > PING 172.23.55.201 (172.23.55.201) 56(84) bytes of data.
> > 64 bytes from 172.23.55.201: icmp_seq=1 ttl=64 time=0.741 ms
> >
> > [root at pg-gpu01 ~]# ping 172.23.55.202
> > PING 172.23.55.202 (172.23.55.202) 56(84) bytes of data.
> > 64 bytes from 172.23.55.202: icmp_seq=1 ttl=64 time=0.704 ms
> >
> > lctl on a rebooted node:
> >
> > [root at pg-gpu01 ~]# lctl dl
> >
> > lctl on a node that has not been rebooted:
> >
> > [root at pg-node005 ~]# lctl dl
> >   0 UP mgc MGC172.23.55.211 at o2ib 94bd1c8a-512f-b920-9a4e-a6aced3d386d 5
> >   1 UP lov pgtemp01-clilov-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 4
> >   2 UP lmv pgtemp01-clilmv-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 4
> >   3 UP mdc pgtemp01-MDT0000-mdc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >   4 UP osc pgtemp01-OST0001-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >   5 UP osc pgtemp01-OST0003-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >   6 UP osc pgtemp01-OST0005-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >   7 UP osc pgtemp01-OST0007-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >   8 UP osc pgtemp01-OST0009-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >   9 UP osc pgtemp01-OST000b-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  10 UP osc pgtemp01-OST000d-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  11 UP osc pgtemp01-OST000f-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  12 UP osc pgtemp01-OST0011-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  13 UP osc pgtemp01-OST0002-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  14 UP osc pgtemp01-OST0004-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  15 UP osc pgtemp01-OST0006-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  16 UP osc pgtemp01-OST0008-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  17 UP osc pgtemp01-OST000a-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  18 UP osc pgtemp01-OST000c-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  19 UP osc pgtemp01-OST000e-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  20 UP osc pgtemp01-OST0010-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  21 UP osc pgtemp01-OST0012-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  22 UP osc pgtemp01-OST0013-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  23 UP osc pgtemp01-OST0015-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  24 UP osc pgtemp01-OST0017-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  25 UP osc pgtemp01-OST0014-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  26 UP osc pgtemp01-OST0016-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  27 UP osc pgtemp01-OST0018-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  28 UP lov pgdata01-clilov-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 4
> >  29 UP lmv pgdata01-clilmv-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 4
> >  30 UP mdc pgdata01-MDT0000-mdc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  31 UP osc pgdata01-OST0001-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  32 UP osc pgdata01-OST0003-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  33 UP osc pgdata01-OST0005-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  34 UP osc pgdata01-OST0007-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  35 UP osc pgdata01-OST0009-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  36 UP osc pgdata01-OST000b-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  37 UP osc pgdata01-OST000d-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  38 UP osc pgdata01-OST000f-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  39 UP osc pgdata01-OST0002-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  40 UP osc pgdata01-OST0004-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  41 UP osc pgdata01-OST0006-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  42 UP osc pgdata01-OST0008-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  43 UP osc pgdata01-OST000a-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  44 UP osc pgdata01-OST000c-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  45 UP osc pgdata01-OST000e-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  46 UP osc pgdata01-OST0010-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  47 UP osc pgdata01-OST0013-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  48 UP osc pgdata01-OST0015-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  49 UP osc pgdata01-OST0017-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  50 UP osc pgdata01-OST0014-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  51 UP osc pgdata01-OST0016-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  52 UP osc pgdata01-OST0018-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  53 UP osc pgdata01-OST0019-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  54 UP osc pgdata01-OST001a-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  55 UP osc pgdata01-OST001b-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> >  56 UP lov pghome01-clilov-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
> >  57 UP lmv pghome01-clilmv-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
> >  58 UP mdc pghome01-MDT0000-mdc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> >  59 UP osc pghome01-OST0011-osc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> >  60 UP osc pghome01-OST0012-osc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> >
> > Please help, any clues/advice/hints/tips are appreciated.
> >
> > --
> >
> > Vriendelijke groet,
> >
> > Ger Strikwerda
> > Chef Special
> > Rijksuniversiteit Groningen
> > Centrum voor Informatie Technologie
> > Unit Pragmatisch Systeembeheer
> >
> > Smitsborg
> > Nettelbosje 1
> > 9747 AJ Groningen
> > Tel. 050 363 9276
> >
> > "God is hard, God is fair
> >  some men he gave brains, others he gave hair"
> >
> >
>
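
The lctl ping checks quoted above can be scripted so that servers and
clients are tested in one pass. A rough sketch in Python, assuming the
success and failure output formats shown in the transcripts; the helper
names ping_succeeded and summarize are made up for illustration, and NIDs
are written as address@network, the form lctl expects:

```python
import re


def ping_succeeded(lctl_ping_output):
    """Classify the text printed by one 'lctl ping <nid>' run."""
    lines = [l.strip() for l in lctl_ping_output.splitlines() if l.strip()]
    if any(l.startswith("failed to ping") for l in lines):
        return False
    # A successful ping lists the peer's NIDs, e.g. "12345-0@lo".
    return any(re.fullmatch(r"\d+-\S+@\S+", l) for l in lines)


def summarize(outputs_by_nid):
    """Split {nid: output_text} into (reachable, unreachable) sorted lists."""
    reachable = sorted(n for n, out in outputs_by_nid.items() if ping_succeeded(out))
    unreachable = sorted(n for n in outputs_by_nid if n not in reachable)
    return reachable, unreachable


# Example input copied from two of the transcripts above.
EXAMPLE = {
    "172.23.55.211@o2ib": "failed to ping 172.23.55.211@o2ib: Input/output error\n",
    "172.23.52.5@o2ib": "12345-0@lo\n12345-172.23.52.5@o2ib\n",
}

if __name__ == "__main__":
    ok, bad = summarize(EXAMPLE)
    print("reachable:  ", ", ".join(ok))
    print("unreachable:", ", ".join(bad))
```

Sorting the results this way makes it easy to see whether every
unreachable NID sits in the server range (172.23.55.x here) while client
NIDs still answer.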



-- 

Vriendelijke groet,

Ger Strikwerda
Chef Special
Rijksuniversiteit Groningen
Centrum voor Informatie Technologie
Unit Pragmatisch Systeembeheer

Smitsborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9276
"God is hard, God is fair
 some men he gave brains, others he gave hair"