[lustre-discuss] client fails to mount

Mohr Jr, Richard Frank (Rick Mohr) rmohr at utk.edu
Mon Apr 24 07:59:21 PDT 2017


This might be a long shot, but have you checked whether firewall rules might be causing the issue?  I’m wondering if some rules were added to allow Lustre access after the nodes were already up but were never saved, so that a node loses them when it is rebooted.
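
In case it helps, here is a rough sketch of one way to look for that. It assumes the nodes use iptables and that, as on EL6, the persisted rules live in /etc/sysconfig/iptables. The function name rules_not_persisted and the /tmp path are made up for illustration. It prints the rules that are present in a dump of the running rules but missing from the saved file:

```shell
# Print rules found in a running-rules dump ($1) but absent from the
# persisted rules file ($2). Both are expected in iptables-save format.
rules_not_persisted() {
    running_dump=$1
    saved_file=$2
    tmp_running=$(mktemp) || return 1
    tmp_saved=$(mktemp) || { rm -f "$tmp_running"; return 1; }
    # Keep only rule lines ("-A ..."). Table headers, chain policy lines
    # with packet counters, comments and COMMIT lines are skipped.
    grep -e '^-A' "$running_dump" | LC_ALL=C sort -u > "$tmp_running"
    grep -e '^-A' "$saved_file" | LC_ALL=C sort -u > "$tmp_saved"
    # comm -23 keeps the lines that appear only in the first (running) list.
    LC_ALL=C comm -23 "$tmp_running" "$tmp_saved"
    rm -f "$tmp_running" "$tmp_saved"
}

# Hypothetical usage, as root, on a node that has NOT been rebooted:
#   iptables-save > /tmp/running.rules
#   rules_not_persisted /tmp/running.rules /etc/sysconfig/iptables
```

This is deliberately simple: rules from all tables are merged into one list and rule order is ignored, so any output is a hint to look closer rather than a verdict.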

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu


> On Apr 24, 2017, at 10:19 AM, Strikwerda, Ger <g.j.c.strikwerda at rug.nl> wrote:
> 
> Hi Russell,
> 
> Thanks for the IB subnet clues:
> 
> [root@pg-gpu01 ~]# ibv_devinfo
> hca_id: mlx4_0
>         transport:                      InfiniBand (0)
>         fw_ver:                         2.32.5100
>         node_guid:                      f452:1403:00f5:4620
>         sys_image_guid:                 f452:1403:00f5:4623
>         vendor_id:                      0x02c9
>         vendor_part_id:                 4099
>         hw_ver:                         0x1
>         board_id:                       MT_1100120019
>         phys_port_cnt:                  1
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                4096 (5)
>                         active_mtu:             4096 (5)
>                         sm_lid:                 1
>                         port_lid:               185
>                         port_lmc:               0x00
>                         link_layer:             InfiniBand
> 
> [root@pg-gpu01 ~]# sminfo
> sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098 priority 0 state 3 SMINFO_MASTER
> 
> It looks like the rebooted node is able to reach the IB fabric and the IB subnet manager.
> 
> 
> 
> 
> On Mon, Apr 24, 2017 at 4:14 PM, Russell Dekema <dekemar at umich.edu> wrote:
> At first glance, this sounds like your InfiniBand subnet manager may
> be down or malfunctioning. In this case, nodes which were already up
> when the subnet manager was working will continue to be able to
> communicate over IB, but nodes which reboot after the SM goes down
> will not.
> 
> You can test this theory by running the 'ibv_devinfo' command on one
> of your rebooted nodes. If the relevant IB port is in state PORT_INIT,
> this confirms there is a problem with your subnet manager.
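
A minimal sketch of that check as a script, assuming ibv_devinfo prints one "state:" line per port followed by a token such as PORT_ACTIVE or PORT_INIT. The function names ib_port_states and all_ports_active are hypothetical. Both only parse text on stdin, so they can be fed the live command output:

```shell
# Print the state token (e.g. PORT_ACTIVE, PORT_INIT) of every port
# found in ibv_devinfo output read from stdin.
ib_port_states() {
    awk '$1 == "state:" { print $2 }'
}

# Succeed only if at least one port was found and every port is PORT_ACTIVE.
all_ports_active() {
    states=$(ib_port_states)
    [ -n "$states" ] || return 1
    for s in $states; do
        [ "$s" = "PORT_ACTIVE" ] || return 1
    done
    return 0
}

# Usage on a rebooted node:
#   ibv_devinfo | all_ports_active || echo "port not active: check the subnet manager"
```

A port that is active only rules out this particular failure; it says nothing about the layers above IB.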
> 
> Sincerely,
> Rusty Dekema
> 
> 
> 
> 
> On Mon, Apr 24, 2017 at 9:57 AM, Strikwerda, Ger
> <g.j.c.strikwerda at rug.nl> wrote:
> > Hi everybody,
> >
> > Here at the University of Groningen we are experiencing a strange Lustre
> > error. If a client reboots, it fails to mount the Lustre storage. The client
> > is not able to reach the MGS. The storage and nodes communicate over IB,
> > and until now this has worked without any problems. It looks like an
> > issue inside LNET. Rebooted clients cannot LNET ping/connect to the metadata
> > and/or storage servers, but the clients are able to LNET ping each other.
> > Clients that have not been rebooted are working fine and still have their
> > mounts on our Lustre filesystem.
> >
> > Lustre client log:
> >
> > Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> > LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
> >
> > LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log
> > 'pgdata01-client' failed (-5). This may be the result of communication
> > errors between this node and the MGS, a bad configuration, or other errors.
> > See the syslog for more information.
> > LustreError: 3812:0:(llite_lib.c:1046:ll_fill_super()) Unable to process
> > log: -5
> > Lustre: Unmounted pgdata01-client
> > LustreError: 3812:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount
> > (-5)
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib
> > rejected: consumer defined fatal error
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous
> > similar message
> > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent
> > has failed due to network error: [sent 1492789626/real 1492789626]
> > req@ffff88105af2cc00 x1565303228072004/t0(0)
> > o250->MGC172.23.55.211@o2ib@172.23.55.212@o2ib:26/25 lens 400/544 e 0 to 1
> > dl 1492789631 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1
> > previous similar message
> > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit
> > expired   req@ffff882041ffc000 x1565303228071996/t0(0)
> > o101->MGC172.23.55.211@o2ib@172.23.55.211@o2ib:26/25 lens 328/344 e 0 to 0
> > dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
> > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 2
> > previous similar messages
> > LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log
> > 'pghome01-client' failed (-5). This may be the result of communication
> > errors between this node and the MGS, a bad configuration, or other errors.
> > See the syslog for more information.
> > LustreError: 3826:0:(llite_lib.c:1046:ll_fill_super()) Unable to process
> > log: -5
> >
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib
> > rejected: consumer defined fatal error
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous
> > similar message
> > LNet: 3755:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
> > 172.23.55.211@o2ib failed: 5
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib
> > rejected: consumer defined fatal error
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous
> > similar message
> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
> > messages for 172.23.55.211@o2ib: connection failed
> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
> > messages for 172.23.55.212@o2ib: connection failed
> > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
> > 172.23.55.212@o2ib failed: 5
> > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Skipped 17 previous
> > similar messages
> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
> > messages for 172.23.55.211@o2ib: connection failed
> > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
> > 172.23.55.212@o2ib failed: 5
> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
> > messages for 172.23.55.212@o2ib: connection failed
> >
> > LNET ping of a metadata node:
> >
> > [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
> > failed to ping 172.23.55.211@o2ib: Input/output error
> >
> > LNET ping of the second metadata node:
> >
> > [root@pg-gpu01 ~]# lctl ping 172.23.55.212@o2ib
> > failed to ping 172.23.55.212@o2ib: Input/output error
> >
> > LNET ping of a random compute node:
> >
> > [root@pg-gpu01 ~]# lctl ping 172.23.52.5@o2ib
> > 12345-0@lo
> > 12345-172.23.52.5@o2ib
> >
> > LNET ping to OST01:
> >
> > [root@pg-gpu01 ~]# lctl ping 172.23.55.201@o2ib
> > failed to ping 172.23.55.201@o2ib: Input/output error
> >
> > LNET ping to OST02:
> >
> > [root@pg-gpu01 ~]# lctl ping 172.23.55.202@o2ib
> > failed to ping 172.23.55.202@o2ib: Input/output error
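
The per-server LNET checks above can be repeated in one go with a small loop. This is only a sketch: the function name ping_nids and the LCTL override variable are hypothetical, and it assumes that lctl ping exits non-zero when the ping fails. NIDs are written in the usual address@network form:

```shell
# Run "lctl ping" against each NID given as an argument and print one
# OK/FAIL line per NID. Returns non-zero if any ping failed.
# LCTL may name another command (hypothetical override, handy for testing).
ping_nids() {
    failed=0
    for nid in "$@"; do
        if "${LCTL:-lctl}" ping "$nid" >/dev/null 2>&1; then
            echo "OK   $nid"
        else
            echo "FAIL $nid"
            failed=1
        fi
    done
    return "$failed"
}

# Usage with NIDs taken from the original post:
#   ping_nids 172.23.55.211@o2ib 172.23.55.212@o2ib 172.23.55.201@o2ib 172.23.52.5@o2ib
```

Running the same list from a rebooted node and from a node that has not been rebooted makes it easy to see which peers are affected.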
> >
> > 'Normal' pings (at the IP level) work fine:
> >
> > [root@pg-gpu01 ~]# ping 172.23.55.201
> > PING 172.23.55.201 (172.23.55.201) 56(84) bytes of data.
> > 64 bytes from 172.23.55.201: icmp_seq=1 ttl=64 time=0.741 ms
> >
> > [root@pg-gpu01 ~]# ping 172.23.55.202
> > PING 172.23.55.202 (172.23.55.202) 56(84) bytes of data.
> > 64 bytes from 172.23.55.202: icmp_seq=1 ttl=64 time=0.704 ms
> >
> > lctl on a rebooted node:
> >
> > [root@pg-gpu01 ~]# lctl dl
> >
> > lctl on a node that has not been rebooted:
> >
> > [root@pg-node005 ~]# lctl dl
> >   0 UP mgc MGC172.23.55.211@o2ib 94bd1c8a-512f-b920-9a4e-a6aced3d386d 5
> >   1 UP lov pgtemp01-clilov-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 4
> >   2 UP lmv pgtemp01-clilmv-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 4
> >   3 UP mdc pgtemp01-MDT0000-mdc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >   4 UP osc pgtemp01-OST0001-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >   5 UP osc pgtemp01-OST0003-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >   6 UP osc pgtemp01-OST0005-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >   7 UP osc pgtemp01-OST0007-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >   8 UP osc pgtemp01-OST0009-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >   9 UP osc pgtemp01-OST000b-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  10 UP osc pgtemp01-OST000d-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  11 UP osc pgtemp01-OST000f-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  12 UP osc pgtemp01-OST0011-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  13 UP osc pgtemp01-OST0002-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  14 UP osc pgtemp01-OST0004-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  15 UP osc pgtemp01-OST0006-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  16 UP osc pgtemp01-OST0008-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  17 UP osc pgtemp01-OST000a-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  18 UP osc pgtemp01-OST000c-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  19 UP osc pgtemp01-OST000e-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  20 UP osc pgtemp01-OST0010-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  21 UP osc pgtemp01-OST0012-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  22 UP osc pgtemp01-OST0013-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  23 UP osc pgtemp01-OST0015-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  24 UP osc pgtemp01-OST0017-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  25 UP osc pgtemp01-OST0014-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  26 UP osc pgtemp01-OST0016-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  27 UP osc pgtemp01-OST0018-osc-ffff88206906d400
> > 281c441f-8aa3-ab56-8812-e459d308f47c 5
> >  28 UP lov pgdata01-clilov-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 4
> >  29 UP lmv pgdata01-clilmv-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 4
> >  30 UP mdc pgdata01-MDT0000-mdc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  31 UP osc pgdata01-OST0001-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  32 UP osc pgdata01-OST0003-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  33 UP osc pgdata01-OST0005-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  34 UP osc pgdata01-OST0007-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  35 UP osc pgdata01-OST0009-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  36 UP osc pgdata01-OST000b-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  37 UP osc pgdata01-OST000d-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  38 UP osc pgdata01-OST000f-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  39 UP osc pgdata01-OST0002-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  40 UP osc pgdata01-OST0004-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  41 UP osc pgdata01-OST0006-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  42 UP osc pgdata01-OST0008-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  43 UP osc pgdata01-OST000a-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  44 UP osc pgdata01-OST000c-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  45 UP osc pgdata01-OST000e-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  46 UP osc pgdata01-OST0010-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  47 UP osc pgdata01-OST0013-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  48 UP osc pgdata01-OST0015-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  49 UP osc pgdata01-OST0017-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  50 UP osc pgdata01-OST0014-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  51 UP osc pgdata01-OST0016-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  52 UP osc pgdata01-OST0018-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  53 UP osc pgdata01-OST0019-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  54 UP osc pgdata01-OST001a-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  55 UP osc pgdata01-OST001b-osc-ffff88204bab6400
> > 996b1742-82eb-281c-c322-e244672d5225 5
> >  56 UP lov pghome01-clilov-ffff88204bb50000
> > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
> >  57 UP lmv pghome01-clilmv-ffff88204bb50000
> > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
> >  58 UP mdc pghome01-MDT0000-mdc-ffff88204bb50000
> > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> >  59 UP osc pghome01-OST0011-osc-ffff88204bb50000
> > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> >  60 UP osc pghome01-OST0012-osc-ffff88204bb50000
> > 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> >
> > Please help; any clues/advice/hints/tips are appreciated.
> >
> > --
> >
> > Vriendelijke groet,
> >
> > Ger Strikwerda
> > Chef Special
> > Rijksuniversiteit Groningen
> > Centrum voor Informatie Technologie
> > Unit Pragmatisch Systeembeheer
> >
> > Smitsborg
> > Nettelbosje 1
> > 9747 AJ Groningen
> > Tel. 050 363 9276
> >
> > "God is hard, God is fair
> >  some men he gave brains, others he gave hair"
> >
> >
> > _______________________________________________
> > lustre-discuss mailing list
> > lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> >



