[lustre-discuss] client fails to mount
Strikwerda, Ger
g.j.c.strikwerda at rug.nl
Mon Apr 24 07:19:21 PDT 2017
Hi Russell,
Thanks for the IB subnet clues:
[root@pg-gpu01 ~]# ibv_devinfo
hca_id: mlx4_0
        transport:              InfiniBand (0)
        fw_ver:                 2.32.5100
        node_guid:              f452:1403:00f5:4620
        sys_image_guid:         f452:1403:00f5:4623
        vendor_id:              0x02c9
        vendor_part_id:         4099
        hw_ver:                 0x1
        board_id:               MT_1100120019
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         1
                        port_lid:       185
                        port_lmc:       0x00
                        link_layer:     InfiniBand
[root@pg-gpu01 ~]# sminfo
sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098 priority 0 state 3 SMINFO_MASTER
It looks like the rebooted node is able to reach the IB subnet manager: the port is in state PORT_ACTIVE rather than PORT_INIT, and sminfo reports a master SM on lid 1.
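
To repeat the check Rusty describes on several rebooted nodes, the port state can be pulled out of captured ibv_devinfo output with a small script. This is only a rough sketch, assuming the output has already been saved as text; the function names port_states and ports_stuck_in_init are made up for this example, and the sample text is a trimmed copy of the output above.

```python
import re

# Trimmed copy of the ibv_devinfo output from pg-gpu01 (indentation is ignored).
SAMPLE = """\
hca_id: mlx4_0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
sm_lid: 1
port_lid: 185
"""


def port_states(devinfo_text):
    """Return (hca_id, port_number, state) tuples found in ibv_devinfo text."""
    results = []
    hca = None
    port = None
    for raw in devinfo_text.splitlines():
        line = raw.strip()
        m = re.match(r"hca_id:\s*(\S+)", line)
        if m:
            hca, port = m.group(1), None
            continue
        # Matches "port: 1" but not "port_lid:" or "phys_port_cnt:".
        m = re.match(r"port:\s*(\d+)", line)
        if m:
            port = int(m.group(1))
            continue
        m = re.match(r"state:\s*(PORT_\w+)", line)
        if m and port is not None:
            results.append((hca, port, m.group(1)))
    return results


def ports_stuck_in_init(devinfo_text):
    """Return (hca_id, port_number) for ports still in PORT_INIT.

    Per the advice quoted below, a port left in PORT_INIT after a reboot
    points at a subnet manager problem.
    """
    return [(h, p) for h, p, s in port_states(devinfo_text) if s == "PORT_INIT"]


if __name__ == "__main__":
    for hca, port, state in port_states(SAMPLE):
        print(hca, port, state)
    print("stuck in INIT:", ports_stuck_in_init(SAMPLE))
```

For the output above this finds no port in PORT_INIT, which matches the conclusion that the subnet manager is not the problem here.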
On Mon, Apr 24, 2017 at 4:14 PM, Russell Dekema <dekemar at umich.edu> wrote:
> At first glance, this sounds like your Infiniband subnet manager may
> be down or malfunctioning. In this case, nodes which were already up
> when the subnet manager was working will continue to be able to
> communicate over IB, but nodes which reboot after the SM goes down
> will not.
>
> You can test this theory by running the 'ibv_devinfo' command on one
> of your rebooted nodes. If the relevant IB port is in state PORT_INIT,
> this confirms there is a problem with your subnet manager.
>
> Sincerely,
> Rusty Dekema
>
>
>
>
> On Mon, Apr 24, 2017 at 9:57 AM, Strikwerda, Ger
> <g.j.c.strikwerda at rug.nl> wrote:
> > Hi everybody,
> >
> > Here at the University of Groningen we are experiencing a strange Lustre
> > error. If a client reboots, it fails to mount the Lustre storage. The
> > client is not able to reach the MGS service. The storage and nodes
> > communicate over IB, and until now this worked without any problems. It
> > looks like an issue inside LNET. Clients cannot LNET ping/connect to the
> > metadata and/or storage servers, but the clients are able to LNET ping
> > each other. Clients which have not been rebooted are working fine and
> > still have their mounts on our Lustre filesystem.
> >
> > Lustre client log:
> >
> > Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> > LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
> >
> > LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log 'pgdata01-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> > LustreError: 3812:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
> > Lustre: Unmounted pgdata01-client
> > LustreError: 3812:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount (-5)
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib rejected: consumer defined fatal error
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1492789626/real 1492789626] req@ffff88105af2cc00 x1565303228072004/t0(0) o250->MGC172.23.55.211@o2ib@172.23.55.212@o2ib:26/25 lens 400/544 e 0 to 1 dl 1492789631 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> > Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1 previous similar message
> > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff882041ffc000 x1565303228071996/t0(0) o101->MGC172.23.55.211@o2ib@172.23.55.211@o2ib:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
> > LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
> > LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log 'pghome01-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
> > LustreError: 3826:0:(llite_lib.c:1046:ll_fill_super()) Unable to process log: -5
> >
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib rejected: consumer defined fatal error
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> > LNet: 3755:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.211@o2ib failed: 5
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib rejected: consumer defined fatal error
> > LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous similar message
> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.211@o2ib: connection failed
> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.212@o2ib: connection failed
> > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.212@o2ib failed: 5
> > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Skipped 17 previous similar messages
> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.211@o2ib: connection failed
> > LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from 172.23.55.212@o2ib failed: 5
> > LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting messages for 172.23.55.212@o2ib: connection failed
> >
> > LNET ping of the first metadata node:
> >
> > [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
> > failed to ping 172.23.55.211@o2ib: Input/output error
> >
> > LNET ping of the second metadata node:
> >
> > [root@pg-gpu01 ~]# lctl ping 172.23.55.212@o2ib
> > failed to ping 172.23.55.212@o2ib: Input/output error
> >
> > LNET ping of a random compute node:
> >
> > [root@pg-gpu01 ~]# lctl ping 172.23.52.5@o2ib
> > 12345-0@lo
> > 12345-172.23.52.5@o2ib
> >
> > LNET ping to OST01:
> >
> > [root@pg-gpu01 ~]# lctl ping 172.23.55.201@o2ib
> > failed to ping 172.23.55.201@o2ib: Input/output error
> >
> > LNET ping to OST02:
> >
> > [root@pg-gpu01 ~]# lctl ping 172.23.55.202@o2ib
> > failed to ping 172.23.55.202@o2ib: Input/output error
> >
> > 'Normal' pings (at the IP level) work fine:
> >
> > [root@pg-gpu01 ~]# ping 172.23.55.201
> > PING 172.23.55.201 (172.23.55.201) 56(84) bytes of data.
> > 64 bytes from 172.23.55.201: icmp_seq=1 ttl=64 time=0.741 ms
> >
> > [root@pg-gpu01 ~]# ping 172.23.55.202
> > PING 172.23.55.202 (172.23.55.202) 56(84) bytes of data.
> > 64 bytes from 172.23.55.202: icmp_seq=1 ttl=64 time=0.704 ms
> >
> > lctl dl on a rebooted node (no devices listed):
> >
> > [root@pg-gpu01 ~]# lctl dl
> >
> > lctl dl on a node that has not been rebooted:
> >
> > [root@pg-node005 ~]# lctl dl
> > 0 UP mgc MGC172.23.55.211@o2ib 94bd1c8a-512f-b920-9a4e-a6aced3d386d 5
> > 1 UP lov pgtemp01-clilov-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 4
> > 2 UP lmv pgtemp01-clilmv-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 4
> > 3 UP mdc pgtemp01-MDT0000-mdc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 4 UP osc pgtemp01-OST0001-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 5 UP osc pgtemp01-OST0003-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 6 UP osc pgtemp01-OST0005-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 7 UP osc pgtemp01-OST0007-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 8 UP osc pgtemp01-OST0009-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 9 UP osc pgtemp01-OST000b-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 10 UP osc pgtemp01-OST000d-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 11 UP osc pgtemp01-OST000f-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 12 UP osc pgtemp01-OST0011-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 13 UP osc pgtemp01-OST0002-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 14 UP osc pgtemp01-OST0004-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 15 UP osc pgtemp01-OST0006-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 16 UP osc pgtemp01-OST0008-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 17 UP osc pgtemp01-OST000a-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 18 UP osc pgtemp01-OST000c-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 19 UP osc pgtemp01-OST000e-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 20 UP osc pgtemp01-OST0010-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 21 UP osc pgtemp01-OST0012-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 22 UP osc pgtemp01-OST0013-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 23 UP osc pgtemp01-OST0015-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 24 UP osc pgtemp01-OST0017-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 25 UP osc pgtemp01-OST0014-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 26 UP osc pgtemp01-OST0016-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 27 UP osc pgtemp01-OST0018-osc-ffff88206906d400 281c441f-8aa3-ab56-8812-e459d308f47c 5
> > 28 UP lov pgdata01-clilov-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 4
> > 29 UP lmv pgdata01-clilmv-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 4
> > 30 UP mdc pgdata01-MDT0000-mdc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 31 UP osc pgdata01-OST0001-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 32 UP osc pgdata01-OST0003-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 33 UP osc pgdata01-OST0005-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 34 UP osc pgdata01-OST0007-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 35 UP osc pgdata01-OST0009-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 36 UP osc pgdata01-OST000b-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 37 UP osc pgdata01-OST000d-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 38 UP osc pgdata01-OST000f-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 39 UP osc pgdata01-OST0002-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 40 UP osc pgdata01-OST0004-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 41 UP osc pgdata01-OST0006-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 42 UP osc pgdata01-OST0008-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 43 UP osc pgdata01-OST000a-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 44 UP osc pgdata01-OST000c-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 45 UP osc pgdata01-OST000e-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 46 UP osc pgdata01-OST0010-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 47 UP osc pgdata01-OST0013-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 48 UP osc pgdata01-OST0015-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 49 UP osc pgdata01-OST0017-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 50 UP osc pgdata01-OST0014-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 51 UP osc pgdata01-OST0016-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 52 UP osc pgdata01-OST0018-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 53 UP osc pgdata01-OST0019-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 54 UP osc pgdata01-OST001a-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 55 UP osc pgdata01-OST001b-osc-ffff88204bab6400 996b1742-82eb-281c-c322-e244672d5225 5
> > 56 UP lov pghome01-clilov-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
> > 57 UP lmv pghome01-clilmv-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
> > 58 UP mdc pghome01-MDT0000-mdc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> > 59 UP osc pghome01-OST0011-osc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> > 60 UP osc pghome01-OST0012-osc-ffff88204bb50000 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
> >
> > Please help. Any clues, advice, hints or tips are appreciated.
> >
> > --
> >
> > Vriendelijke groet,
> >
> > Ger Strikwerda
> > Chef Special
> > Rijksuniversiteit Groningen
> > Centrum voor Informatie Technologie
> > Unit Pragmatisch Systeembeheer
> >
> > Smitsborg
> > Nettelbosje 1
> > 9747 AJ Groningen
> > Tel. 050 363 9276
> >
> > "God is hard, God is fair
> > some men he gave brains, others he gave hair"
> >
> >
> > _______________________________________________
> > lustre-discuss mailing list
> > lustre-discuss at lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> >
>
--
Vriendelijke groet,
Ger Strikwerda
Chef Special
Rijksuniversiteit Groningen
Centrum voor Informatie Technologie
Unit Pragmatisch Systeembeheer
Smitsborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9276
"God is hard, God is fair
some men he gave brains, others he gave hair"