[lustre-discuss] client fails to mount

Russell Dekema dekemar at umich.edu
Mon Apr 24 07:14:02 PDT 2017


At first glance, this sounds like your InfiniBand subnet manager (SM) may
be down or malfunctioning. In that case, nodes which were already up
while the subnet manager was working will continue to be able to
communicate over IB, but nodes which reboot after the SM goes down
will not.

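You can test this theory by running the 'ibv_devinfo' command on one
of your rebooted nodes. If the relevant IB port is in state PORT_INIT
rather than PORT_ACTIVE, this confirms there is a problem with your
subnet manager: PORT_INIT means the link is physically up but no SM
has configured the port yet.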
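
If you want to run that check across many rebooted nodes, a small script
can do the reading for you. The following is only a rough sketch in
Python, assuming the usual 'ibv_devinfo' text layout ("hca_id:", "port:"
and "state:" lines). The function names and the sample text (including
the device name mlx4_0) are made up for illustration and are not taken
from the affected cluster.

```python
import re
import subprocess


def ports_not_active(devinfo_text):
    """Return (hca, port, state) for every port whose state is not PORT_ACTIVE."""
    problems = []
    hca = None
    port = None
    for raw in devinfo_text.splitlines():
        line = raw.strip()
        m = re.match(r"hca_id:\s*(\S+)", line)
        if m:
            hca = m.group(1)
            continue
        # "port:" only; lines such as "port_lid:" do not match this pattern.
        m = re.match(r"port:\s*(\d+)", line)
        if m:
            port = int(m.group(1))
            continue
        m = re.match(r"state:\s*(PORT_\w+)", line)
        if m and m.group(1) != "PORT_ACTIVE":
            problems.append((hca, port, m.group(1)))
    return problems


def check_local_node():
    """Run ibv_devinfo on this node (assumes it is installed and on PATH)."""
    out = subprocess.run(["ibv_devinfo"], capture_output=True,
                         text=True, check=True).stdout
    return ports_not_active(out)


# Illustrative sample only (hypothetical device name, not from the thread).
SAMPLE = """\
hca_id: mlx4_0
        port:   1
                state:                  PORT_INIT (2)
        port:   2
                state:                  PORT_ACTIVE (4)
"""
```

Calling ports_not_active(SAMPLE) flags port 1 of the sample device. On a
real node, check_local_node() would return an empty list when every port
is PORT_ACTIVE. A port stuck in PORT_INIT points at the subnet manager;
other states (for example PORT_DOWN) suggest a cabling or link problem
instead.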

Sincerely,
Rusty Dekema




On Mon, Apr 24, 2017 at 9:57 AM, Strikwerda, Ger
<g.j.c.strikwerda at rug.nl> wrote:
> Hi everybody,
>
> Here at the University of Groningen we are now experiencing a strange Lustre
> error. If a client reboots, it fails to mount the Lustre storage. The client
> is not able to reach the MGS service. The storage and nodes communicate
> over IB and, until now, have done so without any problems. It looks like an
> issue inside LNET. Clients cannot LNET ping/connect to the metadata and/or
> storage servers, but the clients are able to LNET ping each other. Clients
> which have not been rebooted are working fine and still have their mounts
> on our Lustre filesystem.
>
> Lustre client log:
>
> Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
> LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]
>
> LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log
> 'pgdata01-client' failed (-5). This may be the result of communication
> errors between this node and the MGS, a bad configuration, or other errors.
> See the syslog for more information.
> LustreError: 3812:0:(llite_lib.c:1046:ll_fill_super()) Unable to process
> log: -5
> Lustre: Unmounted pgdata01-client
> LustreError: 3812:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount
> (-5)
> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib
> rejected: consumer defined fatal error
> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous
> similar message
> Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent
> has failed due to network error: [sent 1492789626/real 1492789626]
> req@ffff88105af2cc00 x1565303228072004/t0(0)
> o250->MGC172.23.55.211@o2ib@172.23.55.212@o2ib:26/25 lens 400/544 e 0 to 1
> dl 1492789631 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
> Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1
> previous similar message
> LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit
> expired   req@ffff882041ffc000 x1565303228071996/t0(0)
> o101->MGC172.23.55.211@o2ib@172.23.55.211@o2ib:26/25 lens 328/344 e 0 to 0
> dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
> LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 2
> previous similar messages
> LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log
> 'pghome01-client' failed (-5). This may be the result of communication
> errors between this node and the MGS, a bad configuration, or other errors.
> See the syslog for more information.
> LustreError: 3826:0:(llite_lib.c:1046:ll_fill_super()) Unable to process
> log: -5
>
> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib
> rejected: consumer defined fatal error
> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous
> similar message
> LNet: 3755:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
> 172.23.55.211@o2ib failed: 5
> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib
> rejected: consumer defined fatal error
> LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous
> similar message
> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
> messages for 172.23.55.211@o2ib: connection failed
> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
> messages for 172.23.55.212@o2ib: connection failed
> LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
> 172.23.55.212@o2ib failed: 5
> LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Skipped 17 previous
> similar messages
> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
> messages for 172.23.55.211@o2ib: connection failed
> LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
> 172.23.55.212@o2ib failed: 5
> LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
> messages for 172.23.55.212@o2ib: connection failed
>
> LNET ping of the first metadata node:
>
> [root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
> failed to ping 172.23.55.211@o2ib: Input/output error
>
> LNET ping of the second metadata node:
>
> [root@pg-gpu01 ~]# lctl ping 172.23.55.212@o2ib
> failed to ping 172.23.55.212@o2ib: Input/output error
>
> LNET ping of a random compute node:
>
> [root@pg-gpu01 ~]# lctl ping 172.23.52.5@o2ib
> 12345-0@lo
> 12345-172.23.52.5@o2ib
>
> LNET ping to OST01:
>
> [root@pg-gpu01 ~]# lctl ping 172.23.55.201@o2ib
> failed to ping 172.23.55.201@o2ib: Input/output error
>
> LNET ping to OST02:
>
> [root@pg-gpu01 ~]# lctl ping 172.23.55.202@o2ib
> failed to ping 172.23.55.202@o2ib: Input/output error
>
> 'Normal' pings (at the IP level) work fine:
>
> [root@pg-gpu01 ~]# ping 172.23.55.201
> PING 172.23.55.201 (172.23.55.201) 56(84) bytes of data.
> 64 bytes from 172.23.55.201: icmp_seq=1 ttl=64 time=0.741 ms
>
> [root@pg-gpu01 ~]# ping 172.23.55.202
> PING 172.23.55.202 (172.23.55.202) 56(84) bytes of data.
> 64 bytes from 172.23.55.202: icmp_seq=1 ttl=64 time=0.704 ms
>
> lctl on a rebooted node:
>
> [root@pg-gpu01 ~]# lctl dl
>
> lctl on a node that has not been rebooted:
>
> [root@pg-node005 ~]# lctl dl
>   0 UP mgc MGC172.23.55.211@o2ib 94bd1c8a-512f-b920-9a4e-a6aced3d386d 5
>   1 UP lov pgtemp01-clilov-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 4
>   2 UP lmv pgtemp01-clilmv-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 4
>   3 UP mdc pgtemp01-MDT0000-mdc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>   4 UP osc pgtemp01-OST0001-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>   5 UP osc pgtemp01-OST0003-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>   6 UP osc pgtemp01-OST0005-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>   7 UP osc pgtemp01-OST0007-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>   8 UP osc pgtemp01-OST0009-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>   9 UP osc pgtemp01-OST000b-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  10 UP osc pgtemp01-OST000d-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  11 UP osc pgtemp01-OST000f-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  12 UP osc pgtemp01-OST0011-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  13 UP osc pgtemp01-OST0002-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  14 UP osc pgtemp01-OST0004-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  15 UP osc pgtemp01-OST0006-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  16 UP osc pgtemp01-OST0008-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  17 UP osc pgtemp01-OST000a-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  18 UP osc pgtemp01-OST000c-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  19 UP osc pgtemp01-OST000e-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  20 UP osc pgtemp01-OST0010-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  21 UP osc pgtemp01-OST0012-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  22 UP osc pgtemp01-OST0013-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  23 UP osc pgtemp01-OST0015-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  24 UP osc pgtemp01-OST0017-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  25 UP osc pgtemp01-OST0014-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  26 UP osc pgtemp01-OST0016-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  27 UP osc pgtemp01-OST0018-osc-ffff88206906d400
> 281c441f-8aa3-ab56-8812-e459d308f47c 5
>  28 UP lov pgdata01-clilov-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 4
>  29 UP lmv pgdata01-clilmv-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 4
>  30 UP mdc pgdata01-MDT0000-mdc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  31 UP osc pgdata01-OST0001-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  32 UP osc pgdata01-OST0003-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  33 UP osc pgdata01-OST0005-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  34 UP osc pgdata01-OST0007-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  35 UP osc pgdata01-OST0009-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  36 UP osc pgdata01-OST000b-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  37 UP osc pgdata01-OST000d-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  38 UP osc pgdata01-OST000f-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  39 UP osc pgdata01-OST0002-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  40 UP osc pgdata01-OST0004-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  41 UP osc pgdata01-OST0006-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  42 UP osc pgdata01-OST0008-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  43 UP osc pgdata01-OST000a-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  44 UP osc pgdata01-OST000c-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  45 UP osc pgdata01-OST000e-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  46 UP osc pgdata01-OST0010-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  47 UP osc pgdata01-OST0013-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  48 UP osc pgdata01-OST0015-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  49 UP osc pgdata01-OST0017-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  50 UP osc pgdata01-OST0014-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  51 UP osc pgdata01-OST0016-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  52 UP osc pgdata01-OST0018-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  53 UP osc pgdata01-OST0019-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  54 UP osc pgdata01-OST001a-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  55 UP osc pgdata01-OST001b-osc-ffff88204bab6400
> 996b1742-82eb-281c-c322-e244672d5225 5
>  56 UP lov pghome01-clilov-ffff88204bb50000
> 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
>  57 UP lmv pghome01-clilmv-ffff88204bb50000
> 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
>  58 UP mdc pghome01-MDT0000-mdc-ffff88204bb50000
> 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>  59 UP osc pghome01-OST0011-osc-ffff88204bb50000
> 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>  60 UP osc pghome01-OST0012-osc-ffff88204bb50000
> 9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
>
> Please help; any clues/advice/hints/tips are appreciated.
>
> --
>
> Vriendelijke groet,
>
> Ger Strikwerda
> Chef Special
> Rijksuniversiteit Groningen
> Centrum voor Informatie Technologie
> Unit Pragmatisch Systeembeheer
>
> Smitsborg
> Nettelbosje 1
> 9747 AJ Groningen
> Tel. 050 363 9276
>
> "God is hard, God is fair
>  some men he gave brains, others he gave hair"
>

