[lustre-discuss] client fails to mount

Strikwerda, Ger g.j.c.strikwerda at rug.nl
Mon Apr 24 06:57:47 PDT 2017


 Hi everybody,

Here at the University of Groningen we are experiencing a strange
Lustre error. If a client reboots, it fails to mount the Lustre storage.
The client is not able to reach the MGS. The storage and the nodes
communicate over IB, and until now this has worked without any problems.
It looks like an issue inside LNET: rebooted clients cannot LNET
ping/connect to the metadata and/or storage servers, but they are able to
LNET ping each other. Clients that have not been rebooted are working fine
and still have their mounts on our Lustre filesystem.

Lustre client log:

Lustre: Lustre: Build Version: 2.5.3-RC1--PRISTINE-2.6.32-573.el6.x86_64
LNet: Added LNI 172.23.54.51@o2ib [8/256/0/180]

LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log
'pgdata01-client' failed (-5). This may be the result of communication
errors between this node and the MGS, a bad configuration, or other errors.
See the syslog for more information.
LustreError: 3812:0:(llite_lib.c:1046:ll_fill_super()) Unable to process
log: -5
Lustre: Unmounted pgdata01-client
LustreError: 3812:0:(obd_mount.c:1325:lustre_fill_super()) Unable to mount
(-5)
LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib
rejected: consumer defined fatal error
LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous
similar message
Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) @@@ Request sent
has failed due to network error: [sent 1492789626/real 1492789626]
req@ffff88105af2cc00 x1565303228072004/t0(0)
o250->MGC172.23.55.211@o2ib@172.23.55.212@o2ib:26/25 lens 400/544 e 0 to 1
dl 1492789631 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
Lustre: 3765:0:(client.c:1918:ptlrpc_expire_one_request()) Skipped 1
previous similar message
LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send
limit expired   req@ffff882041ffc000 x1565303228071996/t0(0)
o101->MGC172.23.55.211@o2ib@172.23.55.211@o2ib:26/25 lens 328/344 e 0 to 0
dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 3826:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 2
previous similar messages
LustreError: 15c-8: MGC172.23.55.211@o2ib: The configuration from log
'pghome01-client' failed (-5). This may be the result of communication
errors between this node and the MGS, a bad configuration, or other errors.
See the syslog for more information.
LustreError: 3826:0:(llite_lib.c:1046:ll_fill_super()) Unable to process
log: -5

LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212@o2ib
rejected: consumer defined fatal error
LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous
similar message
LNet: 3755:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
172.23.55.211@o2ib failed: 5
LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib
rejected: consumer defined fatal error
LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous
similar message
LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
messages for 172.23.55.211@o2ib: connection failed
LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
messages for 172.23.55.212@o2ib: connection failed
LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
172.23.55.212@o2ib failed: 5
LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Skipped 17 previous
similar messages
LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
messages for 172.23.55.211@o2ib: connection failed
LNet: 3754:0:(o2iblnd_cb.c:475:kiblnd_rx_complete()) Rx from
172.23.55.212@o2ib failed: 5
LNet: 2882:0:(o2iblnd_cb.c:2072:kiblnd_peer_connect_failed()) Deleting
messages for 172.23.55.212@o2ib: connection failed
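
A rough way to see which peers reject the connection is to tally the
kiblnd_rejected lines in the log above. The following is a minimal Python
sketch and not part of the original report: the name rejected_peers and the
regular expression are assumptions based only on the log format shown here.
It accepts both "@" and the " at " spelling used by the list archive, and it
does not count "Skipped N previous similar message" lines, so the totals are
lower bounds.

```python
import re
from collections import Counter

# Matches e.g. "kiblnd_rejected()) 172.23.55.212@o2ib rejected: <reason>".
# Assumption: the format is inferred from the log excerpt in this post only.
REJECTED = re.compile(
    r"kiblnd_rejected\(\)\) (\d+(?:\.\d+){3})(?:@| at )(\w+) rejected: "
    r"(.+?)(?= (?:LNet|Lustre)(?:Error)?:|$)"
)

# Short excerpt in the same (mail-wrapped) shape as the log in this post.
SAMPLE = """\
LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.212 at o2ib
rejected: consumer defined fatal error
LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) Skipped 1 previous
similar message
LNetError: 2882:0:(o2iblnd_cb.c:2587:kiblnd_rejected()) 172.23.55.211@o2ib
rejected: consumer defined fatal error
"""


def rejected_peers(log_text):
    """Count (NID, reason) pairs for every kiblnd_rejected line."""
    # Undo the mail wrapping by collapsing all whitespace to single spaces.
    flat = " ".join(log_text.split())
    counts = Counter()
    for ip, net, reason in REJECTED.findall(flat):
        counts[(f"{ip}@{net}", reason)] += 1
    return counts
```

Calling rejected_peers(SAMPLE) returns one entry per rejecting NID together
with the reason string reported by o2iblnd.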

LNET ping of the first metadata node:

[root@pg-gpu01 ~]# lctl ping 172.23.55.211@o2ib
failed to ping 172.23.55.211@o2ib: Input/output error

LNET ping of the second metadata node:

[root@pg-gpu01 ~]# lctl ping 172.23.55.212@o2ib
failed to ping 172.23.55.212@o2ib: Input/output error

LNET ping of a random compute node:

[root@pg-gpu01 ~]# lctl ping 172.23.52.5@o2ib
12345-0@lo
12345-172.23.52.5@o2ib

LNET ping of OST01:

[root@pg-gpu01 ~]# lctl ping 172.23.55.201@o2ib
failed to ping 172.23.55.201@o2ib: Input/output error

LNET ping of OST02:

[root@pg-gpu01 ~]# lctl ping 172.23.55.202@o2ib
failed to ping 172.23.55.202@o2ib: Input/output error
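
The manual pings above can be repeated for the whole list of NIDs in one
go. This is a minimal Python sketch, assuming lctl is on the PATH and that
a failed ping exits non-zero (the check for the "failed to ping" text is a
fallback in case it does not). The names lnet_ping, survey and the labels in
NIDS are hypothetical; the addresses are the ones used in this post.

```python
import subprocess

# NIDs taken from this post; the labels are only for readability.
NIDS = {
    "metadata node 1": "172.23.55.211@o2ib",
    "metadata node 2": "172.23.55.212@o2ib",
    "compute node": "172.23.52.5@o2ib",
    "OST01": "172.23.55.201@o2ib",
    "OST02": "172.23.55.202@o2ib",
}


def lnet_ping(nid, run=subprocess.run):
    """Return True if `lctl ping <nid>` appears to succeed."""
    result = run(["lctl", "ping", nid], capture_output=True, text=True)
    output = (result.stdout or "") + (result.stderr or "")
    # Assumption: failure is signalled by a non-zero exit status; the text
    # match mirrors the "failed to ping ..." message shown in this post.
    return result.returncode == 0 and "failed to ping" not in output


def survey(nids, run=subprocess.run):
    """Ping every NID and return a {nid: reachable} mapping."""
    return {nid: lnet_ping(nid, run) for nid in nids}
```

On a client, survey(NIDS.values()) gives a quick reachable/unreachable
overview; the run parameter exists only so the functions can be exercised
without a Lustre installation.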

'Normal' pings (at the IP level) work fine:

[root@pg-gpu01 ~]# ping 172.23.55.201
PING 172.23.55.201 (172.23.55.201) 56(84) bytes of data.
64 bytes from 172.23.55.201: icmp_seq=1 ttl=64 time=0.741 ms

[root@pg-gpu01 ~]# ping 172.23.55.202
PING 172.23.55.202 (172.23.55.202) 56(84) bytes of data.
64 bytes from 172.23.55.202: icmp_seq=1 ttl=64 time=0.704 ms

lctl dl on a rebooted node:

[root@pg-gpu01 ~]# lctl dl

lctl dl on a node that has not been rebooted:

[root@pg-node005 ~]# lctl dl
  0 UP mgc MGC172.23.55.211@o2ib 94bd1c8a-512f-b920-9a4e-a6aced3d386d 5
  1 UP lov pgtemp01-clilov-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 4
  2 UP lmv pgtemp01-clilmv-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 4
  3 UP mdc pgtemp01-MDT0000-mdc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
  4 UP osc pgtemp01-OST0001-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
  5 UP osc pgtemp01-OST0003-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
  6 UP osc pgtemp01-OST0005-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
  7 UP osc pgtemp01-OST0007-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
  8 UP osc pgtemp01-OST0009-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
  9 UP osc pgtemp01-OST000b-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 10 UP osc pgtemp01-OST000d-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 11 UP osc pgtemp01-OST000f-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 12 UP osc pgtemp01-OST0011-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 13 UP osc pgtemp01-OST0002-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 14 UP osc pgtemp01-OST0004-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 15 UP osc pgtemp01-OST0006-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 16 UP osc pgtemp01-OST0008-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 17 UP osc pgtemp01-OST000a-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 18 UP osc pgtemp01-OST000c-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 19 UP osc pgtemp01-OST000e-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 20 UP osc pgtemp01-OST0010-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 21 UP osc pgtemp01-OST0012-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 22 UP osc pgtemp01-OST0013-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 23 UP osc pgtemp01-OST0015-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 24 UP osc pgtemp01-OST0017-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 25 UP osc pgtemp01-OST0014-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 26 UP osc pgtemp01-OST0016-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 27 UP osc pgtemp01-OST0018-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 28 UP lov pgdata01-clilov-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 4
 29 UP lmv pgdata01-clilmv-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 4
 30 UP mdc pgdata01-MDT0000-mdc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 31 UP osc pgdata01-OST0001-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 32 UP osc pgdata01-OST0003-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 33 UP osc pgdata01-OST0005-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 34 UP osc pgdata01-OST0007-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 35 UP osc pgdata01-OST0009-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 36 UP osc pgdata01-OST000b-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 37 UP osc pgdata01-OST000d-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 38 UP osc pgdata01-OST000f-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 39 UP osc pgdata01-OST0002-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 40 UP osc pgdata01-OST0004-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 41 UP osc pgdata01-OST0006-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 42 UP osc pgdata01-OST0008-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 43 UP osc pgdata01-OST000a-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 44 UP osc pgdata01-OST000c-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 45 UP osc pgdata01-OST000e-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 46 UP osc pgdata01-OST0010-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 47 UP osc pgdata01-OST0013-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 48 UP osc pgdata01-OST0015-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 49 UP osc pgdata01-OST0017-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 50 UP osc pgdata01-OST0014-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 51 UP osc pgdata01-OST0016-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 52 UP osc pgdata01-OST0018-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 53 UP osc pgdata01-OST0019-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 54 UP osc pgdata01-OST001a-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 55 UP osc pgdata01-OST001b-osc-ffff88204bab6400
996b1742-82eb-281c-c322-e244672d5225 5
 56 UP lov pghome01-clilov-ffff88204bb50000
9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
 57 UP lmv pghome01-clilmv-ffff88204bb50000
9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 4
 58 UP mdc pghome01-MDT0000-mdc-ffff88204bb50000
9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
 59 UP osc pghome01-OST0011-osc-ffff88204bb50000
9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
 60 UP osc pghome01-OST0012-osc-ffff88204bb50000
9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
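
To compare the device list of a working node with that of a rebooted one
without reading every line, the lctl dl output can be summarised per device
type. A minimal Python sketch follows; the name summarize_devices is
hypothetical, and the pattern assumes the column layout (index, status,
type, name, UUID, reference count) and the UUID shape seen in the output
above. It tolerates the mail wrapping and the archive's " at " spelling.

```python
import re
from collections import Counter

# One device entry: index, status, type, name, UUID, reference count.
# Assumption: layout and UUID shape are inferred from this post only.
DEVICE = re.compile(
    r"\b(\d+) (\w+) (\w+) (\S.*?) "
    r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}) (\d+)\b"
)

# Short excerpt in the same (mail-wrapped) shape as the list in this post.
SAMPLE = """\
  0 UP mgc MGC172.23.55.211 at o2ib 94bd1c8a-512f-b920-9a4e-a6aced3d386d 5
  1 UP lov pgtemp01-clilov-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 4
  4 UP osc pgtemp01-OST0001-osc-ffff88206906d400
281c441f-8aa3-ab56-8812-e459d308f47c 5
 59 UP osc pghome01-OST0011-osc-ffff88204bb50000
9ae8f2a9-1cdf-901f-160c-66f70e4c10d1 5
"""


def summarize_devices(dl_output):
    """Return (devices per type, osc devices per filesystem name)."""
    # Undo the mail wrapping by collapsing all whitespace to single spaces.
    flat = " ".join(dl_output.split())
    by_type = Counter()
    osc_per_fs = Counter()
    for _index, _status, dev_type, name, _uuid, _refs in DEVICE.findall(flat):
        by_type[dev_type] += 1
        if dev_type == "osc":
            # Device names start with the filesystem name, e.g. "pgtemp01-...".
            osc_per_fs[name.split("-")[0]] += 1
    return by_type, osc_per_fs
```

For an empty listing, as on the rebooted node, both counters come back
empty.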

Please help; any clues/advice/hints/tips are appreciated.

-- 

Vriendelijke groet,

Ger Strikwerda
Chef Special
Rijksuniversiteit Groningen
Centrum voor Informatie Technologie
Unit Pragmatisch Systeembeheer

Smitsborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9276
"God is hard, God is fair
 some men he gave brains, others he gave hair"