[lustre-discuss] Connection issues under load

Konstantin Kurochkin konstantin.kurochkin at clustervision.com
Fri Jan 25 12:21:56 PST 2019


Hi everyone,

On a v2.11.0 installation there is an ongoing file transfer into Lustre running
on one of the client nodes. Every now and then the process stops with I/O
errors and the file system stops responding. After a while access to the FS
resumes.

`lctl get_param *.*.state` shows a history of evictions of the targets:

> mdc.scratch-MDT0000-mdc-ffff8ecec4bf2800.state=
> current_state: FULL
> state_history:
>  - [ 1546444758, CONNECTING ]
>  - [ 1546444758, FULL ]
>  <...>
>  - [ 1546594365, DISCONN ]
>  - [ 1546594388, CONNECTING ]
>  - [ 1546594388, RECOVER ]
>  - [ 1546594388, FULL ]
>
> mgc.MGC10.149.0.183 at o2ib.state=
> current_state: FULL
> state_history:
>  - [ 1546444756, CONNECTING ]
>  - [ 1546444756, FULL ]
>  <...>
>  - [ 1546594354, DISCONN ]
>  - [ 1546594363, CONNECTING ]
>  - [ 1546594363, EVICTED ]
>  - [ 1546594363, RECOVER ]
>  - [ 1546594363, FULL ]
>
> osc.scratch-OST0000-osc-ffff8ecec4bf2800.state=
> current_state: CONNECTING
> state_history:
>  - [ 1548316546, DISCONN ]
>  - [ 1548316551, CONNECTING ]
>  <...>
>  - [ 1548316851, CONNECTING ]
>  - [ 1548316921, DISCONN ]
>  - [ 1548316926, CONNECTING ]
>  - [ 1548316996, DISCONN ]
>  - [ 1548317001, CONNECTING ]
>  - [ 1548317071, DISCONN ]
>  - [ 1548317076, CONNECTING ]
>

and so on for all the OSTs affected by the file transfer.
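
For reference, this is roughly how I collect the import states on the client
(a sketch; the device patterns assume the filesystem name "scratch"):

    # Dump current state plus recent history for every import
    lctl get_param mdc.scratch-*.state osc.scratch-*.state

    # Quick overview of just the current states, refreshed every 10 s
    watch -n 10 'lctl get_param -n osc.scratch-*.state | grep current_state'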

The OSS logs report all sorts of connectivity problems, both from Lustre:

> ods01: Jan 24 08:30:17 ods01 kernel: LustreError:
> 68843:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) ldlm_cancel from
> 10.149.255.238 at o2ib arrived at 1548314992 with bad export cookie
> 13579970295878280357
> ods01: Jan 24 08:30:17 ods01 kernel: LustreError:
> 68843:0:(ldlm_lockd.c:2201:ldlm_cancel_handler()) Skipped 919 previous
> similar messages
> ods01: Jan 24 08:32:25 ods01 kernel: LustreError:
> 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) ### lock callback timer
> expired after 150s: evicting client at 10.149.255.238 at o2ib  ns:
> filter-scratch-OST0003_UUID lock: ffff888fa2924800/0xbc75bf08ca486964 lrc:
> 4/0,0 mode: PW/PW res: [0x19b6e36:0x0:0x0].0x0 rrc: 3 type: EXT
> [0->18446744073709551615] (req 716800->720895) flags: 0x60000400020020 nid:
> 10.149.255.238 at o2ib remote: 0xe347df981d48a71c expref: 55885 pid: 38849
> timeout: 3689597 lvb_type: 0
> ods01: Jan 24 08:32:25 ods01 kernel: LustreError:
> 0:0:(ldlm_lockd.c:331:waiting_locks_callback()) Skipped 6 previous similar
> messages
> ods01: Jan 24 08:33:15 ods01 kernel: LustreError: dumping log to
> /tmp/lustre-log.1548315195.186010
> ods01: Jan 24 08:44:41 ods01 kernel: Lustre:
> 186010:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has
> failed due to network error: [sent 1548314995/real 1548315881]
> req at ffff88975cc9d100 x1619663992901664/t0(0)
> o104->scratch-OST0002 at 10.149.255.238@o2ib:15/16 lens 296/224 e 0 to 1 dl
> 1548315006 ref 1 fl Rpc:eXSN/0/ffffffff rc -11/-1
>

and LNet:

> ods01: Jan 24 05:49:26 ods01 kernel: LNetError:
> 38461:0:(o2iblnd_cb.c:3251:kiblnd_check_txs_locked()) Timed out tx:
> tx_queue, 1 seconds
> ods01: Jan 24 05:49:26 ods01 kernel: LNetError:
> 38461:0:(o2iblnd_cb.c:3251:kiblnd_check_txs_locked()) Skipped 13 previous
> similar messages
> ods01: Jan 24 05:49:26 ods01 kernel: LNetError:
> 38461:0:(o2iblnd_cb.c:3326:kiblnd_check_conns()) Timed out RDMA with
> 10.149.255.254 at o2ib (52): c:
> 0, oc: 0, rc: 63
> ods01: Jan 24 05:49:26 ods01 kernel: LNetError:
> 38461:0:(o2iblnd_cb.c:3326:kiblnd_check_conns()) Skipped 13 previous
> similar messages
>

From the client side:

> Jan 24 05:43:03 admin kernel: LustreError: 133-1:
> scratch-OST0001-osc-ffff8ecec4bf2800: BAD READ CHECKSUM: from
> 10.149.0.185 at o2ib via 0@<0:0> inode [0x200002366:0x160f1:0x0] object
> 0x0:4243598 extent [0-28671], client 0, server ac9ffb9, cksum_type 4
> Jan 24 05:43:03 admin kernel: LustreError:
> 3867:0:(osc_request.c:1681:osc_brw_redo_request()) @@@ redo for recoverable
> error -11  req at ffff8ecb19cd0000 x1621600181397872/t0(0)
> o3->scratch-OST0001-osc-ffff8ecec4bf2800 at 10.149.0.185@o2ib:6/4 lens
> 608/408 e 1 to 1 dl 1548304253 ref 2 fl Interpret:ReXM/0/0 rc 26860/26860
>

and on another client:

> Jan 24 09:03:31 c055 kernel: Lustre:
> 16641:0:(client.c:2100:ptlrpc_expire_one_request()) @@@ Request sent has
> timed out for slow reply: [sent 1548316966/real 1548316966]
> req at ffff995a6dab4e00 x1623278997809600/t0(0)
> o101->scratch-OST0024-osc-ffff995226fca000 at 10.149.0.191@o2ib:28/4 lens
> 3584/400 e 0 to 1 dl 1548317011 ref 1 fl Rpc:X/0/ffffffff rc 0/-1
> Jan 24 09:03:31 c055 kernel: Lustre:
> 16641:0:(client.c:2100:ptlrpc_expire_one_request()) Skipped 2324 previous
> similar messages
>


`opareport` reports no interconnect errors anywhere in the fabric. The problem
shows up on different nodes in the cluster and when accessing different OSTs
attached to different OSSs.
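
For completeness, the fabric check was roughly the following (a sketch; the
exact opareport options are quoted from memory):

    # Fabric-wide port error counters and slow-link report (Omni-Path FastFabric)
    opareport -o errors
    opareport -o slowlinks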

Both Lustre servers and clients are running v2.11.0 with ZFS 0.7.9; the
interconnect is Omni-Path.


This looks very much like a bug (in LNet?). What additional information would
help to pinpoint the root cause?
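
In case it helps, this is what I can easily gather on the affected client and
OSS (a sketch based on the 2.11 tools; happy to collect anything else):

    # Import details, including connection attempts and eviction counters
    lctl get_param osc.*.import mdc.*.import

    # LNet statistics and NI configuration (credits, peer credits, etc.)
    lnetctl stats show
    lnetctl net show -v

    # Dump the kernel debug log right after an eviction
    lctl dk /tmp/lustre-debug.$(hostname).log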




Kind regards,
Konstantin Kurochkin