[lustre-discuss] lustre-discuss at lists.lustre.org

John Villa jv2575 at columbia.edu
Tue Sep 20 14:59:34 PDT 2022


Hello Robert,
Sorry to hear that you're experiencing this issue. All signs here point to
an LNet configuration issue. Just to be safe, did you verify that the IB
cables connecting this storage are clear of error counters? Also, would it
be possible to recreate the error you're experiencing and capture some
debugging data? In case you're not familiar, here is how you'd generate
the debug data:

$ lctl set_param debug=+net   (you can check the current mask first with "lctl get_param debug")
$ lctl dk > /dev/null         (dump and clear the existing debug buffer)
# reproduce the error #
$ lctl dk lnet.dk             (dump the captured debug log to lnet.dk)
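
If reproducing takes a while, it may also help to enlarge the debug buffer
first so the interesting messages aren't overwritten (512 here is just an
example size, in MB):

$ lctl set_param debug_mb=512

And for the cable question above, a quick way to sweep the fabric for
error counters (assuming the standard infiniband-diags tools are installed
on a node with fabric access) would be:

$ ibqueryerrors   (reports ports with error counters above threshold)
$ perfquery -x    (extended port counters on the local HCA)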

Also, what does "lnetctl net show -v 3" show on both a failed client and
one of the storage nodes?
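
It might also be worth taking Lustre out of the picture and exercising the
LNet/o2iblnd path by itself with lnet_selftest. A rough sketch, assuming
the lnet_selftest module is available on both ends, using the NIDs from
your logs (the session, group, and batch names are arbitrary):

$ modprobe lnet_selftest
$ export LST_SESSION=$$
$ lst new_session rdma_check
$ lst add_group clients 10.200.200.113@o2ib
$ lst add_group servers 10.200.200.54@o2ib
$ lst add_batch bulk
$ lst add_test --batch bulk --from clients --to servers brw write size=1M
$ lst run bulk
$ lst stat clients servers   (Ctrl-C after a minute or so)
$ lst end_session

If that reproduces the same RDMA timeouts, it points at LNet/IB rather
than the Lustre layers above it.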

Thank You,
John Villa

On Tue, Sep 20, 2022 at 5:36 PM Robert Anderson via lustre-discuss <
lustre-discuss at lists.lustre.org> wrote:

> Looking for any suggestions on where to look, or what to change, to gain
> more information on RDMA timeout errors.
>
> We have a CentOS (release 7.9.2009, kernel 3.10.0-1160.66.1.el7.x86_64)
> 4-system (2 MDS + 2 OSS) ZFS-backed Lustre 2.12.9 server on AMD (EPYC
> 7402P).  Each of the four has a single-connection EDR card (Mellanox
> Technologies MT27800 Family [ConnectX-5]).
>
> This Lustre storage serves 54 Linux compute node clients (heterogeneous
> hardware) running the same CentOS & kernel, over 3 EDR switches (4 cables
> each from the center switch to the edge switches). The hardware is a mix
> of Intel CPUs & a few AMD, with at least 5 variations of EDR and FDR
> single-connection cards.
>
> All of these clients have been successfully using a ClusterStor2000
> appliance, as well as a BeeGFS (7.2.6) storage cluster, over the same
> fabric.  The IB network is not reporting any errors and is running at
> full rate to each client.  The only errors appear at the LNet level, and
> they start on the client side.  The server eventually times out the
> client for no response.
>
> We set up this server with the help of a very competent consultant and
> have had them back to help debug this problem, but we are having
> difficulty finding anything to log or look at that helps.
>
>
> On some nodes (mostly older ones) an RDMA timeout occurs on the client
> that locks up the client's communication to one of the Lustre targets.
> We found that Bonnie++ and fio tests could trigger these client lockups
> to one of the OSS nodes.  Swapping hardware between server nodes didn't
> reveal any pattern, and eventually we moved in an entirely new node,
> which failed in the same manner.  About 2/3 of the clients can be made
> to lock up, but no pattern in their individual hardware or network
> location seems to exist.  The remaining nodes seem to work perfectly
> with this new Lustre storage.  The same sort of client RDMA lockup has
> also been triggered to an MDS by a simple "ls" of the Lustre root after
> mounting.
>
>
> Below are the client and MDS logs from that simplest of cases.
>
>  Sep 15 10:05:56 node113 systemd: Started Session 85 of user root.
> Sep 15 10:06:08 node113 kernel: Request for unknown module key 'DKMS module signing key: b4c2eda7b66318967d50ba12b9cb18dcee995f32' err -11
> Sep 15 10:06:08 node113 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 24, npartitions: 2
> Sep 15 10:06:09 node113 kernel: alg: No test for adler32 (adler32-zlib)
> Sep 15 10:06:09 node113 kernel: Request for unknown module key 'DKMS module signing key: b4c2eda7b66318967d50ba12b9cb18dcee995f32' err -11
> Sep 15 10:06:09 node113 kernel: Request for unknown module key 'DKMS module signing key: b4c2eda7b66318967d50ba12b9cb18dcee995f32' err -11
> Sep 15 10:06:09 node113 kernel: Lustre: Lustre: Build Version: 2.12.9
> Sep 15 10:06:09 node113 kernel: Request for unknown module key 'DKMS module signing key: b4c2eda7b66318967d50ba12b9cb18dcee995f32' err -11
> Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module signing key: b4c2eda7b66318967d50ba12b9cb18dcee995f32' err -11
> Sep 15 10:06:10 node113 kernel: LNet: Using FMR for registration
> Sep 15 10:06:10 node113 kernel: LNet: Added LNI 10.200.200.113 at o2ib [8/256/0/180]
> Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module signing key: b4c2eda7b66318967d50ba12b9cb18dcee995f32' err -11
> Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module signing key: b4c2eda7b66318967d50ba12b9cb18dcee995f32' err -11
> Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module signing key: b4c2eda7b66318967d50ba12b9cb18dcee995f32' err -11
> Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module signing key: b4c2eda7b66318967d50ba12b9cb18dcee995f32' err -11
> Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module signing key: b4c2eda7b66318967d50ba12b9cb18dcee995f32' err -11
> Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module signing key: b4c2eda7b66318967d50ba12b9cb18dcee995f32' err -11
> Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module signing key: b4c2eda7b66318967d50ba12b9cb18dcee995f32' err -11
> Sep 15 10:06:10 node113 kernel: Request for unknown module key 'DKMS module signing key: b4c2eda7b66318967d50ba12b9cb18dcee995f32' err -11
> Sep 15 10:06:11 node113 kernel: Lustre: Mounted lz01-client
> Sep 15 10:06:56 node113 kernel: Lustre: 13799:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1663250809/real 1663250809]  req at ffff899095d04800 x1744044838356992/t0(0) o101->lz01-MDT0000-mdc-ffff8990be2fb800 at 10.200.200.54@o2ib:12/10 lens 576/968 e 0 to 1 dl 1663250816 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
> Sep 15 10:06:56 node113 kernel: Lustre: lz01-MDT0000-mdc-ffff8990be2fb800: Connection to lz01-MDT0000 (at 10.200.200.54 at o2ib) was lost; in progress operations using this service will wait for recovery to complete
> Sep 15 10:07:02 node113 kernel: LNetError: 13693:0:(o2iblnd_cb.c:3373:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
> Sep 15 10:07:02 node113 kernel: LNetError: 13693:0:(o2iblnd_cb.c:3448:kiblnd_check_conns()) Timed out RDMA with 10.200.200.54 at o2ib (12): c: 4, oc: 0, rc: 8
> Sep 15 10:07:03 node113 kernel: Lustre: 13725:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1663250816/real 0]  req at ffff899095d04c80 x1744044838357056/t0(0) o400->MGC10.200.200.54 at o2ib@10.200.200.54 at o2ib:26/25 lens 224/224 e 0 to 1 dl 1663250823 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
> Sep 15 10:07:03 node113 kernel: LustreError: 166-1: MGC10.200.200.54 at o2ib: Connection to MGS (at 10.200.200.54 at o2ib) was lost; in progress operations using this service will fail
> Sep 15 10:07:09 node113 kernel: LNetError: 13693:0:(o2iblnd_cb.c:3373:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
> Sep 15 10:07:09 node113 kernel: LNetError: 13693:0:(o2iblnd_cb.c:3448:kiblnd_check_conns()) Timed out RDMA with 10.200.200.54 at o2ib (5): c: 7, oc: 0, rc: 8
> Sep 15 10:07:10 node113 systemd-logind: New session 86 of user root.
> Sep 15 10:07:10 node113 systemd: Started Session 86 of user root.
> Sep 15 10:07:54 node113 systemd-logind: New session 87 of user root.
> Sep 15 10:07:54 node113 systemd: Started Session 87 of user root.
> Sep 15 10:10:34 node113 systemd-logind: Removed session 86.
> Sep 15 10:12:47 node113 kernel: LNetError: 13693:0:(o2iblnd_cb.c:3373:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
> Sep 15 10:12:47 node113 kernel: LNetError: 13693:0:(o2iblnd_cb.c:3448:kiblnd_check_conns()) Timed out RDMA with 10.200.200.54 at o2ib (5): c: 7, oc: 0, rc: 8
> Sep 15 10:13:37 node113 systemd-logind: New session 88 of user root.
> Sep 15 10:13:37 node113 systemd: Started Session 88 of user root.
> Sep 15 10:13:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping 12345-10.200.200.54 at o2ib: late network completion
> Sep 15 10:14:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping 12345-10.200.200.54 at o2ib: late network completion
> Sep 15 10:14:50 node113 systemd-logind: Removed session 88.
> Sep 15 10:15:33 node113 kernel: LNetError: 13693:0:(o2iblnd_cb.c:3373:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
> Sep 15 10:15:33 node113 kernel: LNetError: 13693:0:(o2iblnd_cb.c:3448:kiblnd_check_conns()) Timed out RDMA with 10.200.200.54 at o2ib (106): c: 8, oc: 0, rc: 8
> Sep 15 10:15:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping 12345-10.200.200.54 at o2ib: late network completion
> Sep 15 10:15:53 node113 systemd-logind: New session 89 of user root.
> Sep 15 10:15:53 node113 systemd: Started Session 89 of user root.
> Sep 15 10:16:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping 12345-10.200.200.54 at o2ib: late network completion
> Sep 15 10:17:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping 12345-10.200.200.54 at o2ib: late network completion
> Sep 15 10:17:58 node113 systemd-logind: Removed session 89.
> Sep 15 10:18:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping 12345-10.200.200.54 at o2ib: late network completion
> Sep 15 10:19:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping 12345-10.200.200.54 at o2ib: late network completion
> Sep 15 10:19:53 node113 systemd-logind: New session 90 of user root.
> Sep 15 10:19:53 node113 systemd: Started Session 90 of user root.
> Sep 15 10:20:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping 12345-10.200.200.54 at o2ib: late network completion
> Sep 15 10:21:52 node113 kernel: LustreError: 14973:0:(lmv_obd.c:1415:lmv_statfs()) lz01-MDT0000-mdc-ffff8990be2fb800: can't stat MDS #0: rc = -4
> Sep 15 10:21:56 node113 kernel: LustreError: 14977:0:(lmv_obd.c:1415:lmv_statfs()) lz01-MDT0000-mdc-ffff8990be2fb800: can't stat MDS #0: rc = -4
> Sep 15 10:22:29 node113 systemd-logind: New session 91 of user root.
> Sep 15 10:22:29 node113 systemd: Started Session 91 of user root.
> Sep 15 10:22:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) ping 12345-10.200.200.54 at o2ib: late network completion
> Sep 15 10:22:42 node113 kernel: LNet: 14296:0:(api-ni.c:4175:lnet_ping()) Skipped 1 previous similar message
> Sep 15 10:22:50 node113 systemd-logind: Removed session 91.
> Sep 15 10:23:30 node113 systemd-logind: New session 92 of user root.
> Sep 15 10:23:30 node113 systemd: Started Session 92 of user root.
>
>
>
>
>  Sep 15 10:01:04 storage04 systemd: Removed slice User Slice of root.
> Sep 15 10:06:16 storage04 kernel: Lustre: MGS: Connection restored to 5b462af8-e997-25fa-7aac-62ef6f834520 (at 10.200.200.113 at o2ib)
> Sep 15 10:07:03 storage04 kernel: Lustre: lz01-MDT0000: Client fd07e75b-dbc1-20b1-ecbf-6f3a96a62d2a (at 10.200.200.113 at o2ib) reconnecting
> Sep 15 10:07:10 storage04 kernel: Lustre: MGS: Client 01e3bd51-26ab-f2bb-6515-6afe3a181876 (at 10.200.200.113 at o2ib) reconnecting
> Sep 15 10:10:01 storage04 systemd: Created slice User Slice of root.
> Sep 15 10:10:01 storage04 systemd: Started Session 2738 of user root.
> Sep 15 10:10:01 storage04 systemd: Removed slice User Slice of root.
> Sep 15 10:10:44 storage04 systemd: Created slice User Slice of root.
> Sep 15 10:10:44 storage04 systemd-logind: New session 2739 of user root.
> Sep 15 10:10:44 storage04 systemd: Started Session 2739 of user root.
> Sep 15 10:10:51 storage04 kernel: Lustre: lz01-MDT0000: haven't heard from client fd07e75b-dbc1-20b1-ecbf-6f3a96a62d2a (at 10.200.200.113 at o2ib) in 228 seconds. I think it's dead, and I am evicting it. exp ffff8fc5496cd800, cur 1663251051 expire 1663250901 last 1663250823
> Sep 15 10:10:51 storage04 kernel: Lustre: Skipped 1 previous similar message
> Sep 15 10:11:07 storage04 kernel: Lustre: MGS: haven't heard from client 01e3bd51-26ab-f2bb-6515-6afe3a181876 (at 10.200.200.113 at o2ib) in 237 seconds. I think it's dead, and I am evicting it. exp ffff8fc6f0df9800, cur 1663251067 expire 1663250917 last 1663250830
> Sep 15 10:12:07 storage04 systemd-logind: Removed session 2739.
>
>
> Robert E. Anderson
>
> University of NH / Research Computing Center / Data Center Operations
>
>
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
>
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
>


-- 
Sincerely,
John Villa
Lead Research Systems Engineer
347-213-0339