[lustre-discuss] routed LNET connection between server and virtualized client is failing

Uwe Sauter uwe.sauter.de at gmail.com
Thu Aug 13 05:20:48 PDT 2020


Dear all,

(TL;DR at the bottom)

I have the following situation:


+----------------+
|                +--------------+                +-------------------------------------------------+
| Lustre servers |              |                |                                                 |
|    @ o2ib20    |    +---------+--------+       |  Virtualization host:                           |
|                |    |                  |       |  * Proxmox 6.2, up-to-date                      |
+----------------+    |      o2ib20      |       |  ** Debian 10.5 based                           |
                      |  10.148.0.0/16   |       |  ** Ubuntu based kernel 5.4.44-2-pve            |
                      |                  |       |  * ConnectX-3 (MCX354A-FCBT)                    |
                      +---------+--------+       |  ** 15 VFs configured                           |
                                |                |  ** SR-IOV                                      |
         +----------------------+                |  * OFED provided by distribution                |
         |                                       |                                                 |
+--------+-------+                               |                                                 |
|  LNET router   |                               |  Virtual machines and LNET routers:             |
+--------+-------+                               |  * CentOS 7.8 based                             |
         |                                       |  * OFED provided by CentOS                      |
         +----------------------+                |  * Lustre 2.12.5                                |
                                |                |  * Kernel 3.10.0-1127.18.2.el7                  |
                      +---------+--------+       |                                                 |
                      |                  |-----------------+--------------+--------------+         |
                      |      o2ib43      |       |         |              |              |         |
+----------------+    |  10.225.0.0/16   |       |         |              |              |         |
|                |    |                  |       |  +------+-----+ +------+-----+ +------+-----+   |
| Lustre servers |    +---------+--------+       |  |            | |            | |            |   |
|    @ o2ib43    |              |                |  |    VM 1    | |    VM 2    | |    VM 3    |   |
|                +--------------+                |  |  @ o2ib43  | |  @ o2ib43  | |  @ o2ib43  |   |
+----------------+                               |  |            | |            | |            |   |
                                                 |  |            | |            | |            |   |
                                                 |  +------------+ +------------+ +------------+   |
                                                 |                                                 |
                                                 +-------------------------------------------------+


Lustre @ o2ib20 is a Sonexion appliance based on CentOS 7.2 and Lustre version 2.11.0.300_cray_43_gd35e657_dirty.

Lustre @ o2ib43 is a CentOS 7.6 based setup with kernel 3.10.0-957.1.3.el7_lustre and Lustre version lustre-2.10.7.1nec-1.el7.x86_64.
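
For reference, the LNET routing is configured the usual way via module parameters; roughly like this (the NIDs below are placeholders, not the real addresses):

  # on the VMs (clients on o2ib43), /etc/modprobe.d/lustre.conf
  options lnet networks="o2ib43(ib0)" routes="o2ib20 10.225.0.1@o2ib43"

  # on the LNET router, one interface per fabric
  options lnet networks="o2ib20(ib0),o2ib43(ib1)" forwarding="enabled"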

The issue I currently see is that once more than one VM is running on the virtualization host,
access to the Lustre file system behind the LNET routers gets stuck.

The errors I can see on the VM are, e.g.:

[ 1297.470192] LustreError: 2477:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff89365cebb800
[ 1297.472058] LustreError: 2478:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff89365cebb800
[ 1297.473909] Lustre: 2490:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1597316593/real 1597316593]  req@ffff89365cee0900 x1674906532108800/t0 (0) o4->snx11167-OST001a-osc-ffff893468b2f800@10.148.240.33@o2ib20:6/4 lens 488/448 e 0 to 1 dl 1597316688 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
[ 1297.479055] Lustre: snx11167-OST001a-osc-ffff893468b2f800: Connection to snx11167-OST001a (at 10.148.240.33@o2ib20) was lost; in progress operations using this service will wait for recovery to complete
[ 1299.470205] LustreError: 2478:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff89365cebb800
[ 1299.472403] LustreError: 2477:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff89365cebb800
[ 1299.474395] Lustre: 2490:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1597316595/real 1597316595]  req@ffff89365cee0900 x1674906532108800/t0 (0) o4->snx11167-OST001a-osc-ffff893468b2f800@10.148.240.33@o2ib20:6/4 lens 488/448 e 0 to 1 dl 1597316690 ref 2 fl Rpc:eX/2/ffffffff rc 0/-1
[ 1299.479830] Lustre: snx11167-OST001a-osc-ffff893468b2f800: Connection to snx11167-OST001a (at 10.148.240.33@o2ib20) was lost; in progress operations using this service will wait for recovery to complete
[ 1299.496826] Lustre: snx11167-OST001a-osc-ffff893468b2f800: Connection restored to 10.148.240.33@o2ib20 (at 10.148.240.33@o2ib20)
[ 1301.470102] LustreError: 2478:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff89365cebb800
[ 1301.472096] LustreError: 2477:0:(events.c:200:client_bulk_callback()) event type 1, status -5, desc ffff89365cebb800
[ 1301.474135] Lustre: 2490:0:(client.c:2133:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1597316597/real 1597316597]  req@ffff89365cee0900 x1674906532108800/t0 (0) o4->snx11167-OST001a-osc-ffff893468b2f800@10.148.240.33@o2ib20:6/4 lens 488/448 e 0 to 1 dl 1597316692 ref 2 fl Rpc:eX/2/ffffffff rc 0/-1
[ 1301.479772] Lustre: snx11167-OST001a-osc-ffff893468b2f800: Connection to snx11167-OST001a (at 10.148.240.33@o2ib20) was lost; in progress operations using this service will wait for recovery to complete
[ 1301.483576] LNetError: 2486:0:(lib-move.c:1999:lnet_handle_find_routed_path()) no route to 10.148.240.33@o2ib20 from <?>



Access to the Lustre file system that is on the same IB fabric is still possible, so I suspect that this is somehow related to
LNET routing.
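
If it helps, this is the LNET state I can check on the VM side (10.148.240.33@o2ib20 is the OST NID from the log above):

  lnetctl net show                 # local network interfaces
  lnetctl route show               # route to o2ib20 via the LNET router
  lnetctl peer show                # peer state / credits
  lctl ping 10.148.240.33@o2ib20   # ping a server NID through the router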

If I run an LNET selftest between one of the LNET routers and the VMs, as explained at http://wiki.lustre.org/LNET_Selftest, I can see
that RPCs get dropped; the script I use is roughly the one shown below.
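
The selftest follows the wiki's example, with the router and one VM as the two groups (NIDs are placeholders):

  #!/bin/bash
  # lnet_selftest module must be loaded on both nodes: modprobe lnet_selftest
  export LST_SESSION=$$
  lst new_session rw_test
  lst add_group routers 10.225.0.1@o2ib43    # LNET router (placeholder NID)
  lst add_group vms     10.225.4.11@o2ib43   # one of the VMs (placeholder NID)
  lst add_batch bulk_rw
  lst add_test --batch bulk_rw --from vms --to routers brw read check=simple size=1M
  lst run bulk_rw
  lst stat routers vms & sleep 30; kill $!
  lst stop bulk_rw
  lst end_session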


Access from a client running on native hardware is possible for both file systems.



Does someone have a comparable setup? What kind of logs are needed to debug this? I'll gladly provide any info…
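
I can also capture LNET debug logs on the VM and/or the router while reproducing the hang, e.g.:

  lctl set_param debug=+net        # enable network debug messages
  lctl clear                       # clear the debug buffer
  # ... reproduce the hang ...
  lctl dk > /tmp/lnet-debug.log    # dump the debug buffer to a file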


TL;DR
* native access is possible in the same IB fabric as well as when being routed between different fabrics
* if only one VM is running, then access to both file systems is possible, too
* if more than one VM is running on the same virtualization host, then access is only possible to the file system attached to the same
fabric as the VMs
* access to the routed file system gets stuck



Any help is appreciated.

Thanks,

  Uwe Sauter


