[lustre-discuss] Question About Mellanox-RDMA On Lustre

王烁斌 w14767780617 at 163.com
Tue Jun 6 02:24:33 PDT 2023


Hi~



I want to set up a two-node Lustre server environment and use RDMA between the nodes to improve server response performance.


After installing Lustre and the corresponding RDMA-capable drivers, I ran into a problem while deploying the Lustre file system.


When mounting an MDT on the second node, the following error occurred:


[root@172-0-37-83 ~]# mount -t lustre /dev/disk/by-id/scsi-3600b3420371420b645dde619060000aa /mnt/tfs/mgs2
mount.lustre: mount /dev/mapper/mpathcj at /mnt/tfs/mgs2 failed: Connection timed out
Log information:
Jun  6 04:44:25 localhost kernel: LNetError: 23212:0:(o2iblnd.c:819:kiblnd_create_conn()) cmid HCA(mlx5_0), kib_dev(ens14f0np0) need failover
Jun  6 04:44:31 localhost kernel: LNetError: 23213:0:(o2iblnd.c:819:kiblnd_create_conn()) cmid HCA(mlx5_0), kib_dev(ens14f0np0) need failover
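For anyone comparing against their own logs: the message names an HCA (mlx5_0) and a network device (ens14f0np0). The following is a minimal sketch, assuming the message format shown above, that pulls the two names out of kernel log lines so they can be compared with what `lnetctl net show` reports. The function name `o2ib_failover_devs` is hypothetical and not part of Lustre.

```shell
#!/bin/sh
# Hypothetical helper (not a Lustre tool): read kernel log lines on stdin
# and print "HCA netdev" for each o2iblnd "need failover" message.
o2ib_failover_devs() {
    sed -n 's/.*cmid HCA(\([^)]*\)), kib_dev(\([^)]*\)) need failover.*/\1 \2/p'
}

# Example using the first log line from this report.
echo 'Jun  6 04:44:25 localhost kernel: LNetError: 23212:0:(o2iblnd.c:819:kiblnd_create_conn()) cmid HCA(mlx5_0), kib_dev(ens14f0np0) need failover' \
    | o2ib_failover_devs
```

On a live system the input could instead come from `dmesg` or the system log file.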


I found a similar issue in the community JIRA, but the mount still failed after I tried reloading the module:
[LU-7124] MLX5: Limit hit in cap.max_send_wr - Whamcloud Community JIRA


May I ask what is causing this and what changes are needed to solve the problem?


-- Shuobin


The following is my configuration and formatting process:
  


(Configuration screenshots for node1 and node2 were attached as images.)

mkfs.lustre --fsname=ltfs1 --mgs --mdt --index=0 --servicenode=192.168.19.14@o2ib1 --servicenode=192.168.19.15@o2ib1 --reformat --mkfsoptions "-E stride=32" /dev/disk/by-id/scsi-3600b3420371420b645dde4066c0000a8

mkfs.lustre --fsname=ltfs1 --mdt --index=1 --mgsnode=192.168.19.14@o2ib1 --mgsnode=192.168.19.15@o2ib1 --failnode=192.168.19.15@o2ib1 --reformat --mkfsoptions "-E stride=32" /dev/disk/by-id/scsi-3600b3420371420b645dde5093e0000a9

mkfs.lustre --fsname=ltfs1 --mdt --index=2 --mgsnode=192.168.19.15@o2ib1 --mgsnode=192.168.19.14@o2ib1 --failnode=192.168.19.14@o2ib1 --reformat --mkfsoptions "-E stride=32" /dev/disk/by-id/scsi-3600b3420371420b645dde619060000aa

mkfs.lustre --fsname=ltfs1 --mdt --index=3 --mgsnode=192.168.19.15@o2ib1 --mgsnode=192.168.19.14@o2ib1 --failnode=192.168.19.14@o2ib1 --reformat --mkfsoptions "-E stride=32" /dev/disk/by-id/scsi-3600b3420371420b645dde7367f0000ab
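The three additional MDT commands above differ only in index, NID order, failover NID, and device. The sketch below prints them from those parameters without running anything, so the NID order can be checked by eye. It assumes the archive's " at " stands for "@" in the NIDs; the function name `print_mdt_mkfs` and the variables `N1`, `N2`, and `BYID` are hypothetical.

```shell
#!/bin/sh
# Print (do not run) the mkfs.lustre command for one additional MDT.
# Arguments: index, first MGS NID, second MGS NID, failover NID, device.
print_mdt_mkfs() {
    printf 'mkfs.lustre --fsname=ltfs1 --mdt --index=%s --mgsnode=%s --mgsnode=%s --failnode=%s --reformat --mkfsoptions "-E stride=32" %s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# Assumed NIDs: " at " in the archived text is read as "@".
N1=192.168.19.14@o2ib1
N2=192.168.19.15@o2ib1
BYID=/dev/disk/by-id

print_mdt_mkfs 1 "$N1" "$N2" "$N2" "$BYID/scsi-3600b3420371420b645dde5093e0000a9"
print_mdt_mkfs 2 "$N2" "$N1" "$N1" "$BYID/scsi-3600b3420371420b645dde619060000aa"
print_mdt_mkfs 3 "$N2" "$N1" "$N1" "$BYID/scsi-3600b3420371420b645dde7367f0000ab"
```

Because the function only prints, it is safe to run on a node that already holds formatted targets.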

node1

mount -t lustre /dev/disk/by-id/scsi-3600b3420371420b645dde4066c0000a8 /mnt/tfs/mgs

mount -t lustre /dev/disk/by-id/scsi-3600b3420371420b645dde5093e0000a9 /mnt/tfs/mgs1

node2

mount -t lustre /dev/disk/by-id/scsi-3600b3420371420b645dde619060000aa /mnt/tfs/mgs2

mount -t lustre /dev/disk/by-id/scsi-3600b3420371420b645dde7367f0000ab /mnt/tfs/mgs3
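Before the mount on node2, LNet reachability between the two servers could be checked along these lines. This is a sketch rather than a verified procedure for this setup: it assumes the NIDs are 192.168.19.14@o2ib1 and 192.168.19.15@o2ib1, and the wrapper name `check_lnet` is hypothetical. It uses the standard `lctl list_nids` and `lctl ping` commands.

```shell
#!/bin/sh
# Hypothetical pre-mount check: show the local NIDs, then LNet-ping each
# NID given as an argument. Skips quietly if lctl is not installed.
check_lnet() {
    if ! command -v lctl >/dev/null 2>&1; then
        echo "lctl not found; skipping LNet checks"
        return 0
    fi
    lctl list_nids                      # NIDs configured on this node
    for nid in "$@"; do
        lctl ping "$nid" || echo "no LNet reply from $nid"
    done
}

# Assumed NIDs for node1 and node2.
check_lnet 192.168.19.14@o2ib1 192.168.19.15@o2ib1
```

If the ping to the MGS NID fails from node2, the mount timeout would more likely be an LNet-level problem than a target configuration problem.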


















-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 259734 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20230606/03375874/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 5976 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20230606/03375874/attachment-0003.png>

