[lustre-discuss] Lustre on Mellanox Multi-Host InfiniBand problem
Oucharek, Doug S
doug.s.oucharek at intel.com
Mon May 8 08:31:51 PDT 2017
I’m currently investigating a problem with MOFED 4.x which seems very similar to what you are seeing. I have no solution yet.
Doug
On May 7, 2017, at 7:09 AM, HM Li <lihm0 at 163.com> wrote:
Thank you very much.
The MLNX OFED version used on the Multi-Host nodes is MLNX_OFED_LINUX-4.0-1.0.1.0-rhel7.3-x86_64.
This driver and Lustre (git, 2.9.55_45) work well on other, normal FDR nodes.
On May 6, 2017, at 01:14, Oucharek, Doug S wrote:
The tag you checked out is missing this fix: https://review.whamcloud.com/#/c/24306/. Try applying that.
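One way to apply it to a local tree, as a rough sketch (the project path and patch-set number below are assumptions; the exact fetch ref is listed under the change's Download links in Gerrit):
cd lustre-release
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/06/24306/1
git cherry-pick FETCH_HEAD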
Doug
On May 5, 2017, at 9:51 AM, HM Li <lihm0 at 163.com> wrote:
Confirmed.
This is a bug in the git version (2.9.55_45); it works well when using MLNX_OFED_LINUX-3.4-2.1.8.0-rhel7.3-x86_64 and https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el7.3.1611/server/SRPMS/lustre-2.9.0-1.src.rpm.
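(For reference, a minimal sketch of turning that source RPM into installable packages; extra configure options for building against the MOFED o2ib stack may be needed and are not shown here:)
rpmbuild --rebuild lustre-2.9.0-1.src.rpm
ls ~/rpmbuild/RPMS/x86_64/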
node453@tc4600: ~# export LC_ALL=C
node453@tc4600: ~# lctl lustre_build_version
Lustre version: 2.9.0
node453@tc4600: ~# df
Filesystem 1K-blocks Used Available Use% Mounted on
10.10.100.6@o2ib:/lxfs 7341068688 10152136 6960149976 1% /home
10.10.100.1@o2ib:/sgfs 108704716104 24800951320 78393702908 25% /mnt
node453@tc4600: ~# ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.17.1010
node_guid: 46e3:e861:1f19:4438
sys_image_guid: 46e3:e861:1f19:4438
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: SGN1130110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 360
port_lid: 362
port_lmc: 0x00
link_layer: InfiniBand
On May 3, 2017, at 17:02, HM Li wrote:
Dear all,
I have set up Lustre (git, 2.9.55_45) on CentOS 7.3, but the client (on the same Multi-Host IB adapter) cannot mount Lustre. Can you help me? Thank you very much. (An LNet configuration sketch follows the details below.)
* The server:
* mkfs.lustre --fsname=lxfs --mgs --mdt --index=0 --reformat /dev/sda5
* mkfs.lustre --fsname=lxfs --mgsnode=10.10.146.1@o2ib1 --servicenode=10.10.146.1@o2ib1 --ost --reformat --index=1 /dev/sda6
* mount -t lustre /dev/sda5 /mnt/mdt
* mount -t lustre /dev/sda6 /mnt/ost
* mount -t lustre -v 10.10.146.1@o2ib1:/lxfs /home is OK.
* lctl list_nids
10.10.146.1@o2ib1
* lctl ping 10.10.146.2@o2ib1
12345-0@lo
12345-10.10.146.2@o2ib1
* The client:
* lctl list_nids
10.10.146.2@o2ib1
* lctl ping 10.10.146.1@o2ib1
12345-0@lo
12345-10.10.146.1@o2ib1
* mount -t lustre -v 10.10.146.1@o2ib1:/lxfs /home/
arg[0] = /sbin/mount.lustre
arg[1] = -v
arg[2] = -o
arg[3] = rw
arg[4] = 10.10.146.1@o2ib1:/lxfs
arg[5] = /home
source = 10.10.146.1@o2ib1:/lxfs (10.10.146.1@o2ib1:/lxfs), target = /home
options = rw
mounting device 10.10.146.1@o2ib1:/lxfs at /home, flags=0x1000000 options=device=10.10.146.1@o2ib1:/lxfs
mount.lustre: mount 10.10.146.1@o2ib1:/lxfs at /home failed: Input/output error retries left: 0
mount.lustre: mount 10.10.146.1@o2ib1:/lxfs at /home failed: Input/output error
Is the MGS running?
* and now on server dmesg show:
[82709.336007] Lustre: MGS: Connection restored to 792b2b21-2e57-de7d-3d8f-5e80eb6d7bf2 (at 10.10.146.2@o2ib1)
[82709.339324] mlx5_0:dump_cqe:275:(pid 22740): dump error cqe
[82709.339508] 00000000 00000000 00000000 00000000
[82709.339677] 00000000 00000000 00000000 00000000
[82709.339841] 00000000 00000000 00000000 00000000
[82709.340006] 00000000 9d005304 08000074 01f1c5d2
[82716.343333] Lustre: MGS: Received new LWP connection from 10.10.146.2@o2ib1, removing former export from same NID
[82716.343712] Lustre: MGS: Connection restored to 792b2b21-2e57-de7d-3d8f-5e80eb6d7bf2 (at 10.10.146.2@o2ib1)
* IB information:
* ibstat
CA 'mlx5_0'
CA type: MT4115
Number of ports: 1
Firmware version: 12.17.1010
Hardware version: 0
Node GUID: 0x46e3e8611f19443a
System image GUID: 0x46e3e8611f194438
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 360
LMC: 2
SM lid: 360
Capability mask: 0x2651e84a
Port GUID: 0x46e3e8611f19443a
Link layer: InfiniBand
* ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.17.1010
node_guid: 46e3:e861:1f19:443a
sys_image_guid: 46e3:e861:1f19:4438
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: SGN1130110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 360
port_lid: 360
port_lmc: 0x02
link_layer: InfiniBand
* ibstatus
Infiniband device 'mlx5_0' port 1 status:
default gid: fe80:0000:0000:0000:46e3:e861:1f19:443a
base lid: 0x168
sm lid: 0x168
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (4X EDR)
link_layer: InfiniBand
* opensm --create-config /etc/opensm/opensm.conf
* /etc/opensm/opensm.conf has been modified:
virt_enabled 2
qos TRUE
lmc 2
* Other information:
* selinux disabled
* iptables cleaned
* uname -r: 3.10.0-514.16.1.el7_lustre.x86_64
* OS: CentOS 7.3
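For completeness, the @o2ib1 NIDs above assume an LNet network mapped to the IB port in the module options on both server and client; a minimal sketch of that configuration (the ib0 interface name is an assumption):
# /etc/modprobe.d/lustre.conf
options lnet networks="o2ib1(ib0)"
After changing it, unload and reload the LNet/Lustre modules and re-check lctl list_nids.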
_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org