[lustre-discuss] Lustre on Mellanox multi-host InfiniBand problem

Oucharek, Doug S doug.s.oucharek at intel.com
Mon May 8 08:31:51 PDT 2017


I’m currently investigating a problem with MOFED 4.x which seems very similar to what you are seeing. I have no solution yet.

Doug

On May 7, 2017, at 7:09 AM, HM Li <lihm0 at 163.com> wrote:


Thank you very much.

The MLNX_OFED release used on the Multi-Host nodes is MLNX_OFED_LINUX-4.0-1.0.1.0-rhel7.3-x86_64.

This driver and Lustre (git, 2.9.55_45) work well together on other, ordinary FDR nodes.

On 2017-05-06 01:14, Oucharek, Doug S wrote:
The tag you checked out is missing this fix: https://review.whamcloud.com/#/c/24306/.  Try applying that.

Doug

On May 5, 2017, at 9:51 AM, HM Li <lihm0 at 163.com> wrote:


Confirmed.

This is a bug in the git version (2.9.55_45). Mounting works well when using MLNX_OFED_LINUX-3.4-2.1.8.0-rhel7.3-x86_64 and https://downloads.hpdd.intel.com/public/lustre/latest-feature-release/el7.3.1611/server/SRPMS/lustre-2.9.0-1.src.rpm.

node453@tc4600: ~# export LC_ALL=C
node453@tc4600: ~# lctl lustre_build_version
Lustre version: 2.9.0
node453@tc4600: ~# df
Filesystem                1K-blocks        Used   Available Use% Mounted on
10.10.100.6@o2ib:/lxfs   7341068688    10152136  6960149976   1% /home
10.10.100.1@o2ib:/sgfs 108704716104 24800951320 78393702908  25% /mnt
node453@tc4600: ~# ibv_devinfo
hca_id:    mlx5_0
    transport:            InfiniBand (0)
    fw_ver:                12.17.1010
    node_guid:            46e3:e861:1f19:4438
    sys_image_guid:            46e3:e861:1f19:4438
    vendor_id:            0x02c9
    vendor_part_id:            4115
    hw_ver:                0x0
    board_id:            SGN1130110032
    phys_port_cnt:            1
    Device ports:
        port:    1
            state:            PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:        4096 (5)
            sm_lid:            360
            port_lid:        362
            port_lmc:        0x00
            link_layer:        InfiniBand
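
When comparing this ibv_devinfo dump with the one in the original report quoted further down, it can help to reduce each dump to one line per port. The following is a minimal sketch, assuming output laid out like the dumps in this thread. port_summary is a hypothetical helper name, not part of the verbs utilities. Of the fields it prints, port_lmc differs between the two dumps (0x00 here, 0x02 in the original report); the thread does not establish whether that is related to the mount failure.

```shell
#!/usr/bin/env bash
# port_summary (hypothetical helper): read ibv_devinfo-style text on stdin and
# print "<hca_id> port <n> lid=<port_lid> lmc=<port_lmc>" for each port.
# Assumes port_lid appears before port_lmc, as in the dumps in this thread.
port_summary() {
    awk '
        $1 == "hca_id:"   { hca = $2 }
        $1 == "port:"     { port = $2 }
        $1 == "port_lid:" { lid = $2 }
        $1 == "port_lmc:" { printf "%s port %s lid=%s lmc=%s\n", hca, port, lid, $2 }
    '
}

# Usage on a live node (requires the verbs utilities to be installed):
#   ibv_devinfo | port_summary
```

Running the same pipeline on a working node and on a failing node and diffing the two lines gives a quick view of which port attributes differ.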


On 2017-05-03 17:02, HM Li wrote:
Dear all,
I have set up Lustre from git (2.9.55_45) on CentOS 7.3, but the client (on the same multi-host IB) can't mount the Lustre filesystem. Can you help me? Thank you very much.

  *   The server:
     *   mkfs.lustre --fsname=lxfs --mgs --mdt --index=0  --reformat /dev/sda5
     *   mkfs.lustre --fsname=lxfs --mgsnode=10.10.146.1@o2ib1 --servicenode=10.10.146.1@o2ib1 --ost --reformat --index=1 /dev/sda6
     *   mount -t lustre /dev/sda5 /mnt/mdt
     *   mount -t lustre /dev/sda6 /mnt/ost
     *   mount -t lustre -v 10.10.146.1@o2ib1:/lxfs /home (mounting on the server itself works)
     *   lctl list_nids
10.10.146.1@o2ib1
     *   lctl ping 10.10.146.2@o2ib1
12345-0@lo
12345-10.10.146.2@o2ib1

  *   The client:
     *   lctl list_nids
10.10.146.2@o2ib1
     *   lctl ping 10.10.146.1@o2ib1
12345-0@lo
12345-10.10.146.1@o2ib1
     *   mount -t lustre -v 10.10.146.1@o2ib1:/lxfs /home/
arg[0] = /sbin/mount.lustre
arg[1] = -v
arg[2] = -o
arg[3] = rw
arg[4] = 10.10.146.1@o2ib1:/lxfs
arg[5] = /home
source = 10.10.146.1@o2ib1:/lxfs (10.10.146.1@o2ib1:/lxfs), target = /home
options = rw
mounting device 10.10.146.1@o2ib1:/lxfs at /home, flags=0x1000000 options=device=10.10.146.1@o2ib1:/lxfs
mount.lustre: mount 10.10.146.1@o2ib1:/lxfs at /home failed: Input/output error retries left: 0
mount.lustre: mount 10.10.146.1@o2ib1:/lxfs at /home failed: Input/output error
Is the MGS running?

  *   The server's dmesg now shows:
[82709.336007] Lustre: MGS: Connection restored to 792b2b21-2e57-de7d-3d8f-5e80eb6d7bf2 (at 10.10.146.2@o2ib1)
[82709.339324] mlx5_0:dump_cqe:275:(pid 22740): dump error cqe
[82709.339508] 00000000 00000000 00000000 00000000
[82709.339677] 00000000 00000000 00000000 00000000
[82709.339841] 00000000 00000000 00000000 00000000
[82709.340006] 00000000 9d005304 08000074 01f1c5d2
[82716.343333] Lustre: MGS: Received new LWP connection from 10.10.146.2@o2ib1, removing former export from same NID
[82716.343712] Lustre: MGS: Connection restored to 792b2b21-2e57-de7d-3d8f-5e80eb6d7bf2 (at 10.10.146.2@o2ib1)
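
To see how often the "dump error cqe" message above recurs across mount attempts, the kernel log can be summarised per device. This is a minimal sketch, assuming log lines shaped like the ones in this report (an optional bracketed timestamp, then the device name up to the first colon). count_error_cqes is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# count_error_cqes (hypothetical helper): read kernel log text on stdin and
# print "<device> <count>" for every device that logged "dump error cqe".
count_error_cqes() {
    awk '
        /dump error cqe/ {
            line = $0
            # Drop a leading "[seconds.micros] " timestamp if one is present.
            sub(/^\[[ 0-9.]*\] */, "", line)
            # The device name is the text before the first colon, e.g. mlx5_0.
            split(line, parts, ":")
            count[parts[1]]++
        }
        END { for (dev in count) printf "%s %d\n", dev, count[dev] }
    '
}

# Usage on the server after a failed client mount:
#   dmesg | count_error_cqes
```

A count that rises by one with every failed client mount would suggest the error completion is tied to the connection attempt rather than to unrelated traffic.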

  *   IB information:
     *   ibstat
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.17.1010
    Hardware version: 0
    Node GUID: 0x46e3e8611f19443a
    System image GUID: 0x46e3e8611f194438
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 360
        LMC: 2
        SM lid: 360
        Capability mask: 0x2651e84a
        Port GUID: 0x46e3e8611f19443a
        Link layer: InfiniBand
     *   ibv_devinfo
hca_id:    mlx5_0
    transport:            InfiniBand (0)
    fw_ver:                12.17.1010
    node_guid:            46e3:e861:1f19:443a
    sys_image_guid:            46e3:e861:1f19:4438
    vendor_id:            0x02c9
    vendor_part_id:            4115
    hw_ver:                0x0
    board_id:            SGN1130110032
    phys_port_cnt:            1
    Device ports:
        port:    1
            state:            PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:        4096 (5)
            sm_lid:            360
            port_lid:        360
            port_lmc:        0x02
            link_layer:        InfiniBand
     *   ibstatus
Infiniband device 'mlx5_0' port 1 status:
    default gid:     fe80:0000:0000:0000:46e3:e861:1f19:443a
    base lid:     0x168
    sm lid:         0x168
    state:         4: ACTIVE
    phys state:     5: LinkUp
    rate:         100 Gb/sec (4X EDR)
    link_layer:     InfiniBand
     *   opensm --create-config /etc/opensm/opensm.conf
     *   /etc/opensm/opensm.conf has been modified:
virt_enabled 2
qos TRUE
lmc 2
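
To confirm which values are actually in effect in the modified file, the relevant keys can be read back from it. This is a minimal sketch, assuming the plain "key value" line format with "#" comments that opensm --create-config generates. conf_value is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# conf_value FILE KEY (hypothetical helper): print the value from the last
# non-comment "KEY VALUE" line in FILE; print nothing if KEY is not set.
conf_value() {
    awk -v key="$2" '
        /^[ \t]*#/ { next }          # skip comment lines
        $1 == key  { val = $2 }      # later occurrences override earlier ones
        END        { if (val != "") print val }
    ' "$1"
}

# Usage:
#   conf_value /etc/opensm/opensm.conf lmc
#   conf_value /etc/opensm/opensm.conf qos
```

Checking the file this way only shows what is configured on disk; a running opensm would still need to be restarted before it picks up any changes.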

  *   Other information:
     *   selinux disabled
     *   iptables cleaned
     *   uname -r: 3.10.0-514.16.1.el7_lustre.x86_64
     *   OS: CentOS 7.3



_______________________________________________
lustre-discuss mailing list
lustre-discuss at lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




