[lustre-discuss] Corrupted? MDT not mounting
Andrew Elwell
andrew.elwell at gmail.com
Sun May 8 17:54:21 PDT 2022
On Fri, 6 May 2022 at 20:04, Andreas Dilger <adilger at whamcloud.com> wrote:
> MOFED is usually preferred over in-kernel OFED, it is just tested and fixed a lot more.
Fair enough. However, is the 2.12.8-ib tree built with all the features?
specifically https://downloads.whamcloud.com/public/lustre/lustre-2.12.8-ib/MOFED-4.9-4.1.7.0/el7/server/
If I compare the ib_srp module from 2.12 in-kernel
[root at astrofs-oss3 ~]# find /lib/modules/`uname -r` -name ib_srp.ko.xz
/lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniband/ulp/srp/ib_srp.ko.xz
[root at astrofs-oss3 ~]# rpm -qf /lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniband/ulp/srp/ib_srp.ko.xz
kernel-3.10.0-1160.49.1.el7_lustre.x86_64
[root at astrofs-oss3 ~]# modinfo ib_srp
filename: /lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniband/ulp/srp/ib_srp.ko.xz
license: Dual BSD/GPL
description: InfiniBand SCSI RDMA Protocol initiator
author: Roland Dreier
retpoline: Y
rhelversion: 7.9
srcversion: 1FB80E3A962EE7F39AD3959
depends: ib_core,scsi_transport_srp,ib_cm,rdma_cm
intree: Y
vermagic: 3.10.0-1160.49.1.el7_lustre.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key: FA:A3:27:4B:D9:17:36:F0:FD:43:6A:42:1B:6A:A4:FA:FE:D0:AC:FA
sig_hashalgo: sha256
parm: srp_sg_tablesize:Deprecated name for cmd_sg_entries (uint)
parm: cmd_sg_entries:Default number of gather/scatter
entries in the SRP command (default is 12, max 255) (uint)
parm: indirect_sg_entries:Default max number of
gather/scatter entries (default is 12, max is 2048) (uint)
parm: allow_ext_sg:Default behavior when there are more than
cmd_sg_entries S/G entries after mapping; fails the request when false
(default false) (bool)
parm: topspin_workarounds:Enable workarounds for
Topspin/Cisco SRP target bugs if != 0 (int)
parm: prefer_fr:Whether to use fast registration if both FMR
and fast registration are supported (bool)
parm: register_always:Use memory registration even for
contiguous memory regions (bool)
parm: never_register:Never register memory (bool)
parm: reconnect_delay:Time between successive reconnect attempts
parm: fast_io_fail_tmo:Number of seconds between the
observation of a transport layer error and failing all I/O. "off"
means that this functionality is disabled.
parm: dev_loss_tmo:Maximum number of seconds that the SRP
transport should insulate transport layer errors. After this time has
been exceeded the SCSI host is removed. Should be between 1 and
SCSI_DEVICE_BLOCK_MAX_TIMEOUT if fast_io_fail_tmo has not been set.
"off" means that this functionality is disabled.
parm: ch_count:Number of RDMA channels to use for
communication with an SRP target. Using more than one channel improves
performance if the HCA supports multiple completion vectors. The
default value is the minimum of four times the number of online CPU
sockets and the number of completion vectors supported by the HCA.
(uint)
parm: use_blk_mq:Use blk-mq for SRP (bool)
[root at astrofs-oss3 ~]#
... it all looks normal and capable of mounting our ExaScaler LUNs.
Compare that with the one from 2.12.8-ib:
=======================================================================================
 Package                  Arch    Version                      Repository          Size
=======================================================================================
Installing:
 kernel                   x86_64  3.10.0-1160.49.1.el7_lustre  lustre-2.12-mofed   50 M
 kmod-lustre-osd-ldiskfs  x86_64  2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed  469 k
 lustre                   x86_64  2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed  805 k
Installing for dependencies:
 kmod-lustre              x86_64  2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed  3.9 M
 kmod-mlnx-ofa_kernel     x86_64  4.9-OFED.4.9.4.1.7.1         lustre-2.12-mofed  1.3 M
 lustre-osd-ldiskfs-mount x86_64  2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed   15 k
 mlnx-ofa_kernel          x86_64  4.9-OFED.4.9.4.1.7.1         lustre-2.12-mofed  108 k
[root at astrofs-oss1 ~]# find /lib/modules/`uname -r` -name ib_srp.ko.xz
/lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniband/ulp/srp/ib_srp.ko.xz
[root at astrofs-oss1 ~]# rpm -qf /lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniband/ulp/srp/ib_srp.ko.xz
kernel-3.10.0-1160.49.1.el7_lustre.x86_64
[root at astrofs-oss1 ~]# modinfo ib_srp
filename: /lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/extra/mlnx-ofa_kernel/drivers/infiniband/ulp/srp/ib_srp.ko
version: 4.9-4.1.7
license: Dual BSD/GPL
description: ib_srp dummy kernel module
author: Alaa Hleihel
retpoline: Y
rhelversion: 7.9
srcversion: 9ACAA2F5216D9D9FC379EC8
depends: mlx_compat
vermagic: 3.10.0-1160.49.1.el7_lustre.x86_64 SMP mod_unload modversions
[root at astrofs-oss1 ~]#
which doesn't seem to actually be able to take any of the normal ib_srp parameters:
[root at astrofs-oss1 ~]# modprobe ib_srp
modprobe: ERROR: could not insert 'ib_srp': Unknown symbol in module,
or unknown parameter (see dmesg)
[ 238.194931] ib_srp: Unknown parameter `cmd_sg_entries'
etc
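The difference between the two modinfo listings above can be checked mechanically. The following is only a rough sketch: classify_modinfo is a hypothetical helper name, and treating a "dummy" description or the absence of parm: lines as a sign of a stub module is a heuristic assumed here, not something MOFED or Lustre documents.

```shell
#!/bin/sh
# classify_modinfo (hypothetical helper): read `modinfo` output on stdin and
# print one of:
#   dummy           the description: field mentions "dummy"
#   no-parameters   no parm: lines were listed
#   has-parameters  at least one parm: line was listed
# This is a heuristic; a real module may legitimately have no parameters.
classify_modinfo() {
    awk '
        /^description:/ && tolower($0) ~ /dummy/ { dummy = 1 }
        /^parm:/ { parms++ }
        END {
            if (dummy) print "dummy"
            else if (parms == 0) print "no-parameters"
            else print "has-parameters"
        }
    '
}
```

On a node it would be fed as "modinfo ib_srp | classify_modinfo". Going by the listings above, that should report has-parameters on astrofs-oss3 and dummy on astrofs-oss1.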
Any suggestions? I quickly tried installing another mlnx-ofa_kernel
(from http://downloads.linux.hpe.com/SDR/repo/mlnx_ofed/RHEL/7.9/x86_64/4.9-4.1.7.0/)
but got the same dummy module.
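Note that on astrofs-oss1 the find command still locates the in-kernel ib_srp.ko.xz, while modinfo resolves the name to the copy under extra/mlnx-ofa_kernel. A minimal sketch for asking which file the module tools actually pick, and which package owns it, is below. module_origin is a hypothetical helper, and the directory names it matches (extra, updates, weak-updates, kernel) are common conventions assumed here, not guarantees.

```shell
#!/bin/sh
# module_origin (hypothetical helper): classify a module file path by the
# directory it sits in under /lib/modules/<kernel version>/.
module_origin() {
    case "$1" in
        */extra/*|*/updates/*|*/weak-updates/*) echo "out-of-tree" ;;
        */kernel/*) echo "in-tree" ;;
        *) echo "unknown" ;;
    esac
}

# Query the live system only if modinfo can resolve ib_srp.
# `modinfo -n` prints the filename modprobe would load.
if path=$(modinfo -n ib_srp 2>/dev/null); then
    echo "$path: $(module_origin "$path")"
    # Show the owning package where rpm is available.
    rpm -qf "$path" 2>/dev/null || true
fi
```

If the resolved path is out-of-tree, the owning package reported by rpm -qf is the one supplying the module that modprobe will try to load, regardless of what find turns up elsewhere in the tree.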