[lustre-discuss] Installing lustre 2.15.6 server on rhel-8.10 fails
Audet, Martin
Martin.Audet at cnrc-nrc.gc.ca
Wed Apr 23 19:25:33 PDT 2025
Hello Carlos,
I'm sorry that it didn't work.
One question: are you using the precompiled Lustre RPMs (e.g. those available from: https://downloads.whamcloud.com/public/lustre/lustre-2.15.6/ ) or are you compiling your own RPMs from the Lustre git repository ( https://github.com/lustre/lustre-release ) ?
In our case we use the second approach and I think it is better for two reasons:
1- You make sure that everything is consistent, especially with your MOFED environment
2- You are not forced to use exactly the versions corresponding to tags; you can choose any commit available in the git repository or cherry-pick the fixes you think are useful (more details on this later).
Last week we upgraded a small HPC cluster that uses RHEL 8 for the file server and RHEL 9 for the clients. The upgrade was successful and so far we have had no problems related to MOFED, Lustre, PMIx, Slurm or MPI (including MPI-IO).
Our upgrade is described in a message posted on this mailing list on April 7th:
http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2025-April/019471.html
As you will see there, we also plan to add storage (OSTs) soon by connecting a new MSA 2060 to our file server (this file server plays the roles of MGS, MDS and OSS). You will also see that we didn't compile Lustre 2.15.6 exactly: we compiled a commit on the 2.15 branch containing 2.15.6 plus three additional patches, including LU-18085. Many users running 2.15.6 without this patch (LU-18085) complained on lustre-discuss, and unfortunately it was added to the 2.15 branch only a few days after 2.15.6 was released. See, for example, this thread on the lustre-discuss mailing list:
http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2025-April/019474.html
I will now outline our procedure for getting Lustre onto our RHEL 8.10 server. It may be overkill, but I think it takes all the precautions, and it worked in our case:
1. Install RHEL 8.10 on the system using the base kernel you want to patch (4.18.0-553.27.1 in our case). Don't forget kernel-headers (for compiling MOFED) and have the kernel source RPM available (to compile the patched kernel)
2. Compile the MOFED RPMs corresponding to the MOFED version you chose (e.g. 24.10-2.1.8.0-LTS) using the mlnx_add_kernel_support.sh script with the --kmp option
3. Install the MOFED RPMs (they will uninstall OFED from the Linux distro) (in our case we install: mlnx-ofed-all knem mlnxofed-docs libxpmem-devel)
4. Reboot (to activate the new MOFED)
5. Test MOFED
6. Compile the RPMs for the Lustre-patched kernel (you will need the kernel source)
7. Put the resulting RPMs on a web server and set up an RPM repository (createrepo_c) so that they can be used during the next system installation
8. Re-install the system, making sure that your kickstart file refers to the repository containing the Lustre-patched kernel RPMs (they must hide the corresponding distro RPMs), and reboot
9. Repeat step 2 to compile a new MOFED, since the patched kernel is different
10. Repeat steps 3 and 4; your system will now have a MOFED that corresponds exactly to your kernel patched for Lustre and not to the base kernel (which is not even installed on the system)
11. Repeat step 5 to test the MOFED on the new kernel
12. Compile the server-specific RPMs related to Lustre
13. Install those server RPMs (in our case: kmod-lustre kmod-lustre-osd-ldiskfs lustre{,-devel} lustre-iokit lustre-osd-ldiskfs-mount)
14. Configure Lustre (e.g. /etc/lnet.conf, /etc/fstab, enable lnet.service)
15. Reboot
16. With a little luck, the Lustre server should now be operational
I hope this helps, good luck!
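As an illustration of steps 7 and 8, here is a minimal sketch of how the repository holding the patched kernel RPMs could be declared on a system. The helper name repo_stanza, the repository id, the directory and the URL are all hypothetical; priority=10 is just one possible way to let these RPMs win over the distro ones (dnf prefers the repository with the lowest priority value), and gpgcheck=0 assumes the locally built RPMs are unsigned. The kickstart file itself would reference the same URL.

```shell
# Sketch: print a dnf .repo stanza for a locally built RPM repository.
# repo_stanza, the repository id and the URL used below are hypothetical.
repo_stanza() {
    local id="$1" baseurl="$2"
    cat <<EOF
[$id]
name=$id
baseurl=$baseurl
enabled=1
gpgcheck=0
priority=10
EOF
}

# Example usage (run as root on the relevant machines):
#   createrepo_c /var/www/html/lustre-kernel      # index the RPMs (step 7)
#   repo_stanza lustre-kernel http://repo.example/lustre-kernel \
#       > /etc/yum.repos.d/lustre-kernel.repo
```

The function only prints text, so the result can be reviewed before it is written anywhere.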
Martin Audet
________________________________
From: Carlos Adean <carlosadean at linea.org.br>
Sent: April 23, 2025 9:06 PM
To: Audet, Martin; lustre-discuss at lists.lustre.org
Cc: Eloir Troyack
Subject: EXT: Re: [lustre-discuss] Installing lustre 2.15.6 server on rhel-8.10 fails
Hello Martin,
Thank you for the hint.
I tried rebuilding using the suggested parameter, but the warnings persist.
Additionally, the system still fails to boot using the lustre kernel.
We noticed that the initramfs image of the Lustre kernel does not contain the megaraid_sas module, which the system needs to drive the Dell PERC H330 controller. This may be the cause of the boot failure.
[root@mds2 ~]# lsinitrd /boot/initramfs-4.18.0-553.27.1.el8_lustre.x86_64.img | grep megaraid_sas
[root@mds2 ~]#
However, the module is present in the initramfs image of the kernel installed via dnf.
[root@mds2 ~]# lsinitrd /boot/initramfs-4.18.0-553.27.1.el8_10.x86_64.img | grep megaraid_sas
-rw-r--r-- 1 root root 72560 Jan 15 2024 usr/lib/modules/4.18.0-553.27.1.el8_10.x86_64/kernel/drivers/scsi/megaraid/megaraid_sas.ko.xz
[root@mds2 ~]#
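The same check can be scripted. The sketch below is only an illustration: listing_has_module is a hypothetical helper (not part of dracut) that reads an lsinitrd listing on stdin. The commented-out rebuild with dracut --add-drivers assumes that megaraid_sas is actually installed under /lib/modules for the Lustre kernel, which has not been verified here.

```shell
# Sketch: succeed if an lsinitrd listing (read from stdin) contains the
# given kernel module, compressed or not. Hypothetical helper name.
listing_has_module() {
    grep -Eq "/$1\.ko(\.[a-z0-9]+)?\$"
}

# Example usage (run as root). The rebuild is only useful if the module
# exists under /lib/modules/$kver, which is an assumption:
#   kver=4.18.0-553.27.1.el8_lustre.x86_64
#   lsinitrd "/boot/initramfs-$kver.img" | listing_has_module megaraid_sas ||
#       dracut --force --add-drivers megaraid_sas "/boot/initramfs-$kver.img" "$kver"
```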
I'm still here struggling to install it.
---
Carlos Adean
www.linea.org.br
On Wed, Apr 23, 2025 at 09:22, Audet, Martin <Martin.Audet at cnrc-nrc.gc.ca> wrote:
Hello,
I think I had a similar problem a long time ago, and it was solved by adding the "--kmp" option to the "mlnx_add_kernel_support.sh" script when compiling the MOFED RPMs. Without this option, the MOFED RPM compilation completes without problems, and so does the Lustre RPM compilation, but later, when installing the Lustre RPMs, we get a bunch of problems related to symbols.
Here is how I compile the MOFED RPMs (using the root account):
# mount_dir is the temporary mount directory
# ofed_iso is the MOFED .iso file
#
mkdir -p -- "$mount_dir"
mount -o ro,loop "$ofed_iso" "$mount_dir"
"$mount_dir"/mlnx_add_kernel_support.sh -y --make-tgz --kmp -k "$(uname -r)" -m "$mount_dir"
#
# The compiled RPMs are now under /tmp
# ex: /tmp/MLNX_OFED_LINUX-24.10-2.1.8.0-rhel8.10.x86_64-ext.tgz
It seems that the pre-compiled RPMs distributed by Mellanox/NVIDIA are always generated with the --kmp option, but when using mlnx_add_kernel_support.sh this option must be specified explicitly. In addition, it seems that with the newer DOCA OFED, the script equivalent to mlnx_add_kernel_support.sh always adds the --kmp option on RHEL and similar distributions.
I hope it helps,
Martin
________________________________
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Carlos Adean via lustre-discuss <lustre-discuss at lists.lustre.org>
Sent: April 22, 2025 11:09 PM
To: lustre-discuss at lists.lustre.org
Cc: Eloir Troyack
Subject: EXT: [lustre-discuss] Installing lustre 2.15.6 server on rhel-8.10 fails
Hello all,
My current version of RHEL 8 is Rocky Linux 8.10, running the kernel 4.18.0-553.27.1.el8_10. I also have the OFED drivers version 24.10-2.1.8.0 installed for the InfiniBand interface (I tried without OFED before).
The installation of "kmod-lustre-2.15.6-1.el8" and "kmod-lustre-osd-ldiskfs-2.15.6-1" always shows the warning messages below.
# dnf --nogpgcheck --enablerepo=lustre-server install kmod-lustre kmod-lustre-osd-ldiskfs lustre-osd-ldiskfs-mount lustre lustre-resource-agents
[...]
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol __ib_alloc_pd
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_resolve_addr
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_dereg_mr_user
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_reject
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_disconnect
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol __rdma_create_kernel_id
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_register_event_handler
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_resolve_route
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_unregister_event_handler
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_bind_addr
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_create_qp
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_map_mr_sg
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_query_port
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_notify
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_listen
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_destroy_qp
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol __ib_create_cq
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_alloc_mr
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_connect_locked
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_set_reuseaddr
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_destroy_cq_user
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_modify_qp
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_dma_virt_map_sg
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_destroy_id
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol rdma_accept
depmod: WARNING: /lib/modules/4.18.0-553.27.1.el8_lustre.x86_64/extra/lustre/net/ko2iblnd.ko needs unknown symbol ib_dealloc_pd_user
[...]
Installed:
kernel-core-4.18.0-553.27.1.el8_lustre.x86_64
kmod-lustre-2.15.6-1.el8.x86_64
kmod-lustre-osd-ldiskfs-2.15.6-1.el8.x86_64
lustre-2.15.6-1.el8.x86_64
lustre-osd-ldiskfs-mount-2.15.6-1.el8.x86_64
lustre-resource-agents-2.15.6-1.el8.x86_64
Completed!
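To make a long warning list like this easier to compare against what the installed InfiniBand/RDMA modules provide, a small filter can reduce it to the distinct symbol names. This is only a sketch: the function name unknown_symbols and the log file name in the usage comment are illustrative, and it assumes the depmod warnings keep the exact wording shown above.

```shell
# Sketch: list the distinct symbols that depmod reports as unknown.
# Reads dnf/depmod output on stdin; prints one symbol per line, sorted.
# unknown_symbols is a hypothetical helper name.
unknown_symbols() {
    sed -n 's/^depmod: WARNING: .* needs unknown symbol \(.*\)$/\1/p' | sort -u
}

# Example usage (dnf-install.log is a hypothetical capture of the output):
#   unknown_symbols < dnf-install.log
```

Lines that are not depmod warnings (such as the "Installed:" summary) are ignored.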
After rebooting, the server drops into an emergency shell because it can't find the LVM devices. This issue only occurs with the Lustre kernel; the other installed kernels boot normally.
Any hints on how to proceed?
---
Carlos Adean
www.linea.org.br