[Lustre-discuss] Is OFED 'kernel-ib' required for o2ib on RHEL5?

Marco Aurelio L Gomes mgomes at tpn.usp.br
Tue Mar 23 08:13:10 PDT 2010


Ken,

Thank you very much for your post, it worked!

Regards,

Marco

On Tue, 2010-03-23 at 09:51 -0400, Ken Hornstein wrote:
> >You're right, I had problems with the module symbol versions using
> >Lustre 1.8.2 packages available at Sun website, kernel
> >2.6.18-164.11.1.el5 (RHEL 5.4) and OFED 1.5. The same problem happens
> >when using OFED 1.4.2.
> 
> So since this comes up now and then, I've cc'd the list.
> 
> You can Google around to find more about kernel symbol versioning.
> The short answer is that there is a CRC associated with each exported
> symbol in the loaded kernel, and that CRC (the symbol's "version") is
> recorded in the module when it is compiled.  That's all well and good,
> but figuring out what happens when it doesn't work is a pain, because
> the information isn't all in one place (and nobody has explained it
> well, at least that I've seen).
> 
> When a module (like Lustre) is compiled, it's pointed at a file called
> "Module.symvers"; that contains the versions of the symbols that
> modules are expected to link against, and those versions are recorded
> in the module object file.  When you get this mismatch at module load
> time, one of two things is happening: the "wrong" OFED is being loaded,
> or you linked against the "wrong" Module.symvers file.
> 
> How do you figure out which one is the problem?  Well, let's take a
> common OFED symbol, like rdma_connect.  You can find out the version of
> this symbol by grep'ing /proc/kallsyms.  On our system:
> 
> # grep rdma_connect /proc/kallsyms 
> ffffffffa0375510 u rdma_connect [ko2iblnd]
> ffffffffa0375510 u rdma_connect [rdma_ucm]
> ffffffffa0375510 u rdma_connect [ib_sdp]
> ffffffffa0377000 r __ksymtab_rdma_connect       [rdma_cm]
> ffffffffa0377225 r __kstrtab_rdma_connect       [rdma_cm]
> ffffffffa03770f0 r __kcrctab_rdma_connect       [rdma_cm]
> 000000000ef3a1e8 a __crc_rdma_connect   [rdma_cm]
> ffffffffa0375510 T rdma_connect [rdma_cm]
> 
> The symbol you care about is the absolute symbol, the one prefixed by
> __crc_.  So in this case, we are interested in __crc_rdma_connect, and
> that symbol's version is 0x0ef3a1e8.  This is the version in use by the
> currently running kernel.
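
A minimal sketch of the same lookup in Python, assuming kallsyms-style
text in the layout shown above (value, type, name, optional module).
The function name running_crc and the sample text are illustrative only,
not part of any Lustre or OFED tooling.

```python
def running_crc(kallsyms_text, symbol):
    """Return the CRC of `symbol` from /proc/kallsyms-style text, or None."""
    wanted = "__crc_" + symbol
    for line in kallsyms_text.splitlines():
        fields = line.split()
        # Assumed layout: <hex value> <type> <name> [<module>]
        if len(fields) >= 3 and fields[1].lower() == "a" and fields[2] == wanted:
            return int(fields[0], 16)
    return None


# Three lines copied from the listing above.
sample = """\
ffffffffa03770f0 r __kcrctab_rdma_connect       [rdma_cm]
000000000ef3a1e8 a __crc_rdma_connect   [rdma_cm]
ffffffffa0375510 T rdma_connect [rdma_cm]
"""

print(hex(running_crc(sample, "rdma_connect")))  # prints 0xef3a1e8
```

On a live system the text would come from reading /proc/kallsyms instead
of the sample string.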
> 
> Which version is Lustre linked against?  Well, for that you need to
> find the ko2iblnd.ko file, and dump the __versions section.
> 
> # objdump -s -j __versions ko2iblnd.ko | less
> [...]
> 0670 00000000 00000000 00000000 00000000  ................
> 0680 e8a1f30e 00000000 72646d61 5f636f6e  ........rdma_con
> 0690 6e656374 00000000 00000000 00000000  nect............
> 06a0 00000000 00000000 00000000 00000000  ................
> 
> This display isn't as pretty, but you want to look in the hex dump
> just before the symbol name.  In this case, right before rdma_connect,
> you will see "e8a1f30e" ... which is our symbol version, 0x0ef3a1e8,
> stored in little-endian byte order!  So they match up, and everything
> works.
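
The byte-swapping can be done mechanically. The sketch below assumes each
__versions entry is 64 bytes on a 64-bit little-endian kernel of this
era: an 8-byte CRC followed by a 56-byte NUL-padded name, which is
consistent with the dump above. It also assumes you have already
extracted the raw section bytes by some means. parse_versions is a
hypothetical helper name.

```python
import struct

# Assumed entry layout: 8-byte little-endian CRC, then a 56-byte
# NUL-padded symbol name (64 bytes per entry).
ENTRY = struct.Struct("<Q56s")


def parse_versions(section_bytes):
    """Map symbol name -> CRC for raw __versions section contents."""
    # Ignore any trailing partial entry so iter_unpack does not raise.
    usable = len(section_bytes) - len(section_bytes) % ENTRY.size
    crcs = {}
    for crc, raw_name in ENTRY.iter_unpack(section_bytes[:usable]):
        name = raw_name.split(b"\0", 1)[0].decode("ascii", "replace")
        if name:
            crcs[name] = crc
    return crcs


# One entry rebuilt from the dump above: bytes e8 a1 f3 0e, zero padding,
# then the symbol name.
entry = bytes.fromhex("e8a1f30e00000000") + b"rdma_connect".ljust(56, b"\0")

print(hex(parse_versions(entry)["rdma_connect"]))  # prints 0xef3a1e8
```

For a single value, int.from_bytes(bytes.fromhex("e8a1f30e"), "little")
gives the same number.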
> 
> If you want to find out which symbol version is in a particular OFED module
> (in this case, we want to look at rdma_cm.ko), you can do this:
> 
> # nm ./kernel/drivers/infiniband/core/rdma_cm.ko | grep rdma_connect
> 00000000cd7aa3e6 A __crc_rdma_connect
> 
> Wrong version!  But we're ACTUALLY using the module located here:
> 
> # nm ./updates/kernel/drivers/infiniband/core/rdma_cm.ko | grep rdma_connect
> 000000000ef3a1e8 A __crc_rdma_connect
> 
> Which is the "correct" version.  But if you LINK against the first
> version, you'll get these errors when you try to load Lustre.  Note
> that my Module.symvers file for this kernel contains:
> 
> 0xcd7aa3e6      rdma_connect    drivers/infiniband/core/rdma_cm EXPORT_SYMBOL
> 
> Which is wrong!  In this case, you need to explicitly point Lustre at
> the OFED directory which contains the Module.symvers file.
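
Before rebuilding, it can help to check a candidate Module.symvers
against the CRC of the running kernel. This is a rough sketch, assuming
the whitespace-separated layout of the line quoted above (CRC, symbol,
module, export type). load_symvers and check_symvers are hypothetical
names.

```python
def load_symvers(text):
    """Map symbol -> (crc, module) from Module.symvers-style text."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: <0xCRC> <symbol> <module> <export type>
        if len(fields) >= 3 and fields[0].startswith("0x"):
            table[fields[1]] = (int(fields[0], 16), fields[2])
    return table


def check_symvers(symvers_text, symbol, running_crc):
    """True if this Module.symvers records the CRC the running kernel uses."""
    entry = load_symvers(symvers_text).get(symbol)
    return entry is not None and entry[0] == running_crc


# The line from the kernel's own Module.symvers quoted above.
kernel_symvers = (
    "0xcd7aa3e6\trdma_connect\tdrivers/infiniband/core/rdma_cm\tEXPORT_SYMBOL\n"
)

# 0x0ef3a1e8 is the __crc_rdma_connect value from /proc/kallsyms above.
print(check_symvers(kernel_symvers, "rdma_connect", 0x0EF3A1E8))  # prints False
```

A False result here corresponds to the second failure case: the build
was pointed at a Module.symvers that does not describe the OFED modules
actually loaded.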
> 
> (Can you tell I've beaten my head against the wall over this issue
> a WHOLE LOT? :-/)
> 
> --Ken



