[lustre-discuss] mlx5 errors on oss

Nehring, Shane R [LAS] snehring at iastate.edu
Thu May 18 08:06:21 PDT 2023


Hello all,

We recently added InfiniBand to our cluster and are in the process of testing it
with Lustre. We're running the distro-provided drivers for the Mellanox cards
with the latest firmware. Overnight we started seeing the following errors on a
few OSSes:

infiniband mlx5_0: dump_cqe:272:(pid 40058): dump error cqe
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 88 13 08 00 00 a0 00 63 4d d2
infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 88 13 08 00 00 a1 00 c2 8e d2
infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 88 13 08 00 00 a2 00 1a 12 d2
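
For anyone trying to read dumps like the ones above, here is a minimal sketch
of a decoder. It assumes the 64-byte layout of struct mlx5_err_cqe from the
Linux kernel header include/linux/mlx5/device.h (vendor_err_synd at byte 54,
syndrome at byte 55, then s_wqe_opcode_qpn, wqe_counter, signature, op_own).
The helper names and the syndrome table are illustrative and written from
memory of that header, so check them against your own kernel source before
relying on the result.

```python
import struct

# Syndrome names as recalled from the kernel's mlx5 headers (an assumption;
# verify against include/linux/mlx5/device.h for your kernel version).
SYNDROMES = {
    0x01: "LOCAL_LENGTH_ERR",
    0x02: "LOCAL_QP_OP_ERR",
    0x04: "LOCAL_PROT_ERR",
    0x05: "WR_FLUSH_ERR",
    0x06: "MW_BIND_ERR",
    0x10: "BAD_RESP_ERR",
    0x11: "LOCAL_ACCESS_ERR",
    0x12: "REMOTE_INVAL_REQ_ERR",
    0x13: "REMOTE_ACCESS_ERR",
    0x14: "REMOTE_OP_ERR",
    0x15: "TRANSPORT_RETRY_EXC_ERR",
    0x16: "RNR_RETRY_EXC_ERR",
    0x22: "REMOTE_ABORTED_ERR",
}


def parse_dump(lines):
    """Turn 'OFFSET: xx xx ...' hex-dump lines into raw bytes (hypothetical helper)."""
    data = bytearray()
    for line in lines:
        _, _, hexpart = line.partition(":")
        data += bytes.fromhex(hexpart.strip())
    return bytes(data)


def decode_err_cqe(cqe):
    """Decode the trailing fields of a 64-byte error CQE (assumed layout)."""
    if len(cqe) != 64:
        raise ValueError("expected a 64-byte CQE, got %d bytes" % len(cqe))
    # Big-endian: u8, u8, be32, be16, u8, u8 starting at byte offset 54.
    vendor, syndrome, opcode_qpn, wqe_counter, signature, op_own = \
        struct.unpack_from(">BBIHBB", cqe, 54)
    return {
        "vendor_err_synd": vendor,
        "syndrome": syndrome,
        "syndrome_name": SYNDROMES.get(syndrome, "unknown"),
        "wqe_opcode": opcode_qpn >> 24,   # top byte of s_wqe_opcode_qpn
        "qpn": opcode_qpn & 0xFFFFFF,     # low 24 bits
        "wqe_counter": wqe_counter,
        "signature": signature,
        "cqe_opcode": op_own >> 4,        # high nibble of op_own
    }


if __name__ == "__main__":
    first_dump = [
        "00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
        "00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
        "00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
        "00000030: 00 00 00 00 00 00 88 13 08 00 00 a0 00 63 4d d2",
    ]
    for key, value in decode_err_cqe(parse_dump(first_dump)).items():
        print(key, value if isinstance(value, str) else hex(value))
```

Under those assumptions, all three dumps carry syndrome byte 0x13 and vendor
syndrome 0x88, and differ only in the QP number (0xa0, 0xa1, 0xa2), the WQE
counter, and the signature byte.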

I found a post suggesting this might be IOMMU related, but disabling the IOMMU
doesn't seem to help.

We're running Lustre 2.15, more or less at the tip of b2_15
(b74560d74a9f890838dbf2f0719e3d27c1e5eaf8).

Has anyone seen this before or have any pointers?

Thanks

Shane