[lustre-discuss] mlx5 errors on oss

Kumar, Amit ahkumar at mail.smu.edu
Thu May 18 09:19:03 PDT 2023


I had a similar issue; it was apparently not a Lustre issue for us. In addition to the entries you see below, we also saw "AMD-Vi: Event ... IO_PAGE_FAULT" in the logs.

Setting iommu=pt helped us.
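
iommu=pt is a kernel boot parameter, so after a reboot it should appear in /proc/cmdline. As a minimal sketch for checking that across nodes (the function names are hypothetical, and this only inspects the command line string; it does not confirm what the kernel actually did with it):

```python
# Hypothetical helpers: parse a kernel command line for IOMMU-related options.
IOMMU_KEYS = ("iommu", "amd_iommu", "intel_iommu")


def iommu_settings(cmdline: str) -> dict:
    """Return the IOMMU-related key=value parameters found in cmdline."""
    found = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name in IOMMU_KEYS:
            found[name] = value  # a later duplicate overrides an earlier one
    return found


def uses_passthrough(cmdline: str) -> bool:
    """True if the command line carries iommu=pt."""
    return iommu_settings(cmdline).get("iommu") == "pt"
```

On a running node this could be called as uses_passthrough(open("/proc/cmdline").read()).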

Hope that helps. 

Thank you,
Amit

-----Original Message-----
From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> On Behalf Of Nehring, Shane R [LAS] via lustre-discuss
Sent: Thursday, May 18, 2023 10:06 AM
To: lustre-discuss at lists.lustre.org
Subject: [lustre-discuss] mlx5 errors on oss

Hello all,

We recently added InfiniBand to our cluster and are in the process of testing it with Lustre. We're running the distro-provided drivers for the Mellanox cards with the latest firmware. Overnight we started seeing the following errors on a few OSSes:

infiniband mlx5_0: dump_cqe:272:(pid 40058): dump error cqe
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 88 13 08 00 00 a0 00 63 4d d2
infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 88 13 08 00 00 a1 00 c2 8e d2
infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000030: 00 00 00 00 00 00 88 13 08 00 00 a2 00 1a 12 d2

I found a post suggesting this might be IOMMU related, but disabling the IOMMU doesn't seem to help.

We're running Lustre 2.15, more or less at the tip of b2_15
(b74560d74a9f890838dbf2f0719e3d27c1e5eaf8)

Has anyone seen this before or have any pointers?

Thanks

Shane

