[lustre-discuss] mlx5 errors on oss

Nehring, Shane R [LAS] snehring at iastate.edu
Thu May 18 11:24:26 PDT 2023


That was helpful, thank you.

In our case it looks like the cause was a client with iommu problems: we were
seeing identical AMD-Vi errors on that client. Once that client was rebooted,
the errors stopped on the servers. I've set iommu=off since we don't actually
need it enabled on these nodes.
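
For anyone wanting to check which iommu options a node actually booted with,
here is a rough sketch that reads the kernel command line. The function name
iommu_params is made up for illustration, and the set of option names it looks
for (iommu, amd_iommu, intel_iommu) is an assumption about what is relevant; it
only reports what was passed at boot, not whether the iommu is active.

```python
from pathlib import Path


def iommu_params(cmdline: str) -> dict[str, str]:
    """Return iommu-related key=value options found on a kernel command line.

    Hypothetical helper: the option names checked here are an assumption
    about which ones matter, not an exhaustive list.
    """
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in ("iommu", "amd_iommu", "intel_iommu"):
            found[key] = value
    return found


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        # Report the options the running kernel was booted with.
        print(iommu_params(proc_cmdline.read_text()) or "no iommu options set")
```

Running this across clients and servers is a quick way to confirm that a
setting such as iommu=off or iommu=pt took effect after a reboot.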


On Thu, 2023-05-18 at 16:19 +0000, Kumar, Amit wrote:
> I had a similar issue; it was apparently not a Lustre issue for us. In
> addition to the entries you see below, we also saw
> "AMD-Vi: Event ... IO_PAGE_FAULT" in the logs.
> 
> Setting iommu=pt helped us.
> 
> Hope that helps. 
> 
> Thank you,
> Amit
> 
> -----Original Message-----
> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> On Behalf Of
> Nehring, Shane R [LAS] via lustre-discuss
> Sent: Thursday, May 18, 2023 10:06 AM
> To: lustre-discuss at lists.lustre.org
> Subject: [lustre-discuss] mlx5 errors on oss
> 
> Hello all,
> 
> We recently added infiniband to our cluster and are in the process of testing
> it with lustre. We're running the distro provided drivers for the mellanox
> cards with the latest firmware. Overnight we started seeing the following
> errors on a few oss:
> 
> infiniband mlx5_0: dump_cqe:272:(pid 40058): dump error cqe
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000030: 00 00 00 00 00 00 88 13 08 00 00 a0 00 63 4d d2
> infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000030: 00 00 00 00 00 00 88 13 08 00 00 a1 00 c2 8e d2
> infiniband mlx5_0: dump_cqe:272:(pid 40057): dump error cqe
> 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 00000030: 00 00 00 00 00 00 88 13 08 00 00 a2 00 1a 12 d2
> 
> I found a post suggesting this might be iommu related, but disabling the
> iommu doesn't seem to help.
> 
> We're running Lustre 2.15, more or less at the tip of b2_15
> (b74560d74a9f890838dbf2f0719e3d27c1e5eaf8)
> 
> Has anyone seen this before or have any pointers?
> 
> Thanks
> 
> Shane
