[lustre-discuss] 'queue depth too large', but connection works
chris.horn at hpe.com
Thu Feb 3 08:17:17 PST 2022
No, it is not necessary to tune map_on_demand with modern NICs/MOFED drivers. The latest Lustre accepts only the values ‘0’ or ‘1’. A value of ‘0’ forces the use of global memory regions (when available), but the global MR API was removed (or deprecated?) by Mellanox. A recent change (LU-15186) made map_on_demand default to 1, so FMR/FastReg is used by default even if global MR is available. We expect global MR to be removed completely at some point.
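To see which values are actually in effect on a node, the module parameters can be read back. The following is a minimal sketch, assuming the ko2iblnd module is loaded and exposes its parameters under the usual Linux sysfs path /sys/module/ko2iblnd/parameters; the helper name read_module_params is hypothetical.

```python
from pathlib import Path

# Assumption: ko2iblnd exposes its parameters at the standard sysfs location.
PARAM_DIR = Path("/sys/module/ko2iblnd/parameters")


def read_module_params(names, param_dir=PARAM_DIR):
    """Return {name: value or None} for each requested module parameter."""
    values = {}
    for name in names:
        path = Path(param_dir) / name
        try:
            values[name] = path.read_text().strip()
        except OSError:
            # Module not loaded, parameter missing, or file not readable.
            values[name] = None
    return values


if __name__ == "__main__":
    params = read_module_params(["map_on_demand", "peer_credits"])
    for name, value in params.items():
        shown = value if value is not None else "(unavailable)"
        print(f"{name} = {shown}")
```

A value of None here only means the file could not be read; it says nothing about the compiled-in default.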
From: Thomas Roth <t.roth at gsi.de>
Date: Monday, January 31, 2022 at 5:31 AM
To: Horn, Chris <chris.horn at hpe.com>, lustre-discuss at lists.lustre.org <lustre-discuss at lists.lustre.org>
Subject: Re: [lustre-discuss] 'queue depth too large', but connection works
Digging a bit more into the ko2iblnd parameters, it seems the default for 'map_on_demand' comes out as '1' - both on mlx4 and mlx5 boxes.
I was reading about earlier issues with RDMA, which supposedly pushed the default to 256 - but that was perhaps too long ago.
Is it necessary to tune this parameter nowadays?
On 1/30/22 20:41, Horn, Chris wrote:
> Yes, this means the server has peer_credits=8, so it can only accept that value. It informs the client of this so the subsequent client connection attempt uses the lower value.
> From: lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on behalf of Thomas Roth <t.roth at gsi.de>
> Sent: Saturday, January 29, 2022 11:46 AM
> To: lustre-discuss at lists.lustre.org <lustre-discuss at lists.lustre.org>
> Subject: [lustre-discuss] 'queue depth too large', but connection works
> Dear all,
> test system: servers 2.12.7, and a client 2.12.6, all mlx4.
> The client has some non-default ko2iblnd parameters, including "peer_credits=16".
> I mounted my test system there and happily copied around some directories. Only afterwards I found
> > LNetError: 5278:0:(o2iblnd_cb.c:2551:kiblnd_passive_connect()) Can't accept conn from 10.20.3.64 at o2ib6, queue depth too large: 16 (<=8 wanted)
> in the MDS log.
> I did read LU-3322, but obviously did not get the point. "Can't accept conn" used to deny client access, but the MDS that didn't like my client just
> created some ~25k new objects on behalf of that client.
> Does this mean client and server negotiate a suitable value, but behind the scenes?
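
The behaviour described in this thread (server rejects a queue depth above its own peer_credits, reports the value it wants, and the client retries with that value) can be modelled roughly as follows. This is a simplified sketch and not the o2iblnd code; the function names passive_connect and connect and the retry limit are hypothetical.

```python
def passive_connect(requested_depth, server_peer_credits):
    """Server side: accept a depth <= own peer_credits, else reject.

    Returns (accepted, depth); on rejection, depth is the value the
    server wants, as in "queue depth too large: 16 (<=8 wanted)".
    """
    if requested_depth > server_peer_credits:
        return False, server_peer_credits
    return True, requested_depth


def connect(client_peer_credits, server_peer_credits, max_attempts=2):
    """Client side: retry with the server's wanted depth after a rejection."""
    depth = client_peer_credits
    for _ in range(max_attempts):
        accepted, value = passive_connect(depth, server_peer_credits)
        if accepted:
            return value
        depth = value  # lower to what the server reported
    raise ConnectionError("no acceptable queue depth found")


if __name__ == "__main__":
    # Client configured with 16, server with 8, as in the report above.
    print(connect(16, 8))
```

In this model the first attempt produces the error seen in the MDS log, and the second attempt succeeds with the lower value, which would explain why the mount worked despite the message.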