[lustre-discuss] ldlm.lock_limit_mb sizing
Kulyavtsev, Alex Ivanovich
alexku at anl.gov
Fri Jul 19 13:32:11 PDT 2024
Oleg, Cameron,
How can I see the count or list of requests in the queue (ungranted locks) and the request wait time?
Can you please point me to the parameter names to check first when troubleshooting, and which ones to monitor?
I'm looking at the parameters below, but I'm not sure about their meaning or the entry format:
ldlm.lock_granted_count
ldlm.services.ldlm_canceld.req_history
ldlm.services.ldlm_canceld.stats
ldlm.services.ldlm_canceld.timeouts
ldlm.services.ldlm_cbd.req_history
ldlm.services.ldlm_cbd.stats
ldlm.services.ldlm_cbd.timeouts
mdt.*.exports.*.ldlm_stats
obdfilter.*.exports.*.ldlm_stats
Is there anything worth looking at under `ldlm.namespaces`?
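For anyone wanting to poll these parameters from a script, here is a minimal Python sketch. It assumes `lctl get_param` is available on the server and that the `stats`/`ldlm_stats` entries use the usual Lustre stats line layout (`<counter> <count> samples [<unit>] <min> <max> <sum> ...`). That layout is an assumption to check against your own output. The helper names `get_param`, `parse_stats` and `report` are made up for this example.

```python
import subprocess

# Stats-style parameter names taken from the list in this message.
STATS_PARAMS = [
    "ldlm.services.ldlm_canceld.stats",
    "ldlm.services.ldlm_cbd.stats",
    "mdt.*.exports.*.ldlm_stats",
    "obdfilter.*.exports.*.ldlm_stats",
]


def get_param(name):
    """Return the raw text printed by `lctl get_param <name>`.

    Only works on a node where the Lustre utilities are installed.
    """
    result = subprocess.run(
        ["lctl", "get_param", name],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def parse_stats(text):
    """Parse '<counter> <count> samples [<unit>] <min> <max> <sum> ...' lines.

    ASSUMPTION: the usual Lustre stats layout. Lines that do not match
    (snapshot_time, 'param.name=' headers) are skipped. If the text covers
    several targets (wildcards), later counters overwrite earlier ones.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "samples" or not fields[1].isdigit():
            continue
        entry = {"count": int(fields[1]), "unit": fields[3].strip("[]")}
        if len(fields) >= 7:
            # min, max and sum are only present for counters that track values.
            entry["min"], entry["max"], entry["sum"] = (int(f) for f in fields[4:7])
        stats[fields[0]] = entry
    return stats


def report(params=STATS_PARAMS):
    """Print the parsed counters for each parameter (run this on a server)."""
    for name in params:
        print(name, parse_stats(get_param(name)))
```

Calling `report()` on an MDS or OSS prints one dictionary per parameter. Where a counter carries a sum, `sum / count` gives its mean. If the service stats on your version expose counters for queue depth and request wait time (something to verify, not confirmed in this thread), those would be the first values to watch.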
Best regards, Alex.
> On Jul 17, 2024, at 20:56, Oleg Drokin via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
>
> On Wed, 2024-07-17 at 12:58 -0700, Cameron Harr via lustre-discuss
> wrote:
> > In 2017, Oleg gave a talk at ORNL's Lustre conference about LDLM,
> > including references to ldlm.lock_limit_mb and
> > ldlm.lock_reclaim_threshold_mb.
> >
> > https://lustre.ornl.gov/ecosystem-2017/documents/Day-2_Tutorial-4_Drokin.pdf
>
> >
> > The apparent defaults back then in Lustre 2.8 for those two parameters
> > were 30MB and 20MB, respectively. On my 2.15 servers with 256GB and no
> > changes from us, I'm seeing numbers of 77244MB and 51496MB,
> > respectively. We recently got ourselves into a situation where a subset
> > of MDTs appeared to be entirely overwhelmed trying to cancel locks, with
> > ~500K locks in the request queue but a request wait time of 6000
> > seconds. So, we're looking at potentially limiting the locks on the
> > servers.
> >
> > What's the formula for appropriately sizing ldlm.lock_limit_mb and
> > ldlm.lock_reclaim_threshold_mb in 2.15 (I don't think node memory
> > amounts have increased 20000X in 7 years)?
>
> What do you mean by the "locks in the request queue"? If you mean your
> server has got that many ungranted locks, there's nothing you can
> really do here - that's how many outstanding client requests you've
> got.
>
> Sure, you can turn clients away, but it would probably be more productive
> to make sure your cancels are quicker.
>
> I think I've seen cases recently where servers got gummed up with creation
> requests stuck waiting on OSTs to create more objects while holding
> various DLM locks (so other threads that wanted to access those
> directories got stuck too). Meanwhile the OSTs were getting super slow
> because of an influx of (pretty expensive) destroy requests to delete
> objects from unlinked files.
> In the end, dropping requests in flight from MDTs to OSTs helped much
> more, by making sure the OSTs were doing their creates faster so the MDTs
> were blocking much less.
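On the sizing question quoted above, a quick arithmetic check may help. The two values reported for the 2.15 servers (77244MB and 51496MB) are in an exact 3:2 ratio. That would be consistent with the defaults being percentages of RAM (30% and 20%) rather than fixed megabyte values. This reading is an assumption and is not confirmed anywhere in this thread. The sketch below only shows that both numbers imply the same total memory under that assumption.

```python
# Values reported in the thread for a 2.15 server with nominally 256GB of RAM.
LOCK_LIMIT_MB = 77244
RECLAIM_THRESHOLD_MB = 51496

# ASSUMPTION: the "30" and "20" are percentages of RAM, not megabytes.
ASSUMED_LIMIT_PCT = 30
ASSUMED_RECLAIM_PCT = 20


def implied_total_ram_mb(value_mb, pct):
    """Total RAM (MB) that would produce value_mb if value_mb is pct% of RAM."""
    return value_mb * 100 / pct


ram_from_limit = implied_total_ram_mb(LOCK_LIMIT_MB, ASSUMED_LIMIT_PCT)
ram_from_reclaim = implied_total_ram_mb(RECLAIM_THRESHOLD_MB, ASSUMED_RECLAIM_PCT)

print("implied RAM from lock_limit_mb:            ", ram_from_limit, "MB")
print("implied RAM from lock_reclaim_threshold_mb:", ram_from_reclaim, "MB")
print("implied RAM in GiB:", round(ram_from_limit / 1024, 1))
```

Both parameters imply the same figure, 257480MB (about 251.4GiB), which is plausible for the memory the kernel sees on a 256GB node. If the percentage reading holds, the growth in the defaults reflects node memory rather than a change in fixed values. It is worth confirming this against the Lustre source or manual for your release before relying on it.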