[lustre-discuss] ldlm.lock_limit_mb sizing

Fri Jul 19 13:32:11 PDT 2024

Oleg, Cameron,
how can I look at the count or list of requests in the queue (ungranted locks) and the request wait time?

Could you please point me to the parameter names to check first when troubleshooting, and to monitor afterwards?
I’m looking at the parameters below, but I'm not sure about their meaning or entry format.

ldlm.lock_granted_count

ldlm.services.ldlm_canceld.req_history
ldlm.services.ldlm_canceld.stats
ldlm.services.ldlm_canceld.timeouts

ldlm.services.ldlm_cbd.req_history
ldlm.services.ldlm_cbd.stats
ldlm.services.ldlm_cbd.timeouts

mdt.*.exports.*.ldlm_stats
obdfilter.*.exports.*.ldlm_stats
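On the entry format: as far as I know, the `stats` files listed above print one counter per line in the form `name count samples [unit] min max sum sumsq`. Below is a minimal Python sketch for reading such text, assuming that layout. The sample values are made up for illustration, and `parse_stats` and `Counter` are hypothetical helper names, not part of any Lustre tool.

```python
import re
from typing import Dict, NamedTuple


class Counter(NamedTuple):
    samples: int
    unit: str
    minimum: int
    maximum: int
    total: int

    @property
    def mean(self) -> float:
        # Average per sample, in the counter's own unit.
        return self.total / self.samples if self.samples else 0.0


# name, sample count, "samples", [unit], min, max, sum.
# A trailing sum-of-squares field, if present, is ignored.
LINE = re.compile(r"^(\S+)\s+(\d+)\s+samples\s+\[([^\]]+)\]\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_stats(text: str) -> Dict[str, Counter]:
    counters = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m is None:
            continue  # e.g. snapshot_time, or counters printed without min/max/sum
        name, samples, unit, lo, hi, total = m.groups()
        counters[name] = Counter(int(samples), unit, int(lo), int(hi), int(total))
    return counters


# Made-up sample text for illustration; not captured from a real server.
SAMPLE = """\
snapshot_time             1721421131.000000000 secs.nsecs
req_waittime              4 samples [usec] 10 6000 8000 36040100
req_qdepth                4 samples [reqs] 0 3 6 14
"""

stats = parse_stats(SAMPLE)
for name, c in stats.items():
    print(f"{name}: n={c.samples} min={c.minimum} max={c.maximum} "
          f"mean={c.mean:.1f} {c.unit}")
```

In practice the text would come from something like `lctl get_param -n ldlm.services.ldlm_canceld.stats`. Which counters appear depends on the service and the Lustre version, so treat the two names in the sample as examples only.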

Is there anything to look at under `ldlm.namespaces`?
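One candidate under `ldlm.namespaces` is the per-namespace `lock_count` entry. Below is a rough Python sketch for totalling it, assuming the usual `name=value` lines that `lctl get_param` prints. The namespace names and numbers are invented, `parse_params` is a hypothetical helper, and whether `lock_count` includes ungranted locks is something I have not verified.

```python
from typing import Dict


def parse_params(text: str) -> Dict[str, int]:
    """Parse 'name=value' lines, keeping only entries with integer values."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        value = value.strip()
        if sep and value.isdigit():
            params[name] = int(value)
    return params


# Invented sample; real input would come from something like
#   lctl get_param ldlm.namespaces.*.lock_count
SAMPLE = """\
ldlm.namespaces.mdt-testfs-MDT0000_UUID.lock_count=1500
ldlm.namespaces.mdt-testfs-MDT0001_UUID.lock_count=250
"""

counts = parse_params(SAMPLE)
total_locks = sum(counts.values())
busiest = max(counts, key=counts.get)  # namespace holding the most locks

print(f"total locks: {total_locks}")
print(f"busiest namespace: {busiest} ({counts[busiest]} locks)")
```

Sampling this periodically and comparing namespaces might show whether the lock load is concentrated on a subset of MDTs.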

Best regards, Alex.

> On Jul 17, 2024, at 20:56, Oleg Drokin via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
> 
> On Wed, 2024-07-17 at 12:58 -0700, Cameron Harr via lustre-discuss
> wrote:
> > In 2017, Oleg gave a talk at ORNL's Lustre conference about LDLM,
> > including references to ldlm.lock_limit_mb and
> > ldlm.lock_reclaim_threshold_mb.
> >
> > https://lustre.ornl.gov/ecosystem-2017/documents/Day-2_Tutorial-4_Drokin.pdf
> >
> > The apparent defaults back then in Lustre 2.8 for those two parameters
> > were 30MB and 20MB, respectively. On my 2.15 servers with 256GB and no
> > changes from us, I'm seeing numbers of 77244MB and 51496MB,
> > respectively. We recently got ourselves into a situation where a subset
> > of MDTs appeared to be entirely overwhelmed trying to cancel locks, with
> > ~500K locks in the request queue but a request wait time of 6000
> > seconds. So, we're looking at potentially limiting the locks on the
> > servers.
> > 
> > What's the formula for appropriately sizing ldlm.lock_limit_mb and
> > ldlm.lock_reclaim_threshold_mb in 2.15 (I don't think node memory
> > amounts have increased 20000X in 7 years)?
> 
> What do you mean by the "locks in the request queue"? If you mean your
> server has got that many ungranted locks, there's nothing you can
> really do here - that's how many outstanding client requests you've
> got.
> 
> Sure, you can turn clients away, but it would probably be more productive
> to make sure your cancels are quicker.
> 
> I think I've seen cases recently where servers were gummed up with
> creation requests stuck waiting on OSTs to create more objects while
> holding various DLM locks (so other threads that wanted to access those
> directories got stuck too), and the OSTs were getting super slow because
> of an influx of (pretty expensive) destroy requests to delete objects
> from unlinked files.
> In the end, dropping requests in flight from MDTs to OSTs helped much
> more, by making sure OSTs were doing their creates faster so MDTs were
> blocking much less.
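
On the sizing question in Cameron's message, the figures quoted there can be checked with a little arithmetic. The Python sketch below only restates numbers from the thread; the one assumption is that "256GB" means 256 GiB (262144 MB). It does not establish the formula that 2.15 actually uses.

```python
# Values quoted in the thread, in MB.
observed_mb = {"lock_limit_mb": 77244, "lock_reclaim_threshold_mb": 51496}
old_mb = {"lock_limit_mb": 30, "lock_reclaim_threshold_mb": 20}

# Assumption: "256GB" of node memory means 256 GiB.
node_mb = 256 * 1024

# How much larger the 2.15 values are than the reported 2.8 values.
growth = {name: observed_mb[name] / old_mb[name] for name in observed_mb}

# Each observed value as a share of node memory.
share = {name: observed_mb[name] / node_mb for name in observed_mb}

# Reclaim threshold relative to the limit.
threshold_to_limit = (
    observed_mb["lock_reclaim_threshold_mb"] / observed_mb["lock_limit_mb"]
)

for name in observed_mb:
    print(f"{name}: {growth[name]:.1f}x the 2.8 figure, "
          f"{share[name]:.1%} of node memory")
print(f"threshold / limit = {threshold_to_limit:.3f}")
```

Both parameters grew by the same factor (about 2575x, not 20000x), and they come out near 30% and 20% of node memory with a threshold-to-limit ratio of 2/3. That would be consistent with defaults expressed as a fraction of node memory rather than as fixed megabyte values, but that reading is an assumption, not something the thread confirms.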