[lustre-discuss] ldlm.lock_limit_mb sizing
Cameron Harr
harr1 at llnl.gov
Thu Jul 25 13:18:19 PDT 2024
Sorry for the late reply.
The way I was looking at the information was through `llstat -i5 ldlm.services.ldlm_canceld.stats`.
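For anyone without llstat handy, roughly the following should show the same
counters (assuming the req_qdepth/req_waittime names match what llstat reports):

    # poll the lock-cancel service stats every 5s and pull out queue depth and wait time
    watch -n 5 "lctl get_param ldlm.services.ldlm_canceld.stats | grep -E 'req_qdepth|req_waittime'"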
I don't have a copy of the data in the state it was in while it was
"overwhelmed", but on multiple MDT nodes (which were pinned at near 100%
CPU for hours with I/O stopped), I could see req_waittime (reported in
usec) exceed 6000 *seconds* and req_qdepth (what I called locks in the
request queue) sit around the aforementioned 500K. If we restarted Lustre
on one of these MDS nodes, req_qdepth would steadily climb back up, along
with CPU load, and those symptoms would follow to the peer server if we
failed over the MDT.
What is the correct parameter to modify requests in flight from the MDT
to the OSTs? I didn't find one that looked appropriate.
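Is it something like max_rpcs_in_flight on the MDS-side OSP devices, e.g.
(that parameter name is my guess, please correct me):

    # list the per-OST OSP devices on an MDS and their current RPC-in-flight limit
    lctl get_param osp.*.max_rpcs_in_flight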
Thanks,
Cameron
On 7/19/24 1:32 PM, Kulyavtsev, Alex Ivanovich wrote:
> Oleg, Cameron,
> How do I look at the counts / list of queued (ungranted) lock requests, and the request wait time?
>
> Can you please point to the parameter names to check first for troubleshooting and monitoring?
> I'm looking at the parameters below but am not sure about their meaning or entry format.
>
> ldlm.lock_granted_count
>
> ldlm.services.ldlm_canceld.req_history
> ldlm.services.ldlm_canceld.stats
> ldlm.services.ldlm_canceld.timeouts
>
> ldlm.services.ldlm_cbd.req_history
> ldlm.services.ldlm_cbd.stats
> ldlm.services.ldlm_cbd.timeouts
>
> mdt.*.exports.*.ldlm_stats
> obdfilter.*.exports.*.ldlm_stats
>
> Anything to look at under `ldlm.namespaces`?
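> For example, is something like this the right way to get per-namespace lock counts (I'm guessing at the entry names)?
>
>     lctl get_param ldlm.namespaces.*.lock_count
>     lctl get_param ldlm.namespaces.*.pool.granted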
>
> Best regards, Alex.
>
>> On Jul 17, 2024, at 20:56, Oleg Drokin via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
>>
>> On Wed, 2024-07-17 at 12:58 -0700, Cameron Harr via lustre-discuss
>> wrote:
>>> In 2017, Oleg gave a talk at ORNL's Lustre conference about LDLM,
>>> including references to ldlm.lock_limit_mb and
>>> ldlm.lock_reclaim_threshold_mb.
>>>
>>> https://lustre.ornl.gov/ecosystem-2017/documents/Day-2_Tutorial-4_Drokin.pdf
>>
>>> The apparent defaults back then in Lustre 2.8 for those two parameters
>>> were 30MB and 20MB, respectively. On my 2.15 servers with 256GB and no
>>> changes from us, I'm seeing numbers of 77244MB and 51496MB,
>>> respectively. We recently got ourselves into a situation where a
>>> subset of MDTs appeared to be entirely overwhelmed trying to cancel
>>> locks, with ~500K locks in the request queue but a request wait time
>>> of 6000 seconds. So, we're looking at potentially limiting the locks
>>> on the servers.
>>>
>>> What's the formula for appropriately sizing ldlm.lock_limit_mb and
>>> ldlm.lock_reclaim_threshold_mb in 2.15 (I don't think node memory
>>> amounts have increased 20000X in 7 years)?
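>>>
>>> For reference, I'm reading (and would set) them with plain lctl along
>>> these lines; the set_param values are placeholders, not a recommendation:
>>>
>>>     lctl get_param ldlm.lock_limit_mb ldlm.lock_reclaim_threshold_mb
>>>     # example only: cap the reclaim threshold and hard limit well below the autosized values
>>>     lctl set_param ldlm.lock_reclaim_threshold_mb=20480 ldlm.lock_limit_mb=30720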
>> What do you mean by the "locks in the request queue"? If you mean your
>> server has got that many ungranted locks, there's nothing you can
>> really do here - that's how many outstanding client requests you've
>> got.
>>
>> Sure, you can turn clients away, but it would probably be more
>> productive to make sure your cancels are quicker?
>>
>> I think I've seen cases recently where servers were gummed up with
>> creation requests stuck waiting on OSTs to create more objects while
>> holding various DLM locks (so other threads that wanted to access those
>> directories got stuck too), while the OSTs were getting super slow
>> because of an influx of (pretty expensive) destroy requests to delete
>> objects from unlinked files.
>> In the end, reducing the number of requests in flight from MDTs to OSTs
>> helped much more, by letting the OSTs do their creates faster so the
>> MDTs blocked much less.
>> _______________________________________________
>> lustre-discuss mailing list
>>
>> lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org