[lustre-discuss] kernel threads for rpcs in flight
Anna Fuchs
anna.fuchs@uni-hamburg.de
Thu May 2 17:10:30 PDT 2024
> The number of ptlrpc threads per CPT is set by the
> "ptlrpcd_partner_group_size" module parameter, and defaults to 2
> threads per CPT, IIRC. I don't think that clients dynamically
> start/stop ptlrpcd threads at runtime.
> When there are RPCs in the queue for any ptlrpcd it will be woken up
> and scheduled by the kernel, so it will compete with the application
> threads. IIRC, if a ptlrpcd thread is woken up and there are no RPCs
> in the local CPT queue it will try to steal RPCs from another CPT on
> the assumption that the local CPU is not generating any RPCs so it
> would be beneficial to offload threads on another CPU that *is*
> generating RPCs. If the application thread is extremely CPU hungry,
> then the kernel will not schedule the ptlrpcd threads on those cores
> very often, and the "idle" core ptlrpcd threads will be able to run
> more frequently.
Sorry, maybe I am confusing things. I am still not sure how many threads
I get.
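(Counting the ptlrpcd kernel threads that exist on the client is easy
enough, e.g.
# ps -e -o comm | grep -c ptlrpcd
but I don't understand what determines that number or which of them
actually do work.)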
For example, I have a 32-core AMD Epyc machine as a client and I am
running a serial-stream I/O application with a stripe count of 1, i.e. a
single OST.
I am struggling to find out how many CPU partitions I have - is that
determined by the hardware or is it configurable?
There is no file /proc/sys/lnet/cpu_partitions on my client.
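Guessing at names here: I also looked for the CPT table in debugfs and
for the libcfs module options, in case the file just moved, but I am not
sure these are the right places to check:
# cat /sys/kernel/debug/lnet/cpu_partition_table
# cat /sys/module/libcfs/parameters/cpu_npartitions
# cat /sys/module/libcfs/parameters/cpu_pattern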
Assuming I had 2 CPU partitions, that would result in 4 ptlrpcd threads
at system start, right? Now if I set rpcs_in_flight to 1 or to 8, what
effect does that have on the number and the activity of the threads?
With a serial stream and rpcs_in_flight = 1, is only one ptlrpcd thread
woken up while the other 3 remain inactive/sleep/do nothing?
That does not seem to be the case: I've applied the rpctracing (thanks a
lot for the hint!!), and with rpcs_in_flight = 1 the log still shows at
least 3 different threads from at least 2 different partitions when
writing a 1 MB file with ten blocks.
I don't get the relationship between these values.
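For reference, this is roughly how I set the value and how I count the
distinct ptlrpcd threads in the rpctrace log (the thread-name pattern in
the grep is only my guess, and the file size is just my small test):
# lctl set_param osc.*.max_rpcs_in_flight=1
# lctl set_param debug=+rpctrace
# lctl clear
# dd if=/dev/zero of=/mnt/testfs/file bs=100K count=10 conv=fsync
# lctl dk /tmp/debug
# grep "Sending RPC" /tmp/debug | grep -oE "ptlrpcd_[0-9]+_[0-9]+" | sort | uniq -c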
And if I had compression or any other heavy load, which settings would
let me clearly control how many resources I give Lustre for that load?
I can see clear scaling with higher rpcs_in_flight, but I am struggling
to understand the numbers and to attribute them to a specific setting.
The uncompressed case already benefits a bit from a higher RPC count due
to multiple "substreaming", but there must be much more happening in
parallel behind the scenes in the compressed case, even with
rpcs_in_flight = 1.
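In case it matters, the scaling numbers come from simple runs like the
following (file name and sizes are just my test values):
# for n in 1 2 4 8; do lctl set_param osc.*.max_rpcs_in_flight=$n; dd if=/dev/zero of=/mnt/testfs/file_$n bs=1M count=1024 conv=fsync; done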
Thank you!
Anna
>
> Whether this behavior is optimal or not is subject to debate, and
> investigation/improvements are of course welcome. Definitely, data
> checksums have some overhead (a few percent), and client-side data
> compression (which is done by ptlrpcd threads) would have a
> significant usage of CPU cycles, but given the large number of CPU
> cores on client nodes these days this may still provide a net
> performance benefit if the IO bottleneck is on the server.
>
>>>> With max_rpcs_in_flight = 1, multiple cores are loaded,
>>>> presumably alternately, but the statistics are too inaccurate to
>>>> capture this. The distribution of threads to cores is regulated by
>>>> the Linux kernel, right? Does anyone have experience with what
>>>> happens when all CPUs are under full load with the application or
>>>> something else?
>>>
>>> Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target*
>>> parameter, so a single client can still have tens or hundreds of
>>> RPCs in flight to different servers. The client will send many RPC
>>> types directly from the process context, since they are waiting on
>>> the result anyway. For asynchronous bulk RPCs, the ptlrpcd thread
>>> will try to process the bulk IO on the same CPT (= Lustre CPU
>>> Partition Table, roughly aligned to NUMA nodes) as the userspace
>>> application was running when the request was created. This
>>> minimizes the cross-NUMA traffic when accessing pages for bulk RPCs,
>>> so long as those cores are not busy with userspace tasks.
>>> Otherwise, the ptlrpcd thread on another CPT will steal RPCs from
>>> the queues.
>>>
>>>> Do the Lustre threads suffer? Is there a prioritization of the
>>>> Lustre threads over other tasks?
>>>
>>> Are you asking about the client or the server? Many of the client
>>> RPCs are generated by the client threads, but the running
>>> ptlrpcd threads do not have a higher priority than client
>>> application threads. If the application threads are running on some
>>> cores, but other cores are idle, then the ptlrpcd threads on other
>>> cores will try to process the RPCs to allow the application threads
>>> to continue running there. Otherwise, if all cores are busy (as is
>>> typical for HPC applications) then they will be scheduled by the
>>> kernel as needed.
>>>
>>>> Are there readily available statistics or tools for this scenario?
>>>
>>> What statistics are you looking for? There are "{osc,mdc}.*.stats"
>>> and "{osc,mdc}.*rpc_stats" that have aggregate information about RPC
>>> counts and latency.
>>
>> Oh, right, these tell a lot. Isn't there also something to log the
>> utilization and location of these threads? Otherwise, I'll continue
>> trying with perf, which seems to be more complex with kernel threads.
>
> There are kernel debug logs available when "lctl set_param
> debug=+rpctrace" is enabled, that will show which ptlrpcd or
> application thread is handling each RPC, and on which core it was run
> on. These can be found on the client by searching for "Sending
> RPC|Completed RPC" in the debug logs, for example:
>
> # lctl set_param debug=+rpctrace
> # lctl set_param jobid_var=procname_uid
> # cp -a /etc /mnt/testfs
> # lctl dk /tmp/debug
> # grep -E "Sending RPC|Completed RPC" /tmp/debug
> :
> :
> 00000100:00100000:2.0:1714502851.435000:0:23892:0:(client.c:1758:ptlrpc_send_new_req())
> Sending RPC req@ffff90c9b2948640 pname:cluuid:pid:xid:nid:opc:job
> ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634353438336:0@lo:2:cp.0
> 00000100:00100000:2.0:1714502851.436117:0:23892:0:(client.c:2239:ptlrpc_check_set())
> Completed RPC req@ffff90c9b2948640 pname:cluuid:pid:xid:nid:opc:job
> ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634353438336:0@lo:2:cp.0
>
> Shows that thread "ptlrpcd_01_00" (CPT 01, thread 00, pid 23892) was
> run on core 2.0 (no hyperthread) and sent an OST_SETATTR (opc = 2)
> RPC on behalf of "cp" for root (uid=0), which completed in 1117 usec.
>
> Similarly, with a "dd" sync write workload it shows write RPCs by the
> ptlrpcd threads, and sync RPCs in the "dd" process context:
> # dd if=/dev/zero of=/mnt/testfs/file bs=4k count=10000 oflag=dsync
> # lctl dk /tmp/debug
> # grep -E "Sending RPC|Completed RPC" /tmp/debug
> :
> :
> 00000100:00100000:2.0:1714503761.136971:0:23892:0:(client.c:1758:ptlrpc_send_new_req())
> Sending RPC req@ffff90c9a6ad6640 pname:cluuid:pid:xid:nid:opc:job
> ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634358961024:0@lo:4:dd.0
> 00000100:00100000:2.0:1714503761.140288:0:23892:0:(client.c:2239:ptlrpc_check_set())
> Completed RPC req@ffff90c9a6ad6640 pname:cluuid:pid:xid:nid:opc:job
> ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634358961024:0@lo:4:dd.0
> 00000100:00100000:2.0:1714503761.140518:0:17993:0:(client.c:1758:ptlrpc_send_new_req())
> Sending RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
> dd:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:17993:1797634358961088:0@lo:44:dd.0
> 00000100:00100000:2.0:1714503761.141556:0:17993:0:(client.c:2239:ptlrpc_check_set())
> Completed RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
> dd:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:17993:1797634358961088:0@lo:44:dd.0
> 00000100:00100000:2.0:1714503761.141885:0:23893:0:(client.c:1758:ptlrpc_send_new_req())
> Sending RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
> ptlrpcd_01_01:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23893:1797634358961152:0@lo:16:dd.0
> 00000100:00100000:2.0:1714503761.144172:0:23893:0:(client.c:2239:ptlrpc_check_set())
> Completed RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
> ptlrpcd_01_01:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23893:1797634358961152:0@lo:16:dd.0
>
> There are no stats files that aggregate information about ptlrpcd
> thread utilization.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud