[lustre-discuss] kernel threads for rpcs in flight
Anna Fuchs
anna.fuchs@uni-hamburg.de
Thu May 2 17:10:30 PDT 2024
> The number of ptlrpc threads per CPT is set by the
> "ptlrpcd_partner_group_size" module parameter, and defaults to 2
> threads per CPT, IIRC. I don't think that clients dynamically
> start/stop ptlrpcd threads at runtime.
> When there are RPCs in the queue for any ptlrpcd it will be woken up
> and scheduled by the kernel, so it will compete with the application
> threads. IIRC, if a ptlrpcd thread is woken up and there are no RPCs
> in the local CPT queue it will try to steal RPCs from another CPT on
> the assumption that the local CPU is not generating any RPCs so it
> would be beneficial to offload threads on another CPU that *is*
> generating RPCs. If the application thread is extremely CPU hungry,
> then the kernel will not schedule the ptlrpcd threads on those cores
> very often, and the "idle" core ptlrpcd threads will be able to run
> more frequently.
Sorry, maybe I am confusing things. I am still not sure how many threads
I get.
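(Counting the ptlrpcd kernel threads that exist on the client is easy
enough, e.g.
# ps -e -o comm | grep -c ptlrpcd
but I don't understand what determines that number or which of them
actually do work.)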
For example, I have a 32-core AMD Epyc machine as a client and I am
running a serial-stream I/O application with a stripe count of 1, i.e. a
single OST.
I am struggling to find out how many CPU partitions I have - is that
determined by the hardware or is it configurable?
There is no file /proc/sys/lnet/cpu_partitions on my client.
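Guessing at names here: I also looked for the CPT table in debugfs and
for the libcfs module options, in case the file just moved, but I am not
sure these are the right places to check:
# cat /sys/kernel/debug/lnet/cpu_partition_table
# cat /sys/module/libcfs/parameters/cpu_npartitions
# cat /sys/module/libcfs/parameters/cpu_pattern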
Assuming I had 2 CPU partitions, that would result in 4 ptlrpcd threads
at system start, right? Now if I set rpcs_in_flight to 1 or to 8, what
effect does that have on the number and the activity of the threads?
With a serial stream and rpcs_in_flight = 1, is only one ptlrpcd thread
woken up while the other 3 remain inactive/sleep/do nothing?
That does not seem to be the case: I've applied the rpctracing (thanks a
lot for the hint!!), and with rpcs_in_flight = 1 the log still shows at
least 3 different threads from at least 2 different partitions when
writing a 1 MB file with ten blocks.
I don't get the relationship between these values.
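For reference, this is roughly how I set the value and how I count the
distinct ptlrpcd threads in the rpctrace log (the thread-name pattern in
the grep is only my guess, and the file size is just my small test):
# lctl set_param osc.*.max_rpcs_in_flight=1
# lctl set_param debug=+rpctrace
# lctl clear
# dd if=/dev/zero of=/mnt/testfs/file bs=100K count=10 conv=fsync
# lctl dk /tmp/debug
# grep "Sending RPC" /tmp/debug | grep -oE "ptlrpcd_[0-9]+_[0-9]+" | sort | uniq -c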
And if I had compression or any other heavy load, which settings would
let me clearly control how many resources I give Lustre for that load?
I can see clear scaling with higher rpcs_in_flight, but I am struggling
to understand the numbers and to attribute them to a specific setting.
The uncompressed case already benefits a bit from a higher RPC count due
to multiple "substreaming", but there must be much more happening in
parallel behind the scenes in the compressed case, even with
rpcs_in_flight = 1.
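In case it matters, the scaling numbers come from simple runs like the
following (file name and sizes are just my test values):
# for n in 1 2 4 8; do lctl set_param osc.*.max_rpcs_in_flight=$n; dd if=/dev/zero of=/mnt/testfs/file_$n bs=1M count=1024 conv=fsync; done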
Thank you!
Anna
>
> Whether this behavior is optimal or not is subject to debate, and
> investigation/improvements are of course welcome. Definitely, data
> checksums have some overhead (a few percent), and client-side data
> compression (which is done by ptlrpcd threads) would have a
> significant usage of CPU cycles, but given the large number of CPU
> cores on client nodes these days this may still provide a net
> performance benefit if the IO bottleneck is on the server.
>
>>>> With max_rpcs_in_flight = 1, multiple cores are loaded,
>>>> presumably alternately, but the statistics are too inaccurate to
>>>> capture this. The distribution of threads to cores is regulated by
>>>> the Linux kernel, right? Does anyone have experience with what
>>>> happens when all CPUs are under full load with the application or
>>>> something else?
>>>
>>> Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target*
>>> parameter, so a single client can still have tens or hundreds of
>>> RPCs in flight to different servers. The client will send many RPC
>>> types directly from the process context, since they are waiting on
>>> the result anyway. For asynchronous bulk RPCs, the ptlrpcd thread
>>> will try to process the bulk IO on the same CPT (= Lustre CPU
>>> Partition Table, roughly aligned to NUMA nodes) as the userspace
>>> application was running when the request was created. This
>>> minimizes the cross-NUMA traffic when accessing pages for bulk RPCs,
>>> so long as those cores are not busy with userspace tasks.
>>> Otherwise, the ptlrpcd thread on another CPT will steal RPCs from
>>> the queues.
>>>
>>>> Do the Lustre threads suffer? Is there a prioritization of the
>>>> Lustre threads over other tasks?
>>>
>>> Are you asking about the client or the server? Many of the client
>>> RPCs are generated by the client threads, but the running
>>> ptlrpcd threads do not have a higher priority than client
>>> application threads. If the application threads are running on some
>>> cores, but other cores are idle, then the ptlrpcd threads on other
>>> cores will try to process the RPCs to allow the application threads
>>> to continue running there. Otherwise, if all cores are busy (as is
>>> typical for HPC applications) then they will be scheduled by the
>>> kernel as needed.
>>>
>>>> Are there readily available statistics or tools for this scenario?
>>>
>>> What statistics are you looking for? There are "{osc,mdc}.*.stats"
>>> and "{osc,mdc}.*rpc_stats" that have aggregate information about RPC
>>> counts and latency.
>>
>> Oh, right, these tell a lot. Isn't there also something to log the
>> utilization and location of these threads? Otherwise, I'll continue
>> trying with perf, which seems to be more complex with kernel threads.
>
> There are kernel debug logs available when "lctl set_param
> debug=+rpctrace" is enabled, that will show which ptlrpcd or
> application thread is handling each RPC, and on which core it was run
> on. These can be found on the client by searching for "Sending
> RPC|Completed RPC" in the debug logs, for example:
>
> # lctl set_param debug=+rpctrace
> # lctl set_param jobid_var=procname_uid
> # cp -a /etc /mnt/testfs
> # lctl dk /tmp/debug
> # grep -E "Sending RPC|Completed RPC" /tmp/debug
> :
> :
> 00000100:00100000:2.0:1714502851.435000:0:23892:0:(client.c:1758:ptlrpc_send_new_req())
> Sending RPC req@ffff90c9b2948640 pname:cluuid:pid:xid:nid:opc:job
> ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634353438336:0@lo:2:cp.0
> 00000100:00100000:2.0:1714502851.436117:0:23892:0:(client.c:2239:ptlrpc_check_set())
> Completed RPC req@ffff90c9b2948640 pname:cluuid:pid:xid:nid:opc:job
> ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634353438336:0@lo:2:cp.0
>
> Shows that thread "ptlrpcd_01_00" (CPT 01, thread 00, pid 23892) was
> run on core 2.0 (no hyperthread) and sent an OST_SETATTR (opc = 2)
> RPC on behalf of "cp" for root (uid=0), which completed in 1117 usec.
>
> Similarly, with a "dd" sync write workload it shows write RPCs by the
> ptlrpcd threads, and sync RPCs in the "dd" process context:
> # dd if=/dev/zero of=/mnt/testfs/file bs=4k count=10000 oflag=dsync
> # lctl dk /tmp/debug
> # grep -E "Sending RPC|Completed RPC" /tmp/debug
> :
> :
> 00000100:00100000:2.0:1714503761.136971:0:23892:0:(client.c:1758:ptlrpc_send_new_req())
> Sending RPC req@ffff90c9a6ad6640 pname:cluuid:pid:xid:nid:opc:job
> ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634358961024:0@lo:4:dd.0
> 00000100:00100000:2.0:1714503761.140288:0:23892:0:(client.c:2239:ptlrpc_check_set())
> Completed RPC req@ffff90c9a6ad6640 pname:cluuid:pid:xid:nid:opc:job
> ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634358961024:0@lo:4:dd.0
> 00000100:00100000:2.0:1714503761.140518:0:17993:0:(client.c:1758:ptlrpc_send_new_req())
> Sending RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
> dd:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:17993:1797634358961088:0@lo:44:dd.0
> 00000100:00100000:2.0:1714503761.141556:0:17993:0:(client.c:2239:ptlrpc_check_set())
> Completed RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
> dd:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:17993:1797634358961088:0@lo:44:dd.0
> 00000100:00100000:2.0:1714503761.141885:0:23893:0:(client.c:1758:ptlrpc_send_new_req())
> Sending RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
> ptlrpcd_01_01:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23893:1797634358961152:0@lo:16:dd.0
> 00000100:00100000:2.0:1714503761.144172:0:23893:0:(client.c:2239:ptlrpc_check_set())
> Completed RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
> ptlrpcd_01_01:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23893:1797634358961152:0@lo:16:dd.0
>
> There are no stats files that aggregate information about ptlrpcd
> thread utilization.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Whamcloud