[lustre-discuss] [EXTERNAL] good ways to identify clients causing problems?
mohrrf at ornl.gov
Fri May 28 14:20:52 PDT 2021
One option I have used in the past is to look at the RPC request history. For example, on an OSS, you can run:
lctl get_param ost.OSS.ost_io.req_history
and then extract the client NID for each request. From that, you can count the requests each client is sending to the server and look for any clients whose counts are significantly higher than the others. Maybe something like:
lctl get_param ost.OSS.ost_io.req_history | cut -d: -f3 | sort | uniq -c | sort -n
I have used that approach in the past to identify misbehaving clients (the number of requests from such clients was usually one or two orders of magnitude higher than from the others). If multiple clients are unusually high, you may be able to correlate those nodes with currently running jobs to identify a particular job (assuming you don't already have Lustre jobstats enabled).
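If you want to automate the "look for outliers" step, a rough sketch along these lines might help. It assumes, as the cut pipeline above does, that the client NID is the third colon-separated field of each history line. The function names and the factor-of-median threshold are hypothetical, just one way to flag clients that are roughly an order of magnitude busier than the rest:

```shell
#!/bin/sh
# Sketch only: count requests per client and flag heavy hitters.
# Assumption: input lines are colon-separated with the client NID in
# field 3, matching the cut -f3 pipeline above.

count_by_client() {
    # -s drops lines that contain no ':' (for example the
    # "ost.OSS.ost_io.req_history=" header line printed by lctl).
    cut -s -d: -f3 | sort | uniq -c | sort -n
}

flag_outliers() {
    # $1 = factor (hypothetical threshold, default 10).
    # Reads "count client" lines sorted in ascending order of count and
    # prints the clients whose count is at least factor * median.
    awk -v factor="${1:-10}" '
        { count[NR] = $1; client[NR] = $2 }
        END {
            if (NR == 0) exit
            median = count[int((NR + 1) / 2)]
            for (i = 1; i <= NR; i++)
                if (count[i] >= factor * median)
                    print client[i], count[i]
        }'
}

# Usage on an OSS (requires lctl):
#   lctl get_param ost.OSS.ost_io.req_history | count_by_client | flag_outliers 10
```

Comparing against the median rather than the mean keeps one very noisy client from dragging the baseline up. The right factor depends on your workload, so treat 10 as a starting point.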
On 5/4/21, 2:41 PM, "lustre-discuss on behalf of Bill Anderson via lustre-discuss" <lustre-discuss-bounces at lists.lustre.org on behalf of lustre-discuss at lists.lustre.org> wrote:
Can you recommend good ways to identify Lustre client hosts that might be causing stability or performance problems for the entire filesystem?
For example, if a user is inadvertently doing something that's creating an RPC storm, what are good ways to identify the client host that has triggered the storm?