[lustre-discuss] good ways to identify clients causing problems?

Tue May 4 12:47:57 PDT 2021

    Thank you so much!

On Tue, May 4, 2021 at 1:31 PM Andreas Dilger <adilger at whamcloud.com> wrote:

> On May 4, 2021, at 12:41, Bill Anderson via lustre-discuss <
> lustre-discuss at lists.lustre.org> wrote:
>
>
>    Hi All,
>
>    Can you recommend good ways to identify Lustre client hosts that might
> be causing stability or performance problems for the entire filesystem?
>
>    For example, if a user is inadvertently doing something that's creating
> an RPC storm, what are good ways to identify the client host that has
> triggered the storm?
>
>
> If you have JobID enabled on the clients (which can be done even if they
> are not batch scheduled, like "procname_uid" for login nodes), then you can
> watch "lctl get_param *.*.job_stats | grep -v ' 0, unit:'" (to filter out
> unused stats) to see if there are *jobs* which put a high RPC load on that
> server.
>
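
To rank jobs rather than eyeball the grep output, the job_stats text can be
summed per job_id. This is only a minimal sketch: the function name and the
SAMPLE text are made up for illustration, the sample follows the usual
YAML-like job_stats layout (which may differ between Lustre versions), and
"samples" is treated as a rough proxy for RPC count.

```python
import re
from collections import defaultdict

# Matches "- job_id:          bash.1000"
JOB_RE = re.compile(r"^-\s+job_id:\s+(\S+)")
# Matches "  getattr:   { samples:   2, unit:  reqs }" and captures the count
STAT_RE = re.compile(r"^\s+(\w+):\s+\{\s*samples:\s+(\d+),")


def rpc_samples_per_job(text):
    """Sum the 'samples' counters of all operations, grouped by job_id.

    Returns (job_id, total) pairs, busiest job first. Jobs seen on several
    targets are added together.
    """
    totals = defaultdict(int)
    job = None
    for line in text.splitlines():
        m = JOB_RE.match(line)
        if m:
            job = m.group(1)
            continue
        m = STAT_RE.match(line)
        if m and job is not None:
            totals[job] += int(m.group(2))
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative input only, not captured from a real system.
SAMPLE = """\
obdfilter.testfs-OST0000.job_stats=
job_stats:
- job_id:          bash.1000
  snapshot_time:   1620153000
  read_bytes:      { samples:           0, unit: bytes, min:       0, max:       0, sum:               0 }
  write_bytes:     { samples:           4, unit: bytes, min:    4096, max:    4096, sum:           16384 }
  getattr:         { samples:           2, unit:  reqs }
- job_id:          find.1001
  snapshot_time:   1620153001
  read_bytes:      { samples:           0, unit: bytes, min:       0, max:       0, sum:               0 }
  getattr:         { samples:        9000, unit:  reqs }
"""

for job_id, total in rpc_samples_per_job(SAMPLE):
    print(f"{total:>10}  {job_id}")
```

On a server, the same function could be given the output of
"lctl get_param *.*.job_stats" (for example read from sys.stdin) instead of
SAMPLE.
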
> If you are looking for a particular *client* you can look at "lctl
> get_param *.*.exports.*.stats" to see if any are driving a lot of RPCs,
> possibly after clearing those stats with "lctl set_param
> *.*.exports.*.stats=0".
>
> If you feel inclined, it would be quite useful to add a mode to the
> "llstat" utility to be able to read and aggregate stats from e.g. all the
> "exports.*.stats" files and show the top users by NID and RPC count.  I
> think several people have made scripts to this effect (you might even find
> some on GitHub), but nobody has ever submitted one to be included in the
> repo for everyone to use.  There are more elaborate monitoring systems
> (e.g. IML, lltop, Grafana) that need agents installed, central monitoring,
> etc., but having a simple "check load on the local node like 'top'" tool
> would still be helpful.
>
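
A rough sketch of the "top users by NID" idea described above, assuming the
exports.*.stats output has one "<target>.exports.<NID>.stats=" header per
client followed by "<op> <count> samples [unit] ..." lines. This is not an
llstat mode: the function name, the limit parameter, and the SAMPLE text are
hypothetical and for illustration only.

```python
import re
from collections import defaultdict

# Header such as "obdfilter.testfs-OST0000.exports.192.168.1.10@tcp.stats="
HEADER_RE = re.compile(r"^\S+\.exports\.(\S+)\.stats=$")
# Counter such as "getattr   50000 samples [reqs]"; snapshot_time lines
# do not match because their number is not followed by "samples".
COUNTER_RE = re.compile(r"^(\w+)\s+(\d+)\s+samples\s+\[")


def top_clients(text, limit=10):
    """Return up to `limit` (NID, total samples) pairs, busiest first.

    Counts for the same NID are summed across all targets in the input.
    """
    totals = defaultdict(int)
    nid = None
    for raw in text.splitlines():
        line = raw.strip()
        m = HEADER_RE.match(line)
        if m:
            nid = m.group(1)
            continue
        m = COUNTER_RE.match(line)
        if m and nid is not None:
            totals[nid] += int(m.group(2))
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]


# Illustrative input only, not captured from a real system.
SAMPLE = """\
obdfilter.testfs-OST0000.exports.192.168.1.10@tcp.stats=
snapshot_time             1620153000.123456 secs.usecs
read_bytes                12 samples [bytes] 4096 1048576 8388608
write_bytes               3 samples [bytes] 4096 4096 12288
obdfilter.testfs-OST0000.exports.192.168.1.11@tcp.stats=
snapshot_time             1620153000.123999 secs.usecs
getattr                   50000 samples [reqs]
obdfilter.testfs-OST0001.exports.192.168.1.10@tcp.stats=
snapshot_time             1620153000.124500 secs.usecs
read_bytes                5 samples [bytes] 4096 1048576 4194304
"""

for nid, total in top_clients(SAMPLE):
    print(f"{total:>10}  {nid}")
```

Clearing the stats first, as suggested above, and then sampling after a
short interval would make the totals reflect recent load rather than load
accumulated since the export was created.
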
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Lustre Architect
> Whamcloud
>