[lustre-discuss] good ways to identify clients causing problems?
andersnb at ucar.edu
Tue May 4 12:47:57 PDT 2021
Thank you so much!
On Tue, May 4, 2021 at 1:31 PM Andreas Dilger <adilger at whamcloud.com> wrote:
> On May 4, 2021, at 12:41, Bill Anderson via lustre-discuss <
> lustre-discuss at lists.lustre.org> wrote:
> Hi All,
> Can you recommend good ways to identify Lustre client hosts that might
> be causing stability or performance problems for the entire filesystem?
> For example, if a user is inadvertently doing something that's creating
> an RPC storm, what are good ways to identify the client host that has
> triggered the storm?
>
> If you have a JobID enabled on the clients (which can be done even if they
> are not batch scheduled, like "procname_uid" for login nodes), then you can
> watch "lctl get_param *.*.job_stats | grep -v ' 0, unit:'" (to filter out
> unused stats) to see if there are *jobs* which put a high RPC load on that
> If you are looking for a particular *client* you can look at "lctl
> get_param *.*.exports.*.stats" to see if any are driving a lot of RPCs,
> possibly after clearing those stats with "lctl set_param
> If you feel inclined, it would be quite useful to add a mode to the
> "llstat" utility to be able to read and aggregate stats from e.g. all the
> "exports.*.stats" files and show the top users by NID and RPC count. I
> think several people have made scripts to this effect (you might even find
> some on GitHub), but nobody has ever submitted one to be included into the
> repo for everyone to use. There are more elaborate monitoring systems
> (e.g. IML, lltop, Grafana) that need agents installed, central monitoring,
> etc., but having a simple "check load on the local node like 'top'" tool
> would still be helpful.
> Cheers, Andreas
> Andreas Dilger
> Principal Lustre Architect
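
For anyone who wants a starting point for the "top users by NID and RPC count" idea above, here is a minimal sketch in Python. It is not an existing Lustre tool. It assumes the text layout that "lctl get_param *.*.exports.*.stats" typically prints: a "<param>.exports.<NID>.stats=" header line followed by "<counter> <N> samples [<unit>] ..." lines. The function names and the SAMPLE data are made up for illustration, and summing the "samples" field over every counter is only a rough proxy for RPC load, so check it against the output on your own servers.

```python
import re
from collections import Counter

# Assumed header line printed by `lctl get_param`, for example:
#   obdfilter.testfs-OST0000.exports.10.0.0.5@o2ib.stats=
HEADER = re.compile(r"^\S+\.exports\.(?P<nid>\S+)\.stats=$")

# Assumed counter line, for example:
#   write_bytes   10 samples [bytes] 4096 1048576 5242880
# The snapshot_time line has no "samples" field, so it is skipped.
COUNTER = re.compile(r"^(?P<name>\S+)\s+(?P<samples>\d+)\s+samples\s+\[")


def rpc_counts_by_nid(text):
    """Sum the 'samples' field of every counter, grouped by client NID."""
    totals = Counter()
    nid = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            # A new export section starts; later counters belong to this NID.
            nid = header.group("nid")
            continue
        counter = COUNTER.match(line)
        if counter and nid is not None:
            totals[nid] += int(counter.group("samples"))
    return totals


def top_clients(text, n=10):
    """Return the n busiest NIDs as (nid, total_samples), busiest first."""
    return rpc_counts_by_nid(text).most_common(n)


# Made-up sample input in the assumed format (not real measurements).
SAMPLE = """\
obdfilter.testfs-OST0000.exports.10.0.0.5@o2ib.stats=
snapshot_time             1620150000.000000 secs.usecs
read_bytes                40 samples [bytes] 4096 1048576 20971520
write_bytes               10 samples [bytes] 4096 1048576 5242880
obdfilter.testfs-OST0000.exports.10.0.0.9@o2ib.stats=
snapshot_time             1620150000.000000 secs.usecs
write_bytes               900 samples [bytes] 4096 4096 3686400
ping                      3 samples [reqs]
"""

for nid, count in top_clients(SAMPLE):
    print(nid, count)
```

On a server, the captured output of the lctl command would be passed to top_clients() in place of SAMPLE. Because the counters are cumulative, clearing the stats first (or diffing two snapshots) gives a better picture of who is busy right now. A NID that appears under several targets on the same server is summed into one total.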