[lustre-discuss] good ways to identify clients causing problems?
adilger at whamcloud.com
Tue May 4 12:31:34 PDT 2021
On May 4, 2021, at 12:41, Bill Anderson via lustre-discuss <lustre-discuss at lists.lustre.org> wrote:
Can you recommend good ways to identify Lustre client hosts that might be causing stability or performance problems for the entire filesystem?
For example, if a user is inadvertently doing something that's creating an RPC storm, what are good ways to identify the client host that has triggered the storm?
If you have JobID enabled on the clients (this works even if they are not batch scheduled, e.g. "procname_uid" for login nodes), then you can watch "lctl get_param *.*.job_stats | grep -v ' 0, unit:'" (the grep filters out unused stats) to see whether any *jobs* are putting a high RPC load on that server.
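As a rough illustration of how that job_stats output could be ranked, here is a minimal Python sketch. It assumes the YAML-like layout shown in the sample text (one "- job_id:" entry per job, followed by "name: { samples: N, unit: ... }" lines); the exact fields vary between Lustre versions, the sample values are made up, and the function name ops_by_job is hypothetical rather than part of any Lustre tool.

```python
import re
from collections import Counter

# Assumed line shapes in "lctl get_param *.*.job_stats" output.
TARGET = re.compile(r"^(\S+)\.job_stats=\s*$")
JOB = re.compile(r"^-\s+job_id:\s+(.+?)\s*$")
STAT = re.compile(r"^\s+(\w+):\s+\{\s*samples:\s*(\d+),")


def ops_by_job(text):
    """Sum the 'samples' counters per (target, job_id) pair."""
    totals = Counter()
    target = job = None
    for line in text.splitlines():
        m = TARGET.match(line)
        if m:
            target, job = m.group(1), None
            continue
        m = JOB.match(line)
        if m:
            job = m.group(1)
            continue
        m = STAT.match(line)
        if m and job is not None:
            totals[(target, job)] += int(m.group(2))
    return totals


# Illustrative sample only; not captured from a real system.
SAMPLE = """\
obdfilter.testfs-OST0000.job_stats=
job_stats:
- job_id:          dd.0
  snapshot_time:   1620156694
  read_bytes:      { samples:           0, unit: bytes, min:       0, max:       0, sum:               0 }
  write_bytes:     { samples:         256, unit: bytes, min: 1048576, max: 1048576, sum:       268435456 }
  punch:           { samples:           1, unit:  reqs }
- job_id:          cp.500
  snapshot_time:   1620156694
  read_bytes:      { samples:          12, unit: bytes, min:    4096, max: 1048576, sum:         6291456 }
  getattr:         { samples:           3, unit:  reqs }
"""

if __name__ == "__main__":
    # Busiest jobs first.
    for (target, job), count in ops_by_job(SAMPLE).most_common(10):
        print(f"{count:>10}  {job:<20} {target}")
```

On a server, the text would come from the lctl command above (for example via subprocess) instead of SAMPLE. Summing "samples" across all operation types is only a crude proxy for RPC load, so weighting individual operations differently may make more sense for a given workload.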
If you are looking for a particular *client* you can look at "lctl get_param *.*.exports.*.stats" to see if any are driving a lot of RPCs, possibly after clearing those stats with "lctl set_param *.*.exports.*.stats=0".
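For the per-client view, a small Python sketch along the following lines could total the counters per NID across all targets on a server. It assumes each export is introduced by a "<target>.exports.<NID>.stats=" line followed by "name count samples [unit] ..." lines, as in the made-up sample; the function name rpcs_by_nid is hypothetical.

```python
import re
from collections import Counter

# Assumed line shapes in "lctl get_param *.*.exports.*.stats" output.
HEADER = re.compile(r"^\S+?\.exports\.(\S+)\.stats=\s*$")
STAT = re.compile(r"^(\w+)\s+(\d+)\s+samples\s+\[")


def rpcs_by_nid(text):
    """Sum the sample counts of every stat line, grouped by client NID."""
    totals = Counter()
    nid = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            nid = header.group(1)
            continue
        stat = STAT.match(line)
        if stat and nid is not None:
            totals[nid] += int(stat.group(2))
    return totals


# Illustrative sample only; not captured from a real system.
SAMPLE = """\
obdfilter.testfs-OST0000.exports.192.168.1.10@tcp.stats=
snapshot_time             1620156694.123456789 secs.nsecs
read_bytes                10 samples [bytes] 4096 1048576 5242880
create                    5 samples [reqs]
obdfilter.testfs-OST0000.exports.192.168.1.11@tcp.stats=
snapshot_time             1620156694.123456789 secs.nsecs
write_bytes               400 samples [bytes] 4096 1048576 104857600
obdfilter.testfs-OST0001.exports.192.168.1.10@tcp.stats=
snapshot_time             1620156694.123456789 secs.nsecs
statfs                    3 samples [reqs]
"""

if __name__ == "__main__":
    # Busiest clients first.
    for nid, count in rpcs_by_nid(SAMPLE).most_common(10):
        print(f"{count:>10}  {nid}")
```

The snapshot_time line is skipped because it does not match the "count samples [unit]" shape. The counters are cumulative since the export was created or the stats were last cleared, which is why clearing them first makes the ranking easier to read.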
If you feel inclined, it would be quite useful to add a mode to the "llstat" utility to read and aggregate stats from e.g. all the "exports.*.stats" files and show the top users by NID and RPC count. I think several people have written scripts to this effect (you might even find some on GitHub), but nobody has ever submitted one for inclusion in the repo for everyone to use. There are more elaborate monitoring systems (e.g. IML, lltop, Grafana) that need agents installed, central monitoring, etc., but having a simple tool to check the load on the local node, like "top", would still be helpful.
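To make the "top"-like idea concrete, here is a hedged Python sketch of the ranking step only. It assumes the per-NID RPC totals have already been collected twice, some seconds apart, into plain dictionaries; how they are collected is left open. The function name top_clients and the example NIDs and counts are hypothetical, and this is not an existing llstat mode.

```python
def top_clients(before, after, interval_s, limit=10):
    """Rank NIDs by RPCs per second between two {nid: total_count} snapshots."""
    rates = []
    for nid, count in after.items():
        delta = count - before.get(nid, 0)
        if delta < 0:
            # Counter went backwards: assume the stats were cleared or the
            # client reconnected, and count only what accumulated since then.
            delta = count
        if delta > 0:
            rates.append((delta / interval_s, nid))
    # Highest rate first; ties broken by NID for stable output.
    rates.sort(key=lambda r: (-r[0], r[1]))
    return rates[:limit]


if __name__ == "__main__":
    # Made-up snapshots taken 10 seconds apart.
    before = {"192.168.1.10@tcp": 100, "192.168.1.11@tcp": 50}
    after = {"192.168.1.10@tcp": 160, "192.168.1.11@tcp": 1050, "10.0.0.5@o2ib": 20}
    for rate, nid in top_clients(before, after, interval_s=10):
        print(f"{rate:>10.1f} rpc/s  {nid}")
```

Working from the difference between two snapshots avoids having to clear the server-side counters, at the cost of waiting one interval before the first result appears.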
Principal Lustre Architect