[Lustre-discuss] Interpreting stats files
Mohr Jr, Richard Frank (Rick Mohr)
rmohr at utk.edu
Mon Nov 10 11:09:42 PST 2014
On Nov 10, 2014, at 1:14 PM, Brock Palen <brockp at umich.edu> wrote:
> This is cool never seen it before!
> Question though, is it really per job? Or is it per node combining multi node jobs into one set of stats?
> In our case we allow multiple jobs on a node, would job A and job B on the same node each have their own stats? Or will their stats overlap?
I believe each job on the same node should have its own stats. If I am not mistaken, the jobstats feature basically just tags each request with a user-defined string (which in this case is the contents of an environment variable). When the requests reach the servers, all requests with the same "tag" get aggregated together.
Keep in mind that each MDT/OST has its own jobstats file, so if you want to see stats on all the Lustre requests for a given job, you will need to pull those stats from each MDT/OST and aggregate the data. You may also want to tweak the auto-cleanup interval, which is 10 minutes by default. If a job is busy computing and doesn't do any I/O for more than 10 minutes, the Lustre servers may automatically clean out that job's stat info (which might not be what you want).
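To illustrate the aggregation step, here is a minimal Python sketch. It assumes the job_stats text has already been collected from each target (for example with "lctl get_param" on the servers) and that it follows the YAML-like layout shown in the Lustre manual, where each record starts with a "- job_id:" line followed by one line per operation containing a "samples:" count. The sample text, job IDs, and function names below are made up for illustration; check the field names against what your Lustre version actually prints.

```python
import re
from collections import defaultdict

# Start of a job record, e.g. "- job_id:          1234.sched"
JOB_RE = re.compile(r"^-\s*job_id:\s*(\S+)")
# Operation line, e.g. "  read:  { samples: 10, unit: bytes, ... }"
OP_RE = re.compile(r"^\s+(\w+):\s*\{\s*samples:\s*(\d+)")

def parse_job_stats(text):
    """Return {job_id: {operation: samples}} for one target's job_stats text."""
    stats = defaultdict(lambda: defaultdict(int))
    job = None
    for line in text.splitlines():
        m = JOB_RE.match(line)
        if m:
            job = m.group(1)
            continue
        m = OP_RE.match(line)
        if m and job is not None:
            stats[job][m.group(1)] += int(m.group(2))
    return stats

def aggregate(texts):
    """Sum per-operation sample counts for each job across all targets."""
    total = defaultdict(lambda: defaultdict(int))
    for text in texts:
        for job, ops in parse_job_stats(text).items():
            for op, count in ops.items():
                total[job][op] += count
    return total

# Hypothetical output from two OSTs (illustrative values only).
OST0 = """job_stats:
- job_id:          1234.sched
  snapshot_time:   1415646000
  read:            { samples: 10, unit: bytes, min: 4096, max: 1048576, sum: 5242880 }
  write:           { samples: 4, unit: bytes, min: 4096, max: 4096, sum: 16384 }
- job_id:          updatedb.0
  snapshot_time:   1415646001
  read:            { samples: 2, unit: bytes, min: 4096, max: 4096, sum: 8192 }
"""
OST1 = """job_stats:
- job_id:          1234.sched
  snapshot_time:   1415646005
  read:            { samples: 5, unit: bytes, min: 4096, max: 4096, sum: 20480 }
  write:           { samples: 1, unit: bytes, min: 4096, max: 4096, sum: 4096 }
"""

totals = aggregate([OST0, OST1])
for job, ops in sorted(totals.items()):
    print(job, dict(ops))
```

This only sums the "samples" counts. Summing the byte totals as well would mean parsing the "sum:" field in the same way.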
One other tip: The examples in the Lustre manual that show how to enable jobstats often use the "lctl conf_param" command. This will cause all clients to use the same env variable for reporting jobstats. However, it can be useful to customize this based on the client's functionality. For example, you can use "lctl set_param jobid_var=PBS_JOBID" on compute nodes so that they report stats on a per-job basis. Then you can use "lctl set_param jobid_var=procname_uid" on login nodes to report stats based on process name and UID. Then if your MDT gets slammed, you should be able to easily tell if the traffic is coming from a batch job or a user running an interactive command. And if it is an interactive command, you will have the process name and the user's UID. (I was able to use this to track down a Lustre client that was slamming our MDT because it was misconfigured and trying to index our Lustre file system for the "locate" command database.)
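As a rough sketch of how the two tag styles could be told apart when reading the stats, the Python helper below guesses whether a job ID came from a login node or a compute node. It assumes that procname_uid tags appear as "procname.uid" (e.g. "updatedb.0") and that batch IDs look like PBS-style "sequence.server" strings. Both are assumptions about the tag format, and classify_job_id is a hypothetical helper, so treat this as a heuristic and not a reliable test.

```python
def classify_job_id(job_id):
    """Guess whether a jobstats tag is 'interactive' (procname.uid) or 'batch'.

    Heuristic only: a tag counts as interactive when the part after the last
    dot is a numeric UID and the part before it is not purely numeric.
    """
    name, sep, uid = job_id.rpartition(".")
    if sep and name and uid.isdigit() and not name.isdigit():
        return ("interactive", name, int(uid))
    return ("batch", job_id, None)

# Illustrative tags, not taken from a real system.
for tag in ["updatedb.0", "bash.1000", "1234.sched01", "5678"]:
    print(tag, "->", classify_job_id(tag))
```

A process name that happens to be all digits, or a batch ID whose server part is numeric, would be misclassified, so adjust the rule to match your scheduler's ID format.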
Senior HPC System Administrator
National Institute for Computational Sciences