[lustre-discuss] Spiking OSS load?

Mohr Jr, Richard Frank (Rick Mohr) rmohr at utk.edu
Thu Aug 3 09:41:49 PDT 2017


> On Aug 1, 2017, at 3:07 PM, Jason Williams <jasonw at jhu.edu> wrote:

> 1)      Is 512 threads a reasonable setting or should it be lower?

Since your servers have enough memory to support 512 threads, that setting is probably reasonable.  If your server load is ~100, most of those threads are probably sitting idle (which should be fine).  I think you would only need to lower the value if you saw that all the threads were routinely busy and there was some evidence that having that many busy threads was causing an issue on your server.

> 2)      Is high load “normal” if the file system is under heavy use?  At the time I see a lot of open and attr calls which I thought would load the MDS over the OSS… but my under-the-hood understanding is limited at best.

A high load might very well be normal for your file system.  As you have seen, lots of requests can result in threads sitting in IO wait states, which causes the load to increase.  On my servers, I don’t usually bat an eye if I see loads over 100.  However, there are still a few things you should probably look out for:

- If the storage is processing requests quickly, but there are more incoming requests than it can handle, the load will go up (which is normal).  But if the IO requests are getting backlogged because the storage is not handling requests as fast as it should, then that is a problem.  Running iostat should give you an idea which case you are running into.

- If the load starts approaching the number of ost threads, then you could be getting into a state where the server cannot accept any more incoming requests.
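
As a rough illustration of the second point, here is a minimal Python sketch that compares the 1-minute load average with the number of OST service threads.  The function names and the 0.8 warning fraction are my own assumptions, not anything Lustre defines.  Load average also counts every runnable or IO-blocked task on the box, not just OST threads, so treat this as a hint only.  The thread count is passed in by hand (e.g. the 512 discussed above); on many Lustre releases the started count can be read with something like "lctl get_param ost.OSS.ost_io.threads_started", but check the parameter name on your version.

```python
def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(value) for value in fields[:3])


def assess_load(load_1min, ost_threads, warn_fraction=0.8):
    """Classify the load relative to the OST thread count.

    warn_fraction is an arbitrary, assumed threshold for illustration.
    """
    if ost_threads <= 0:
        raise ValueError("ost_threads must be positive")
    ratio = load_1min / ost_threads
    if ratio >= 1.0:
        return "saturated"
    if ratio >= warn_fraction:
        return "approaching thread limit"
    return "ok"


def check_server(ost_threads, loadavg_path="/proc/loadavg"):
    """Read the live load average on a Linux server and classify it."""
    with open(loadavg_path) as handle:
        load_1min, _, _ = parse_loadavg(handle.read())
    return load_1min, assess_load(load_1min, ost_threads)
```

On an OSS you would call check_server(512) and look at the returned label.  A load of ~100 against 512 threads comes back as "ok", which matches the reasoning above.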

> 3)      Should I be looking at other tunables?

You could double check that read/write caching is enabled (which I think it is by default in Lustre 2.5).

One thing I would recommend would be to take a look at the brw_stats for the OSTs to see what sizes of IO requests you are getting.  If there are lots of small reads/writes, this can cause IO requests to back up and drive up the load, which in turn can cause performance problems.  I don’t know what kinds of codes your users run, so I don’t know what their IO patterns are like.  When I see very high loads on my servers, I usually check to see if there are lots of small IO requests from a single user.  This can sometimes be an indication that their code is performing IO in a suboptimal manner that is negatively impacting the file system.  We have had a fair amount of success working with users to improve their IO patterns.  This not only helps alleviate load on our servers, but it also increases performance for other users.  In some cases, just having the user restripe a file can dramatically reduce load.
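
To make the brw_stats check a bit more concrete, here is a minimal Python sketch that pulls the "pages per bulk r/w" histogram out of brw_stats text and reports what fraction of RPCs are small.  It assumes the usual column layout (rpcs, %, cum % for reads, then the same for writes after a "|").  The sample numbers are made up purely for illustration, and the 16-page cutoff (64 KiB with 4 KiB pages) is an arbitrary assumption, not a Lustre-defined threshold.

```python
import re

# One histogram row, e.g. "4:   50   5  75  |    0   0  10"
ROW = re.compile(
    r"^\s*(\d+)([KM]?):\s+(\d+)\s+\d+\s+\d+\s+\|\s+(\d+)\s+\d+\s+\d+\s*$"
)
SCALE = {"": 1, "K": 1024, "M": 1024 * 1024}

# Made-up, abbreviated sample for illustration only (not real measurements).
SAMPLE = """\
snapshot_time:         1501778509.123456 (secs.usecs)

                           read      |     write
pages per bulk r/w     rpcs  % cum % |  rpcs        % cum %
1:                      600  60  60  |  100  10  10
2:                      100  10  70  |    0   0  10
4:                       50   5  75  |    0   0  10
256:                    250  25 100  |  900  90 100

                           read      |     write
discontiguous pages    rpcs  % cum % |  rpcs        % cum %
0:                     1000 100 100  | 1000 100 100
"""


def parse_pages_per_bulk(text):
    """Return {pages: (read_rpcs, write_rpcs)} for the first matching section."""
    histogram = {}
    in_section = False
    for line in text.splitlines():
        if line.startswith("pages per bulk r/w"):
            in_section = True
            continue
        if not in_section:
            continue
        match = ROW.match(line)
        if match is None:
            break  # a blank line or the next header ends the section
        pages = int(match.group(1)) * SCALE[match.group(2)]
        histogram[pages] = (int(match.group(3)), int(match.group(4)))
    return histogram


def small_rpc_fraction(histogram, column, max_pages=16):
    """Fraction of RPCs with at most max_pages pages (column 0=read, 1=write)."""
    total = sum(counts[column] for counts in histogram.values())
    if total == 0:
        return 0.0
    small = sum(
        counts[column] for pages, counts in histogram.items() if pages <= max_pages
    )
    return small / total
```

On a server you would feed it the text of each OST's brw_stats instead of SAMPLE.  A high small-RPC fraction for reads or writes is only a prompt to go look for the job or user responsible; it does not by itself prove anything is wrong.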

So in summary:

Q: Is it a problem to have a high load on my OSS servers?
A: It depends….

(Wish it could be a little more clear cut than that)

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu


