[lustre-discuss] NRS TBF by UID and congestion

Moreno Diego (ID SIS) diego.moreno at id.ethz.ch
Mon Oct 18 07:55:30 PDT 2021


Salut Stephane!

Thanks a lot for this. I guess this is the kind of helpful answer I was looking for when I posted. All in all, it seems we will need to find the right value that works for us. I also have the impression that changing the settings in the middle of a very high load might not be the best idea, since the queues are already filled: we see the filesystem appear blocked for a few minutes after we enable it, but afterwards it seems to work better. Have you also tried enabling it on the LDLM services? I was advised in the past never to enable any kind of throttling on LDLM, so that locks are cancelled as fast as possible; otherwise we would see high CPU and memory usage on the MDS side.

I agree that it would be very useful to know which users have long waiting queues; that could eventually help us build dynamic, more complex throttling rules.

Regards,

Diego
 

On 15.10.21, 09:13, "Stephane Thiell" <sthiell at stanford.edu> wrote:

    Salut Diego!

    Yes, we have been using NRS TBF by UID on our Oak storage system for months now with Lustre 2.12. It's a capacity-oriented, global filesystem, not designed for heavy workloads (unlike our scratch filesystem) but with many users, and as such a great candidate for NRS TBF UID. Since enabling NRS, we have seen WAY fewer occurrences of a single user abusing the system (which is always by mistake, so we're helping them too!). We use NRS TBF UID for all Lustre services on MDS and OSS.
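    For reference, switching a service over looks roughly like this (a minimal sketch of the usual lctl commands; which MDS services to cover, e.g. mdt vs. mdt_readpage, is up to you, so double-check the service names on your own servers):

        # OSS: key the ost_io NRS policy on UID
        lctl set_param ost.OSS.ost_io.nrs_policies="tbf uid"
        # MDS: same idea for the metadata services (mdt, mdt_readpage, ...)
        lctl set_param mds.MDS.mdt.nrs_policies="tbf uid"
        lctl set_param mds.MDS.mdt_readpage.nrs_policies="tbf uid"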

    We have an "exemption" rule for root, "root {0}" at 10000, and a default rule, "default {*}", at a certain value. This value is per user and per CPT (and it also applies per Lustre service; on the MDS, for example, mdt_readpage is a separate service). If you have large servers with many CPTs and set the value to 500, that's 500 req/s per CPT per user, so it may still be too high to be useful. The ideal value also probably depends on your default striping and other specifics.
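    For the record, the rules are set up with something like the following (a sketch from memory, so double-check the exact rule syntax against the Lustre manual for your version; the rule name and rates here are just examples):

        # exempt root (UID 0) with a high rate, then cap everyone else by
        # lowering the rate of the built-in default rule
        lctl set_param ost.OSS.ost_io.nrs_tbf_rule="start root_exempt uid={0} rate=10000"
        lctl set_param ost.OSS.ost_io.nrs_tbf_rule="change default rate=500"
        # keep in mind the rate is per user *and* per CPT: with 4 CPTs on the
        # server, rate=500 allows up to 4 x 500 = 2000 req/s for a single user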

    To set the NRS rate values right for the system, our approach is to monitor the active/queued values reported for the 'tbf uid' policy on each OSS with lctl get_param ost.OSS.ost_io.nrs_tbf_rule (and likewise on the MDS for each mdt service). We record these instant, gauge-like values every minute, which seems to be enough to see trends. The 'queued' number is the most useful to me, since graphing it makes the impact of a rule easy to see. Graphing these metrics over time allows us to adjust the rates so that queueing is the exception rather than the norm, while still limiting heavy workloads.
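    If it helps, the collection itself can be as simple as a loop (or cron job) that timestamps the raw output and appends it to a log for later parsing; a rough sketch, with the log path being just an example:

        # record the per-CPT TBF rule state once a minute; parse the queued
        # counters out of the log afterwards or ship them to your metrics system
        while true; do
            date +%FT%T                                >> /var/log/nrs_tbf_state.log
            lctl get_param ost.OSS.ost_io.nrs_tbf_rule >> /var/log/nrs_tbf_state.log
            sleep 60
        done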

    So it's working for us on this system; the only thing now is that we would love to have a way to get additional NRS stats out of Lustre, for example the UIDs that have hit the rate limit over a given period.

    Lastly, we tried to implement it on our scratch filesystem, but that's more difficult. If a user has heavy-duty jobs running on compute nodes and hits the rate limit, that user basically cannot transfer anything from a DTN or a login node anymore (and will complain). I've opened LU-14567 to discuss wildcard support for "uid" in NRS TBF policy rules ('tbf' rather than 'tbf uid') so that we could mix other, non-UID TBF rules with UID TBF rules. I don't know how hard it would be to implement.

    Hope that helps,

    Stephane


    > On Oct 14, 2021, at 12:33 PM, Moreno Diego (ID SIS) <diego.moreno at id.ethz.ch> wrote:
    > 
    > Hi Lustre friends,
    > 
    > I'm wondering if someone has experience setting up NRS TBF (by UID) on the OSTs (the ost_io and ost services) in order to avoid congestion of the filesystem's IOPS or bandwidth. All my attempts over the last few months have failed miserably, ending up with something that does not look like QoS once the system is under high load: at that point, not even the TBF UID policy saves us from slow response times for every user. So far I have only tried setting it by UID, so that every user gets a fair share of the bandwidth, and I have tried different rate values for the default rule (5'000, 1'000 or 500). We run Lustre 2.12 on our cluster.
    > 
    > Maybe there is some other setting that needs tuning (I see a parameter, /sys/module/ptlrpc/parameters/tbf_rate, set to 10'000, that I could not find documented). Is there anything I'm missing about this feature?
    > 
    > Regards,
    > 
    > Diego
    > 
    > 
    > _______________________________________________
    > lustre-discuss mailing list
    > lustre-discuss at lists.lustre.org
    > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
