[lustre-discuss] More issues with cur_grant_bytes

Nathan Dauchy - NOAA Affiliate nathan.dauchy at noaa.gov
Wed Dec 9 08:40:55 PST 2020


On Tue, Dec 8, 2020 at 11:46 AM Kevin M. Hildebrand <kevin at umd.edu> wrote:

> We appear to be tripping over the same issues reported recently by
> Tung-Han Hsieh and Simon Guilbault, namely that cur_grant_bytes is being
> reduced to a very small value and causing abysmal performance.
>

I'm curious if anyone encountering this problem sees a correlation between
cur_grant_bytes and brw_size.  For us, the max RPC size is 4MB, and that
also seems to be the threshold for cur_grant_bytes below which performance
degrades drastically.  Would it be reasonable for the grant shrinker to
never go below brw_size?
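
For anyone who wants to check for that correlation on their own clients, here is a minimal sketch in Python. It parses the output of "lctl get_param osc.*.cur_grant_bytes" and lists the targets whose grant is below the max RPC size. The threshold of 4 MiB is an assumption based on our 4MB brw_size, and the function name and sample text are only illustrative (the sample lines are copied from Kevin's output quoted below).

```python
import re

# Assumed threshold: a 4 MiB max RPC size (brw_size). Adjust for your site.
BRW_SIZE_BYTES = 4 * 1024 * 1024

# Matches lines such as:
#   osc.lustre10-OST0000-osc-ffff8e47dc02a800.cur_grant_bytes=802542
LINE_RE = re.compile(
    r"^osc\.(?P<target>.+?-OST[0-9a-fA-F]+)-osc-[0-9a-fA-F]+"
    r"\.cur_grant_bytes=(?P<grant>\d+)$"
)


def low_grant_targets(text, threshold=BRW_SIZE_BYTES):
    """Return (target, grant) pairs whose cur_grant_bytes is below threshold."""
    flagged = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank lines or anything that is not a grant line
        grant = int(match.group("grant"))
        if grant < threshold:
            flagged.append((match.group("target"), grant))
    return flagged


# Sample input: a subset of the get_param output quoted in this thread.
SAMPLE = """\
osc.lustre10-OST0000-osc-ffff8e47dc02a800.cur_grant_bytes=802542
osc.lustre10-OST0001-osc-ffff8e47dc02a800.cur_grant_bytes=924204
osc.lustre10-OST0002-osc-ffff8e47dc02a800.cur_grant_bytes=11076653
osc.lustre10-OST0003-osc-ffff8e47dc02a800.cur_grant_bytes=108098653
osc.lustre10-OST0004-osc-ffff8e47dc02a800.cur_grant_bytes=797559
osc.lustre10-OST0005-osc-ffff8e47dc02a800.cur_grant_bytes=4719258
osc.lustre10-OST000a-osc-ffff8e47dc02a800.cur_grant_bytes=278803109
"""

flagged = low_grant_targets(SAMPLE)
for target, grant in flagged:
    print(f"{target}: cur_grant_bytes={grant} is below {BRW_SIZE_BYTES}")
```

On that sample, the targets flagged are OST0000, OST0001, and OST0004, which are the same three OSTs Kevin reports as slow. On a live client, the text could instead come from running the get_param command and capturing its output.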

As for a "fix", we monitor for low cur_grant_bytes and drain work off the
affected nodes, at least to the point where we can run set_param
lru_size=clear and confirm that get_param shows lru_size=0. We then force
a reconnection and renegotiation of the grant with the server using
"lctl --device <num> activate".

I too would be interested to know if there are downsides to setting
grant_shrink=0 on the clients, and whether that is confirmed to actually
avoid the problem.

Regards,
Nathan


>
> For example, OSTs 0, 1, and 4 are having poor performance on this client
> running Lustre 2.12.5:
> # lctl get_param osc.*.cur_grant_bytes
> osc.lustre10-OST0000-osc-ffff8e47dc02a800.cur_grant_bytes=802542
> osc.lustre10-OST0001-osc-ffff8e47dc02a800.cur_grant_bytes=924204
> osc.lustre10-OST0002-osc-ffff8e47dc02a800.cur_grant_bytes=11076653
> osc.lustre10-OST0003-osc-ffff8e47dc02a800.cur_grant_bytes=108098653
> osc.lustre10-OST0004-osc-ffff8e47dc02a800.cur_grant_bytes=797559
> osc.lustre10-OST0005-osc-ffff8e47dc02a800.cur_grant_bytes=4719258
> osc.lustre10-OST0006-osc-ffff8e47dc02a800.cur_grant_bytes=4898757
> osc.lustre10-OST0007-osc-ffff8e47dc02a800.cur_grant_bytes=10747719
> osc.lustre10-OST0008-osc-ffff8e47dc02a800.cur_grant_bytes=315019599
> osc.lustre10-OST0009-osc-ffff8e47dc02a800.cur_grant_bytes=597198336
> osc.lustre10-OST000a-osc-ffff8e47dc02a800.cur_grant_bytes=278803109
> osc.lustre10-OST000b-osc-ffff8e47dc02a800.cur_grant_bytes=1335800831
> osc.lustre10-OST000c-osc-ffff8e47dc02a800.cur_grant_bytes=795705344
> osc.lustre10-OST000d-osc-ffff8e47dc02a800.cur_grant_bytes=1335052800
> osc.lustre10-OST000e-osc-ffff8e47dc02a800.cur_grant_bytes=474925228
> osc.lustre10-OST000f-osc-ffff8e47dc02a800.cur_grant_bytes=1424795647
>
> From the previous discussion, the recommendation seems to have been to run
> lctl set_param -P osc.*.grant_shrink=0 on the client.  Are there any
> downsides to doing this?
> Should I just blindly do this on all of my Lustre clients?
>
> A little more insight into what's going on here would be appreciated.
>
> Thanks,
> Kevin
>
> --
> Kevin Hildebrand
> University of Maryland
> Division of IT