[lustre-discuss] Shrinking grant with 2.12 clients

Simon Guilbault simon.guilbault at calculquebec.ca
Mon Mar 30 10:43:16 PDT 2020


Hi,

We seem to be hitting a performance issue with Lustre 2.12.2 and 2.12.3
clients. Over time, the grant of an OSC shrinks below 1MB and does not grow
back. This lowers the performance of the client to a few MB/s, and even down
to the kB/s range for some OSTs. This does not seem to happen on 2.10.8
clients, since they don’t have the “grant_shrink” flag. The servers are
running 2.12.3 with ZFS 0.7.9.

Here is the per-OST performance we see with a simple dd test. The worst OST
is #5, at 222 kB/s, while a 2.10 client reaches more than 800MB/s on the
same OST.

for i in {0..37}; do lfs setstripe --ost $i --stripe-count 1 ost$i ; done

for i in {0..37}; do dd if=/dev/zero of=ost$i bs=1M count=100; done

104857600 bytes (105 MB) copied, 0.142473 s, 736 MB/s

104857600 bytes (105 MB) copied, 9.22021 s, 11.4 MB/s

104857600 bytes (105 MB) copied, 0.0905684 s, 1.2 GB/s

104857600 bytes (105 MB) copied, 6.36873 s, 16.5 MB/s

104857600 bytes (105 MB) copied, 0.0929602 s, 1.1 GB/s

104857600 bytes (105 MB) copied, 471.699 s, 222 kB/s

104857600 bytes (105 MB) copied, 0.177067 s, 592 MB/s

[...]


As an example, this slow client has a grant of about 0.8MB after being up
for a while:

lctl get_param osc.lustre04-OST0005*.cur_grant_bytes

osc.lustre04-OST0005-osc-ffff98128d818000.cur_grant_bytes=883028
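To check every OSC on a client at once rather than one OST at a time, the same get_param output can be filtered. This is only a sketch: low_grants is a hypothetical helper name, and the 1 MiB default threshold is an assumption based on the request sizes seen here, not a documented Lustre limit.

```shell
# Print the OSC devices whose current grant is below a threshold.
# Reads "name=value" lines on standard input, in the format printed by
#   lctl get_param osc.*.cur_grant_bytes
# The optional first argument is the threshold in bytes (assumed default: 1 MiB).
low_grants() {
    awk -F= -v min="${1:-1048576}" 'NF == 2 && $2 + 0 < min { print $1, $2 }'
}

# On a client: lctl get_param osc.*.cur_grant_bytes | low_grants
```

An empty result means no OSC is below the threshold at that moment.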

In the debug logs, I can see a request being sent as sync I/O because the
grant is now too small to hold the 1.7MB request:

00000008:00000020:10.0:1585145743.107840:0:116122:0:(osc_cache.c:1590:osc_enter_cache())
lustre04-OST0005-osc-ffff98128d818000: grant { dirty: 0/512000 dirty_pages:
448/24562964 dropped: 0 avail: 883028, dirty_grant: 0, reserved: 0, flight:
0 } lru {in list: 146368, left: 64, waiters: 0 }need:1703936

00000008:00000020:10.0:1585145743.107842:0:116122:0:(osc_cache.c:1539:osc_enter_cache_try())
lustre04-OST0005-osc-ffff98128d818000: grant { dirty: 0/512000 dirty_pages:
448/24562964 dropped: 0 avail: 883028, dirty_grant: 0, reserved: 0, flight:
0 } lru {in list: 146368, left: 64, waiters: 0 }need:1703936

00000008:00000020:10.0:1585145743.107843:0:116122:0:(osc_cache.c:1666:osc_enter_cache())
lustre04-OST0005-osc-ffff98128d818000: grant { dirty: 0/512000 dirty_pages:
448/24562964 dropped: 0 avail: 883028, dirty_grant: 0, reserved: 0, flight:
0 } lru {in list: 146368, left: 64, waiters: 0 }no grant space, fall back
to sync i/o
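The comparison that matters in these records is "avail" against "need": 883028 bytes are available while the request needs 1703936 bytes. A small sketch that pulls both numbers out of a single osc_enter_cache() record and prints the shortfall is shown here. grant_shortfall is a hypothetical helper, and it assumes the "avail:" and "need:" field layout seen in the excerpt above.

```shell
# Report the grant shortfall for one osc_enter_cache() debug record on stdin.
# The record may be wrapped across several lines, as in the excerpt above.
# Only the first "avail:" and "need:" values found are used.
grant_shortfall() {
    awk '
        { buf = buf " " $0 }   # join wrapped lines into one string
        END {
            if (match(buf, /avail: *[0-9]+/)) {
                a = substr(buf, RSTART, RLENGTH); sub(/^[^0-9]+/, "", a)
            }
            if (match(buf, /need: *[0-9]+/)) {
                n = substr(buf, RSTART, RLENGTH); sub(/^[^0-9]+/, "", n)
            }
            if (a != "" && n != "")
                printf "avail=%d need=%d short=%d\n", a, n, n - a
        }'
}
```

Fed the first record above, it prints "avail=883028 need=1703936 short=820908", so the client is about 0.8MB short of what this request needs.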

There is currently about 30GB granted on this OST, which has about 22TB free.

[root@lustre04-oss1 ~]# lctl get_param obdfilter/lustre04-OST0005/tot_granted

obdfilter.lustre04-OST0005.tot_granted=30257446912

Somehow, the client does not receive a bigger grant, so it seems to stay
under 1MB forever:

00000008:00000020:4.0:1585145743.107950:0:22701:0:(osc_request.c:705:osc_announce_cached())
dirty: 0 undirty: 2080374783 dropped 0 grant: 883028

00000008:00000020:14.0:1585145743.236923:0:22702:0:(osc_request.c:727:osc_update_grant())
got 0 extra grant

Is this a known issue? I could not find a similar ticket in JIRA, but I do
see some references to disabling grant_shrink in LU-12651 and LU-12759.