[lustre-discuss] Lustre Client Lockup Under Buffered I/O (2.14/2.15)

Wed Jan 19 06:32:52 PST 2022

Hi Andreas,

Apologies in advance for the top-post.  I'm required to use Outlook for work, and it doesn't handle in-line or bottom-posting well.

Client-side defaults prior to any tuning of mine (this is a very minimal 1-client, 1-MDS/MGS, 2-OSS cluster):

~# lctl get_param llite.*.max_cached_mb
llite.lustrefs-ffff8d52a9c52800.max_cached_mb=
users: 5
max_cached_mb: 7748
used_mb: 0
unused_mb: 7748
reclaim_count: 0
~# lctl get_param osc.*.max_dirty_mb
osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_dirty_mb=1938
osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_dirty_mb=1938
~# lctl get_param osc.*.max_rpcs_in_flight
osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_rpcs_in_flight=8
osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_rpcs_in_flight=8
~# lctl get_param osc.*.max_pages_per_rpc
osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_pages_per_rpc=1024
osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_pages_per_rpc=1024

Thus far I've reduced the following to what I felt were really conservative values for a 16GB RAM machine:

~# lctl set_param llite.*.max_cached_mb=1024
llite.lustrefs-ffff8d52a9c52800.max_cached_mb=1024
~# lctl set_param osc.*.max_dirty_mb=512
osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_dirty_mb=512
osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_dirty_mb=512
~# lctl set_param osc.*.max_pages_per_rpc=128
osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_pages_per_rpc=128
osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_pages_per_rpc=128
~# lctl set_param osc.*.max_rpcs_in_flight=2
osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_rpcs_in_flight=2
osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_rpcs_in_flight=2

This slows down how fast I get to basically OOM from <10 seconds to more like 25 seconds, but the trend is identical.

As an example of what I'm seeing on the client, you can see below we start with most free, and then iozone rapidly (within ~10 seconds) causes all memory to be marked used, and that stabilizes at about 140MB free until at some point it stalls for 20 or more seconds and then some has been synced out:

~# dstat --mem
------memory-usage-----
used  free  buff  cach
1029M 13.9G 2756k  215M
1028M 13.9G 2756k  215M
1028M 13.9G 2756k  215M
1088M 13.9G 2756k  215M
2550M 11.5G 2764k 1238M
3989M 10.1G 2764k 1236M
5404M 8881M 2764k 1239M
6831M 7453M 2772k 1240M
8254M 6033M 2772k 1237M
9672M 4613M 2772k 1239M
10.6G 3462M 2772k 1240M
12.1G 1902M 2772k 1240M
13.4G  582M 2772k 1240M
13.9G  139M 2488k 1161M
13.9G  139M 1528k 1174M
13.9G  140M  896k 1175M
13.9G  139M  676k 1176M
13.9G  142M  528k 1177M
13.9G  140M  484k 1188M
13.9G  139M  492k 1188M
13.9G  139M  488k 1188M
13.9G  141M  488k 1186M
13.9G  141M  480k 1187M
13.9G  139M  492k 1188M
13.9G  141M  600k 1188M
13.9G  139M  580k 1187M
13.9G  140M  536k 1186M
13.9G  141M  668k 1186M
13.9G  139M  580k 1188M
13.9G  140M  568k 1187M
12.7G 1299M 2064k 1197M missed 20 ticks <-- client is totally unresponsive during this time
11.0G 2972M 5404k 1238M^C

Additionally, I've messed with sysctl settings.  Defaults:
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500

Revised to conservative values:
vm.dirty_background_bytes = 1073741824
vm.dirty_background_ratio = 0
vm.dirty_bytes = 2147483648
vm.dirty_expire_centisecs = 200
vm.dirty_ratio = 0
vm.dirty_writeback_centisecs = 500

No observed improvement.

I'm going to trawl two logs today side-by-side, one with ldiskfs backing the OSTs, and one with zfs backing the OSTs, and see if I can see what the differences are since the zfs-backed version never gave us this problem.  The only other potentially useful thing I can share right now is that when I turned on full debug logging and ran the test until I hit OOM, the following were the most frequently hit functions in the logs (count, descending, is the first column).  This was approximately 30s of logs:

 205874 cl_page.c:518:cl_vmpage_page())
 206587 cl_page.c:545:cl_page_owner_clear())
 206673 cl_page.c:551:cl_page_owner_clear())
 206748 osc_cache.c:2483:osc_teardown_async_page())
 206815 cl_page.c:867:cl_page_delete())
 206862 cl_page.c:837:cl_page_delete0())
 206878 osc_cache.c:2478:osc_teardown_async_page())
 206928 cl_page.c:869:cl_page_delete())
 206930 cl_page.c:441:cl_page_state_set0())
 206988 osc_page.c:206:osc_page_delete())
 207021 cl_page.c:179:__cl_page_free())
 207021 cl_page.c:193:cl_page_free())
 207021 cl_page.c:532:cl_vmpage_page())
 207024 cl_page.c:210:cl_page_free())
 207075 cl_page.c:430:cl_page_state_set0())
 207169 osc_cache.c:2505:osc_teardown_async_page())
 207175 cl_page.c:475:cl_pagevec_put())
 207202 cl_page.c:492:cl_pagevec_put())
 207211 cl_page.c:822:cl_page_delete0())
 207384 osc_page.c:178:osc_page_delete())
 207422 osc_page.c:177:osc_page_delete())
 413680 cl_page.c:433:cl_page_state_set0())
 413701 cl_page.c:477:cl_pagevec_put())

If anybody has any additional suggestions or requests for more info don't hesitate to ask.

Best,

ellis

From: Andreas Dilger <adilger at whamcloud.com>
Sent: Tuesday, January 18, 2022 9:54 PM
To: Ellis Wilson <elliswilson at microsoft.com>
Cc: lustre-discuss at lists.lustre.org
Subject: [EXTERNAL] Re: [lustre-discuss] Lustre Client Lockup Under Buffered I/O (2.14/2.15)

You don't often get email from adilger at whamcloud.com<mailto:adilger at whamcloud.com>. Learn why this is important<http://aka.ms/LearnAboutSenderIdentification>
On Jan 18, 2022, at 13:40, Ellis Wilson via lustre-discuss <lustre-discuss at lists.lustre.org<mailto:lustre-discuss at lists.lustre.org>> wrote:

Recently we've switched from using ZFS to ldiskfs as the backing filesystem to work around some performance issues and I'm finding that when I put the cluster under load (with as little as a single client) I can almost completely lockup the client.  SSH (even existing sessions) stall, iostat, top, etc all freeze for 20 to 200 seconds.  This alleviates for small windows and recurs as long as I leave the io-generating process in existence.  It reports extremely high CPU and RAM usage, and appears to be consumed exclusively doing 'system'-tagged work.  This is on 2.14.0, but I've reproduced on more or less HOL for master-next.  If I do direct-IO, performance is fantastic and I have no such issues regarding CPU/memory pressure.

Uname: Linux 85df894e-8458-4aa4-b16f-1d47154c0dd2-lclient-a0-g0-vm 5.4.0-1065-azure #68~18.04.1-Ubuntu SMP Fri Dec 3 14:08:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

I dmesg I see consistent spew on the client about:
[19548.601651] LustreError: 30918:0:(events.c:208:client_bulk_callback()) event type 1, status -5, desc 00000000b69b83b0
[19548.662647] LustreError: 30917:0:(events.c:208:client_bulk_callback()) event type 1, status -5, desc 000000009ef2fc22
[19549.153590] Lustre: lustrefs-OST0000-osc-ffff8d52a9c52800: Connection to lustrefs-OST0000 (at 10.1.98.7 at tcp<mailto:10.1.98.7 at tcp>) was lost; in progress operations using this service will wait for recovery to complete
[19549.153621] Lustre: 30927:0:(client.c:2282:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1642535831/real 1642535833]  req at 0000000002361e2d x1722317313374336/t0(0) o4->lustrefs-OST0001-osc-ffff8d52a9c52800 at 10.1.98.10<mailto:lustrefs-OST0001-osc-ffff8d52a9c52800 at 10.1.98.10>@tcp:6/4 lens 488/448 e 0 to 1 dl 1642535883 ref 2 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:''
[19549.153623] Lustre: 30927:0:(client.c:2282:ptlrpc_expire_one_request()) Skipped 4 previous similar messages

But I actually think this is a symptom of extreme memory pressure causing the client to timeout things, not a cause.

Testing with obdfilter-survey (local) on the OSS side shows expected performance of the disk subsystem.  Testing with lnet_selftest from client to OSS shows expected performance.  In neither case do I see the high cpu or memory pressure issues.

Reducing a variety of lctl tunables that appear to govern memory allowances for Lustre clients does not improve the situation.

What have you reduced here?  llite.*.max_cached_mb, osc.*.max_dirty_mb, osc.*.max_rpcs_in_flight and osc.*.max_pages_per_rpc?

By all appearances, the running iozone or even simple dd processes gradually (i.e., over a span of just 10 seconds or so) consumes all 16GB of RAM on the client I'm using.  I've generated bcc profile graphs for both on- and off-cpu analysis, and they are utterly boring -- they basically just reflect rampant calls to shrink_inactive_list resulting from page_cache_alloc in the presence of extreme memory pressure.

We have seen some issues like this that are being looked at, but this is mostly only seen on smaller VM clients used in testing and not larger production clients.  Are you able to test with more RAM on the client?  Have you tried with 2.12.8 installed on the client?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/attachments/20220119/5d7bc968/attachment-0001.html>