[lustre-discuss] Lustre Client Lockup Under Buffered I/O (2.14/2.15)

Raj rajgautam at gmail.com
Thu Jan 20 05:14:25 PST 2022


Ellis, I would also check the peer_credits setting on the server and the client.
They should be the same.
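
For reference, on a TCP (socklnd) setup the configured value can be checked on
each node with something like:

  lnetctl net show -v    (look for peer_credits under the NI tunables)
  cat /sys/module/ksocklnd/parameters/peer_credits

(This is just a sketch for the tcp LND; the exact parameter path differs for
other LNDs.)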

On Wed, Jan 19, 2022 at 9:27 AM Patrick Farrell via lustre-discuss <
lustre-discuss at lists.lustre.org> wrote:

> Ellis,
>
> As you may have guessed, that function set just looks like a node that is
> doing buffered I/O and thrashing for memory.  No particular insight is
> available from the counts of those functions.
>
> Would you consider opening a bug report in the Whamcloud JIRA?  You should
> have enough for a good report; here are a few things that would be helpful
> as well:
>
> It sounds like you can hang the node on demand.  If you could collect
> stack traces with:
>
> echo t > /proc/sysrq-trigger
>
> after creating the hang, that would be useful.  (It will print to dmesg.)
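>
> (Roughly, assuming sysrq is enabled on the client, something like:
>
> echo 1 > /proc/sys/kernel/sysrq      # allow all sysrq functions if restricted
> echo t > /proc/sysrq-trigger         # dump all task stacks to the kernel log
> dmesg -T > sysrq-task-stacks.txt     # save the output once the node recovers
>
> where the output file name is just a placeholder.)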
>
> You've also collected debug logs.  Could you include, say, the last 100
> MiB of that log set?  That should be reasonable to attach if compressed.
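>
> (Assuming the debug buffer is dumped with lctl dk, something along these
> lines should produce an attachable file; the file names are just placeholders:
>
> lctl dk /tmp/lustre-debug.txt
> tail -c 100M /tmp/lustre-debug.txt | gzip > lustre-debug-tail.txt.gz )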
>
> Regards,
> Patrick
>
> ------------------------------
> *From:* lustre-discuss <lustre-discuss-bounces at lists.lustre.org> on
> behalf of Ellis Wilson via lustre-discuss <lustre-discuss at lists.lustre.org
> >
> *Sent:* Wednesday, January 19, 2022 8:32 AM
> *To:* Andreas Dilger <adilger at whamcloud.com>
> *Cc:* lustre-discuss at lists.lustre.org <lustre-discuss at lists.lustre.org>
> *Subject:* Re: [lustre-discuss] Lustre Client Lockup Under Buffered I/O
> (2.14/2.15)
>
>
> Hi Andreas,
>
>
>
> Apologies in advance for the top-post.  I’m required to use Outlook for
> work, and it doesn’t handle in-line or bottom-posting well.
>
>
>
> Client-side defaults prior to any tuning of mine (this is a very minimal
> 1-client, 1-MDS/MGS, 2-OSS cluster):
>
>
> ~# lctl get_param llite.*.max_cached_mb
> llite.lustrefs-ffff8d52a9c52800.max_cached_mb=
> users: 5
> max_cached_mb: 7748
> used_mb: 0
> unused_mb: 7748
> reclaim_count: 0
> ~# lctl get_param osc.*.max_dirty_mb
> osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_dirty_mb=1938
> osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_dirty_mb=1938
> ~# lctl get_param osc.*.max_rpcs_in_flight
> osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_rpcs_in_flight=8
> osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_rpcs_in_flight=8
> ~# lctl get_param osc.*.max_pages_per_rpc
> osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_pages_per_rpc=1024
> osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_pages_per_rpc=1024
>
>
>
> Thus far I’ve reduced the following to what I felt were really
> conservative values for a 16GB RAM machine:
>
>
>
> ~# lctl set_param llite.*.max_cached_mb=1024
> llite.lustrefs-ffff8d52a9c52800.max_cached_mb=1024
> ~# lctl set_param osc.*.max_dirty_mb=512
> osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_dirty_mb=512
> osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_dirty_mb=512
> ~# lctl set_param osc.*.max_pages_per_rpc=128
> osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_pages_per_rpc=128
> osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_pages_per_rpc=128
> ~# lctl set_param osc.*.max_rpcs_in_flight=2
> osc.lustrefs-OST0000-osc-ffff8d52a9c52800.max_rpcs_in_flight=2
> osc.lustrefs-OST0001-osc-ffff8d52a9c52800.max_rpcs_in_flight=2
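>
> (Aside: if these reduced values ever need to survive a remount, my
> understanding is they can be made persistent from the MGS with something
> like "lctl set_param -P osc.*.max_dirty_mb=512"; for these experiments I'm
> just setting them live on the client as above.)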
>
>
>
> This slows down how fast I get to basically OOM from <10 seconds to more
> like 25 seconds, but the trend is identical.
>
>
>
> As an example of what I'm seeing on the client, you can see below that we
> start with most memory free, and then iozone rapidly (within ~10 seconds)
> causes all memory to be marked used.  That stabilizes at about 140MB free
> until, at some point, the client stalls for 20 or more seconds and some
> memory gets synced out:
>
>
> ~# dstat --mem
> ------memory-usage-----
> used  free  buff  cach
> 1029M 13.9G 2756k  215M
> 1028M 13.9G 2756k  215M
> 1028M 13.9G 2756k  215M
> 1088M 13.9G 2756k  215M
> 2550M 11.5G 2764k 1238M
> 3989M 10.1G 2764k 1236M
> 5404M 8881M 2764k 1239M
> 6831M 7453M 2772k 1240M
> 8254M 6033M 2772k 1237M
> 9672M 4613M 2772k 1239M
> 10.6G 3462M 2772k 1240M
> 12.1G 1902M 2772k 1240M
> 13.4G  582M 2772k 1240M
> 13.9G  139M 2488k 1161M
> 13.9G  139M 1528k 1174M
> 13.9G  140M  896k 1175M
> 13.9G  139M  676k 1176M
> 13.9G  142M  528k 1177M
> 13.9G  140M  484k 1188M
> 13.9G  139M  492k 1188M
> 13.9G  139M  488k 1188M
> 13.9G  141M  488k 1186M
> 13.9G  141M  480k 1187M
> 13.9G  139M  492k 1188M
> 13.9G  141M  600k 1188M
> 13.9G  139M  580k 1187M
> 13.9G  140M  536k 1186M
> 13.9G  141M  668k 1186M
> 13.9G  139M  580k 1188M
> 13.9G  140M  568k 1187M
> 12.7G 1299M 2064k 1197M missed 20 ticks <-- client is totally unresponsive during this time
> 11.0G 2972M 5404k 1238M^C
>
>
>
> Additionally, I’ve messed with sysctl settings.  Defaults:
>
> vm.dirty_background_bytes = 0
> vm.dirty_background_ratio = 10
> vm.dirty_bytes = 0
> vm.dirty_expire_centisecs = 3000
> vm.dirty_ratio = 20
> vm.dirty_writeback_centisecs = 500
>
>
>
> Revised to conservative values:
>
> vm.dirty_background_bytes = 1073741824
> vm.dirty_background_ratio = 0
> vm.dirty_bytes = 2147483648
> vm.dirty_expire_centisecs = 200
> vm.dirty_ratio = 0
> vm.dirty_writeback_centisecs = 500
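>
> (For reference, values like these can be applied live with sysctl -w; the
> kernel zeroes the corresponding *_ratio value automatically when a *_bytes
> value is written:
>
> sysctl -w vm.dirty_background_bytes=1073741824
> sysctl -w vm.dirty_bytes=2147483648
> sysctl -w vm.dirty_expire_centisecs=200 )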
>
>
>
> No observed improvement.
>
>
>
> I'm going to trawl two logs today side by side, one with ldiskfs backing
> the OSTs and one with zfs backing the OSTs, and see what the differences
> are, since the zfs-backed version never gave us this problem.  The only
> other potentially useful thing I can share right now is that when I turned
> on full debug logging and ran the test until I hit OOM, the following were
> the most frequently hit functions in the logs (the count is the first
> column; a rough counting pipeline is sketched after the list).  This was
> approximately 30s of logs:
>
>
>  205874 cl_page.c:518:cl_vmpage_page())
>  206587 cl_page.c:545:cl_page_owner_clear())
>  206673 cl_page.c:551:cl_page_owner_clear())
>  206748 osc_cache.c:2483:osc_teardown_async_page())
>  206815 cl_page.c:867:cl_page_delete())
>  206862 cl_page.c:837:cl_page_delete0())
>  206878 osc_cache.c:2478:osc_teardown_async_page())
>  206928 cl_page.c:869:cl_page_delete())
>  206930 cl_page.c:441:cl_page_state_set0())
>  206988 osc_page.c:206:osc_page_delete())
>  207021 cl_page.c:179:__cl_page_free())
>  207021 cl_page.c:193:cl_page_free())
>  207021 cl_page.c:532:cl_vmpage_page())
>  207024 cl_page.c:210:cl_page_free())
>  207075 cl_page.c:430:cl_page_state_set0())
>  207169 osc_cache.c:2505:osc_teardown_async_page())
>  207175 cl_page.c:475:cl_pagevec_put())
>  207202 cl_page.c:492:cl_pagevec_put())
>  207211 cl_page.c:822:cl_page_delete0())
>  207384 osc_page.c:178:osc_page_delete())
>  207422 osc_page.c:177:osc_page_delete())
>  413680 cl_page.c:433:cl_page_state_set0())
>  413701 cl_page.c:477:cl_pagevec_put())
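>
> (For reference, counts like these can be produced with a rough pipeline
> along the lines below, run against the dumped debug log; the regex is only
> an approximation of the "file.c:line:function())" prefix format and may
> need adjusting:
>
> lctl dk /tmp/lustre-debug.txt
> grep -oE '[a-z_]+\.c:[0-9]+:[A-Za-z0-9_]+\(\)\)' /tmp/lustre-debug.txt \
>   | sort | uniq -c | sort -n )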
>
>
>
> If anybody has any additional suggestions or requests for more info don’t
> hesitate to ask.
>
>
>
> Best,
>
>
>
> ellis
>
>
>
> *From:* Andreas Dilger <adilger at whamcloud.com>
> *Sent:* Tuesday, January 18, 2022 9:54 PM
> *To:* Ellis Wilson <elliswilson at microsoft.com>
> *Cc:* lustre-discuss at lists.lustre.org
> *Subject:* [EXTERNAL] Re: [lustre-discuss] Lustre Client Lockup Under
> Buffered I/O (2.14/2.15)
>
>
> On Jan 18, 2022, at 13:40, Ellis Wilson via lustre-discuss <
> lustre-discuss at lists.lustre.org> wrote:
>
>
>
> Recently we've switched from ZFS to ldiskfs as the backing filesystem to
> work around some performance issues, and I'm finding that when I put the
> cluster under load (with as little as a single client) I can almost
> completely lock up the client.  SSH sessions (even existing ones) stall, and
> iostat, top, etc. all freeze for 20 to 200 seconds.  This eases for small
> windows and recurs for as long as I leave the I/O-generating process
> running.  The client reports extremely high CPU and RAM usage, and appears
> to be consumed exclusively by 'system'-tagged work.  This is on 2.14.0, but
> I've reproduced it on more or less HOL for master-next.  If I do direct I/O,
> performance is fantastic and I have no such issues with CPU/memory pressure.
>
> Uname: Linux 85df894e-8458-4aa4-b16f-1d47154c0dd2-lclient-a0-g0-vm
> 5.4.0-1065-azure #68~18.04.1-Ubuntu SMP Fri Dec 3 14:08:44 UTC 2021 x86_64
> x86_64 x86_64 GNU/Linux
>
> In dmesg I see consistent spew on the client:
> [19548.601651] LustreError: 30918:0:(events.c:208:client_bulk_callback())
> event type 1, status -5, desc 00000000b69b83b0
> [19548.662647] LustreError: 30917:0:(events.c:208:client_bulk_callback())
> event type 1, status -5, desc 000000009ef2fc22
> [19549.153590] Lustre: lustrefs-OST0000-osc-ffff8d52a9c52800: Connection
> to lustrefs-OST0000 (at 10.1.98.7@tcp) was lost; in progress operations
> using this service will wait for recovery to complete
> [19549.153621] Lustre: 30927:0:(client.c:2282:ptlrpc_expire_one_request())
> @@@ Request sent has failed due to network error: [sent 1642535831/real
> 1642535833]  req@0000000002361e2d x1722317313374336/t0(0) o4->
> lustrefs-OST0001-osc-ffff8d52a9c52800@10.1.98.10@tcp:6/4 lens 488/448 e 0
> to 1 dl 1642535883 ref 2 fl Rpc:eXQr/0/ffffffff rc 0/-1 job:''
> [19549.153623] Lustre: 30927:0:(client.c:2282:ptlrpc_expire_one_request())
> Skipped 4 previous similar messages
>
> But I actually think this is a symptom of extreme memory pressure causing
> the client to timeout things, not a cause.
>
> Testing with obdfilter-survey (local) on the OSS side shows expected
> performance of the disk subsystem.  Testing with lnet_selftest from client
> to the OSS shows expected performance.  In neither case do I see the high
> CPU or memory-pressure issues.
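>
> (For reference, a typical lnet_selftest run for a client-to-OSS check looks
> roughly like the sketch below, where <client-ip> is a placeholder and the
> details will differ from the exact invocation used here:
>
> modprobe lnet_selftest                 # on console, client, and server nodes
> export LST_SESSION=$$
> lst new_session rw
> lst add_group clients <client-ip>@tcp
> lst add_group servers 10.1.98.7@tcp
> lst add_batch bulk
> lst add_test --batch bulk --from clients --to servers brw write check=simple size=1M
> lst run bulk
> lst stat clients servers               # watch throughput, then Ctrl-C
> lst end_session )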
>
> Reducing a variety of lctl tunables that appear to govern memory
> allowances for Lustre clients does not improve the situation.
>
>
>
> What have you reduced here?  llite.*.max_cached_mb, osc.*.max_dirty_mb,
> osc.*.max_rpcs_in_flight and osc.*.max_pages_per_rpc?
>
>
>
> By all appearances, the running iozone or even simple dd processes
> gradually (i.e., over a span of just 10 seconds or so) consume all 16 GB of
> RAM on the client I'm using.  I've generated bcc profile graphs for both
> on- and off-cpu analysis, and they are utterly boring -- they basically
> just reflect rampant calls to shrink_inactive_list resulting from
> page_cache_alloc in the presence of extreme memory pressure.
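>
> (For reference, on- and off-CPU folded stacks can be captured with the
> standard bcc tools roughly as below; tool paths and names vary by distro,
> e.g. profile-bpfcc in Ubuntu packages, and this may not match the exact
> invocation used here:
>
> /usr/share/bcc/tools/profile -f -F 99 30 > oncpu.folded
> /usr/share/bcc/tools/offcputime -f 30 > offcpu.folded
> flamegraph.pl oncpu.folded > oncpu.svg    # flamegraph.pl from the FlameGraph repo
> )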
>
>
>
> We have seen some issues like this that are being looked at, but they have
> mostly been seen on smaller VM clients used in testing, not on larger
> production clients.  Are you able to test with more RAM on the client?
> Have you tried with 2.12.8 installed on the client?
>
>
>
> Cheers, Andreas
>
> --
>
> Andreas Dilger
>
> Lustre Principal Architect
>
> Whamcloud
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss at lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>