<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

</head>

<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">

On Jan 21, 2021, at 11:32, James Simmons <<a href="mailto:jsimmons@casper.infradead.org" class="">jsimmons@casper.infradead.org</a>> wrote:<br class="">

<div>

<blockquote type="cite" class=""><br class="Apple-interchange-newline">

<div class="">

<div class=""> One of the challenging issues for very large scale file systems is the<br class="">

performance crash when you cross a stripe count of about 670. This is due to<br class="">

the memory allocations going from kmalloc to vmalloc. Once you start to<br class="">

use vmalloc to allocate the ptlrpc message buffers, all the allocations<br class="">

start to serialize on a global spinlock.<br class="">

 Looking for a solution, the best one I have found so far is using the<br class="">

generic radix tree API. You have to allocate a page worth of data at a<br class="">

time, so it's clunky to use, but I think we could make it work. What do<br class="">

you think?<br class="">

<br class="">

<a href="https://www.kernel.org/doc/html/latest/core-api/generic-radix-tree.html" class="">https://www.kernel.org/doc/html/latest/core-api/generic-radix-tree.html</a><br class="">

</div>

</div>

</blockquote>

<br class="">

</div>

<div>I think the first thing to figure out here is whether vmalloc() of the reply buffer is</div>

<div>definitely the problem.  670 stripes is only a 16KB layout, which I'd think would be</div>

<div>handled by kmalloc() almost all of the time, unless memory is very fragmented.</div>

<div>It would also be worthwhile to see whether the __GFP_ZERO of this allocation is a</div>

<div>source of performance problems.  While I wouldn't recommend disabling it</div>

<div>to start with, at least checking whether memset() shows up in the profile would be useful.</div>
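<div><br class=""></div>

<div>As a rough check of the 16KB figure, here is a minimal sketch of the arithmetic.  It assumes a 32-byte fixed layout header and 24 bytes per stripe entry; those two constants and the helper names are assumptions for illustration and should be verified against the Lustre headers.</div>

<pre>
```python
# Assumed sizes, for illustration only; verify against the Lustre headers.
LOV_HEADER_BYTES = 32      # assumed size of the fixed layout header
LOV_PER_STRIPE_BYTES = 24  # assumed size of one per-stripe object entry


def layout_bytes(stripe_count):
    """Estimate the layout size in bytes for a plain striped file."""
    return LOV_HEADER_BYTES + stripe_count * LOV_PER_STRIPE_BYTES


def max_stripes_within(limit_bytes):
    """Largest stripe count whose estimated layout fits in limit_bytes."""
    return (limit_bytes - LOV_HEADER_BYTES) // LOV_PER_STRIPE_BYTES
```
</pre>

<div>Under those assumptions, layout_bytes(670) stays a little below 16 * 1024, so the layout alone should not force a larger allocation; anything else carried in the same reply buffer would have to account for the difference.</div>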

<div><br class="">

</div>

<div>I definitely recall some strangeness in the MDT reply code for very large replies</div>

<div>that may mean it is retrying the message if it is too large on the first send.</div>

<div>There are probably some improvements that could be made to commit v2_5_56_0-71-g006f258300</div>

<div>"LU-3338 llite: Limit reply buffer size" to better choose the reply buffer size if large</div>

<div>layouts are commonly returned. I think it just bails and limits the layout buffer to</div>

<div>a max of PAGE_SIZE or similar, instead of having a better algorithm.</div>

<div><br class="">

</div>

<div><br class="">

</div>

<div>In times gone by, we also had a patch to improve vmalloc() performance, but</div>

<div>unfortunately it was rejected upstream because "we don't want developers</div>

<div>using vmalloc() and if the performance is bad they will avoid it", or similar.</div>

<div><br class="">

</div>

<div>Now that kvmalloc() is a widely used interface in the code, maybe improving</div>

<div>vmalloc() performance is of interest again (i.e. removing the single global lock)?</div>

<div>One simple optimization was to linearly use the vmalloc address space from</div>

<div>start to end, instead of trying to have a "smart" usage of the address space.  It</div>

<div>is 32TiB in size, so takes a while to exhaust (probably several days under normal</div>

<div>usage), so the first pass through is "free".</div>
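<div><br class=""></div>

<div>To illustrate the linear-use idea, here is a toy model, not the kernel implementation.  The class and function names, the 4KiB page size, and the single guard page per allocation are assumptions made for the sketch.</div>

<pre>
```python
PAGE_SIZE = 4096  # assumed page size


def footprint(nbytes):
    """Address space consumed by one allocation: whole pages plus one guard page."""
    npages = (nbytes + PAGE_SIZE - 1) // PAGE_SIZE
    return (npages + 1) * PAGE_SIZE


class LinearAddressSpace:
    """Toy model: hand out virtual ranges front to back, never searching."""

    def __init__(self, start, size):
        self.next_addr = start
        self.end = start + size

    def alloc(self, nbytes):
        """Return the start of the next range, or None once the space is used up."""
        length = footprint(nbytes)
        if self.next_addr + length > self.end:
            # The "free" first pass is over; a real allocator would now
            # have to fall back to finding holes left by freed ranges.
            return None
        addr = self.next_addr
        self.next_addr += length
        return addr


def seconds_until_exhausted(space_bytes, nbytes, allocs_per_second):
    """Time for the first pass to consume the whole space at a steady rate."""
    return space_bytes / (footprint(nbytes) * allocs_per_second)
```
</pre>

<div>With a purely hypothetical load of 10,000 allocations per second of 16KiB each, seconds_until_exhausted(32 * 2**40, 16384, 10000) comes out at roughly two days; the real figure depends entirely on the actual allocation rate and sizes.</div>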

<div><br class="">

</div>

<div><br class="">

</div>

<div>It isn't clear to me what your goal with the radix tree is.  Do you intend to replace</div>

<div>vmalloc() usage in Lustre with a custom memory allocator based on this, or is the</div>

<div>goal to optimize the kernel vmalloc() allocation using the radix tree code?</div>

<div><br class="">

</div>

<div>I think the use of a custom memory allocator in Lustre would be far more nasty than</div>

<div>lots of the things that are raised as objections to upstream inclusion, so I think it</div>

<div>would be a step backward.  Optimizing vmalloc() in upstream kernels (if changes</div>

<div>are accepted) would be a better use of time.  For the few sites that have many OSTs,</div>

<div>they can afford a kernel patch on the client (likely they have a custom kernel from</div>

<div>their system vendor anyway), and the other 99% of users will not need it.</div>

<div><br class="">

</div>

<div><br class="">

</div>

<div>I think a more practical approach might be to have a pool of preallocated reply</div>

<div>buffers (using vmalloc()) that is kept on the client.  That would avoid the overhead</div>

<div>of vmalloc/vfree each time, and not need intrusive code changes.  In the likely</div>

<div>case of a small layout for a file (even if _some_ layouts are very large), the saved</div>

<div>RPC replay buffer can be kmalloc'd normally and copied over.  I don't think there</div>

<div>will be real-world workloads where a client is keeping thousands of different files</div>

<div>open with huge layouts, so it is likely that the number of large buffers in the reply</div>

<div>pool will be relatively small.</div>
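<div><br class=""></div>

<div>A minimal sketch of the pool idea follows, as a toy model rather than Lustre code.  ReplyBufferPool, receive_reply, and the fallback counter are hypothetical names; the bytearray objects stand in for vmalloc()'d buffers and the bytes copy stands in for a right-sized kmalloc() plus memcpy().</div>

<pre>
```python
class ReplyBufferPool:
    """Toy model of a small pool of preallocated large reply buffers."""

    def __init__(self, count, buf_size):
        self.capacity = count
        self.buf_size = buf_size
        # Allocate the large buffers once, up front.
        self.free = [bytearray(buf_size) for _ in range(count)]
        self.fallback_allocs = 0

    def get(self):
        """Take a pooled buffer; allocate a new one only if the pool is empty."""
        if self.free:
            return self.free.pop()
        self.fallback_allocs += 1  # stands in for a slow-path vmalloc()
        return bytearray(self.buf_size)

    def put(self, buf):
        """Return a buffer for reuse; drop it if the pool is already full."""
        if self.capacity > len(self.free):
            self.free.append(buf)


def receive_reply(pool, payload):
    """Receive into a pooled buffer, then keep only a right-sized copy."""
    if len(payload) > pool.buf_size:
        raise ValueError("reply larger than pooled buffer")
    big = pool.get()
    big[:len(payload)] = payload       # stands in for the network receive
    saved = bytes(big[:len(payload)])  # small copy kept for later use
    pool.put(big)                      # large buffer goes straight back
    return saved
```
</pre>

<div>The large buffer is only held for the duration of the receive, so the pool can stay small even if many small replies are saved.</div>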

<br class="">

<div class="">

<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">

<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">

<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">

<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">

<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">

<div>Cheers, Andreas</div>

<div>--</div>

<div>Andreas Dilger</div>

<div>Principal Lustre Architect</div>

<div>Whamcloud</div>


</div>

</div>

</div>

</div>

</div>


</div>

<br class="">

</body>

</html>