<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

<meta name="Generator" content="Microsoft Word 15 (filtered medium)">

<style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        font-size:12.0pt;

        font-family:"Calibri",sans-serif;}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:#0563C1;

        text-decoration:underline;}

span.EmailStyle19

        {mso-style-type:personal-reply;

        font-family:"Calibri",sans-serif;

        color:windowtext;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;}

@page WordSection1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.WordSection1

        {page:WordSection1;}

--></style>

</head>

<body lang="EN-US" link="#0563C1" vlink="purple" style="word-wrap:break-word">

<div class="WordSection1">

<p class="MsoNormal"><span style="font-size:11.0pt">Perhaps a better (although very closely related) question to ask would be: how can we improve our results on the MD tests in the io500 benchmark?<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt">The file systems referred to in the info below are:<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt">nobackup – a lustre FS on the hardware we've been discussing with a ZFS MDT, nominally running on mds0<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt">ephemeral – a lustre FS on the hardware we've been discussing with an ldiskfs MDT, nominally running on mds1<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt">scratch – a standard NFS mount<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt">local – a local SSD<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt">A little more background on the motivation here.  We have some fairly large software development projects in the lab.  One of the largest active projects has a git repo with about 500,000 files totaling 5 GB in size.  A clone of this repo takes 550 seconds on lustre and about 150 seconds on NFS.  A status takes 15 seconds on lustre and 3 seconds on NFS.  Not surprisingly, the timings are greatly reduced on a local SSD.  See the attached plot in git_timings.pdf for details.  The slowness on lustre is largely (completely?) driven by the MD performance.  Obviously, we work with the repo on a local file system when possible to avoid the performance hit.  But one of the workflows involves Monte Carlo analysis against this repo, varying dozens of parameters, running thousands of cases and analyzing the results.  This produces a lot of data and necessitates the shared FS, both for running the Monte Carlo cases and simply for storing the volume of data these runs produce.<o:p></o:p></span></p>
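<p class="MsoNormal"><span style="font-size:11.0pt">For anyone wanting to reproduce timings like these, here is a minimal Python sketch (it is not the script behind the attached plot).  The helper names time_command and slowdown and the example paths are hypothetical.  It only wraps a command in a wall-clock timer and expresses the lustre/NFS ratio for the numbers quoted above.<o:p></o:p></span></p>

<pre>
```python
import subprocess
import time


def time_command(cmd, cwd=None):
    """Run cmd (a list of strings) in cwd and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def slowdown(slow_seconds, fast_seconds):
    """Return how many times slower the first timing is than the second."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    # Timings quoted above, in seconds (lustre first, NFS second).
    print("clone:  %.1fx slower on lustre" % slowdown(550, 150))
    print("status: %.1fx slower on lustre" % slowdown(15, 3))

    # Hypothetical usage against a working copy on each file system:
    # for repo in ["/nobackup/me/repo", "/scratch/me/repo"]:
    #     print(repo, time_command(["git", "status"], cwd=repo))
```
</pre>

<p class="MsoNormal"><span style="font-size:11.0pt">With the quoted numbers the clone is roughly 3.7x slower and the status 5x slower on lustre.  Repeated runs on the same client can be served from cache, so first-run and repeat-run timings are best reported separately.<o:p></o:p></span></p>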

<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt">There are several other scenarios in which we are working with smaller, but still sizeable, data sets (git repos and other forms) on the lustre file system, and the MD sluggishness is noticeable and annoying.  So we would like to try to improve MD performance.  <o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt">To further characterize and compare the IO performance on these file systems, I've run the io500 benchmarks.  The attached plots show the results.  This is a completely "out of the box" run on a single node: I'm just running "./io500.sh config-minimal.ini".  (I've also run, or tried to run, the 10-node case for a more direct comparison to the results on io500.org, but that's a slightly different objective.)  I figure the single-node run is analogous to a person working with a git repo.  This is on a 10 gigabit ethernet client.  Details are attached, but the MD results are fairly consistent with the git timings above: lustre is about 3x to 10x slower than NFS.  I'd be curious to get some feedback on these MD performance numbers.  Do they seem low compared to other LFSs out there?  As I mentioned in the original post in this thread, our numbers are quite low when compared to even the lowest numbers on the current io500 list.<o:p></o:p></span></p>
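<p class="MsoNormal"><span style="font-size:11.0pt">As a quick cross-check outside of io500, a single-client metadata comparison can be sketched in a few lines of Python.  This is only loosely in the spirit of the create/stat/delete phases of the MD tests and is not a substitute for them.  The function name md_rates and the file count are assumptions made for illustration.<o:p></o:p></span></p>

<pre>
```python
import os
import sys
import tempfile
import time


def md_rates(root, n_files=1000):
    """Return files-per-second rates for create, stat and delete under root."""
    rates = {}
    # Work in a scratch directory that is removed again afterwards.
    with tempfile.TemporaryDirectory(dir=root) as workdir:
        paths = [os.path.join(workdir, "f%06d" % i) for i in range(n_files)]

        start = time.perf_counter()
        for path in paths:
            # Create an empty file: metadata only, no data is written.
            with open(path, "w"):
                pass
        rates["create"] = n_files / max(time.perf_counter() - start, 1e-9)

        start = time.perf_counter()
        for path in paths:
            os.stat(path)
        rates["stat"] = n_files / max(time.perf_counter() - start, 1e-9)

        start = time.perf_counter()
        for path in paths:
            os.unlink(path)
        rates["delete"] = n_files / max(time.perf_counter() - start, 1e-9)
    return rates


if __name__ == "__main__":
    # Pass one directory per file system to compare on the command line.
    for root in sys.argv[1:]:
        for phase, rate in md_rates(root).items():
            print("%s %s: %.0f files/s" % (root, phase, rate))
```
</pre>

<p class="MsoNormal"><span style="font-size:11.0pt">Because the stat phase runs on the same client right after the create phase, it will often be answered from the client cache, so the create and delete rates are the more telling numbers in this sketch.<o:p></o:p></span></p>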

<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt">How is MD performance expected to increase with increasing numbers of clients?  I know bandwidth increases as you grab more OSTs, but would MD performance be expected to increase at all?  We are not using DoM or DNE. <o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt">Also as mentioned before, we will upgrade lustre soon.  I'd like to stick with the 2.12 LTS stream.  But would the upcoming 2.14 have any potential MD performance advantages? 

<o:p></o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>

<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>

<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">

<p class="MsoNormal"><b><span style="color:black">From: </span></b><span style="color:black">lustre-discuss &lt;lustre-discuss-bounces@lists.lustre.org&gt; on behalf of "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" &lt;darby.vicker-1@nasa.gov&gt;<br>

<b>Date: </b>Wednesday, January 6, 2021 at 9:29 AM<br>

<b>To: </b>Andreas Dilger &lt;adilger@whamcloud.com&gt;<br>

<b>Cc: </b>"lustre-discuss@lists.lustre.org" &lt;lustre-discuss@lists.lustre.org&gt;<br>

<b>Subject: </b>Re: [lustre-discuss] [EXTERNAL] Re: Tuning for metadata performance<o:p></o:p></span></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:11.0pt"><o:p> </o:p></span></p>

</div>

<p class="MsoNormal"><span style="font-size:11.0pt">My apologies – I posted some bad info.  While we started out with HDDs in the MDS, we switched to SSDs pretty early on.  So that's not the source of our MD slowness.  Can you do NVMe in an external JBOD? 

</span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>

<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">

<p class="MsoNormal"><b><span style="color:black">From: </span></b><span style="color:black">Andreas Dilger &lt;adilger@whamcloud.com&gt;<br>

<b>Date: </b>Tuesday, January 5, 2021 at 11:51 AM<br>

<b>To: </b>"Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" &lt;darby.vicker-1@nasa.gov&gt;<br>

<b>Cc: </b>"lustre-discuss@lists.lustre.org" &lt;lustre-discuss@lists.lustre.org&gt;<br>

<b>Subject: </b>[EXTERNAL] Re: [lustre-discuss] Tuning for metadata performance</span><o:p></o:p></p>

</div>

<div>

<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>

</div>

<p class="MsoNormal">Probably the best single thing you could do for metadata performance would be to switch to SSD, or better NVMe, storage.  ZFS is very sync- and IOPS-hungry, so using HDDs is killer for ZFS metadata performance.<o:p></o:p></p>

<div>

<p class="MsoNormal"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">If you want to minimize the downtime, you could incrementally replace the HDDs in the zpool with larger SSD devices and resilver between each one.  I recall LLNL doing this in the first months of their first ZFS-based Lustre filesystem for this reason.<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">Going to NVMe-based devices is even better for IOPS/bandwidth, but can't be done completely live.  You could potentially use repeated zfs send/recv to get an almost up-to-date copy on a new MDS, then take a small outage to do the final resync.  However, I've also seen reports that send/recv is painfully slow with HDD MDTs, so you should probably test that before committing to a solution. <o:p></o:p></p>

</div>
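<div>

<p class="MsoNormal">A minimal sketch of what the repeated send/recv sequence could look like, written as a Python dry-run that only builds the shell commands.  The function name send_recv_plan and the pool, dataset, snapshot and host names are hypothetical.  The sketch assumes a plain full send followed by incremental sends (zfs send -i) into a forced receive (zfs receive -F) over ssh.<o:p></o:p></p>

<pre>
```python
import shlex


def send_recv_plan(dataset, target_host, target_dataset, snapshots):
    """Return shell commands: one full send, then one incremental per snapshot."""
    plan = []
    previous = None
    recv = "ssh %s zfs receive -F %s" % (
        shlex.quote(target_host), shlex.quote(target_dataset))
    for name in snapshots:
        current = shlex.quote("%s@%s" % (dataset, name))
        plan.append("zfs snapshot %s" % current)
        if previous is None:
            # First pass: send the whole dataset.
            send = "zfs send %s" % current
        else:
            # Later passes: send only the changes since the previous snapshot.
            send = "zfs send -i %s %s" % (previous, current)
        plan.append("%s | %s" % (send, recv))
        previous = current
    return plan


if __name__ == "__main__":
    # Hypothetical names; nothing is executed, the plan is only printed.
    for line in send_recv_plan("mdtpool/mdt0", "newmds", "newpool/mdt0",
                               ["pass1", "pass2", "final"]):
        print(line)
```
</pre>

<p class="MsoNormal">The last snapshot would be taken during the outage, with the MDT stopped, so that nothing changes after it.  A plain send does not carry dataset properties; zfs send has options such as -p and -R for that, and whether they are needed for a Lustre MDT should be checked before relying on this.<o:p></o:p></p>

</div>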

<div>

<p class="MsoNormal"> <o:p></o:p></p>

<div>

<p class="MsoNormal">Cheers, Andreas<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal"><br>

<br>

<br>

<o:p></o:p></p>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<p class="MsoNormal" style="margin-bottom:12.0pt">On Jan 5, 2021, at 08:47, Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.] &lt;darby.vicker-1@nasa.gov&gt; wrote:<o:p></o:p></p>

</blockquote>

</div>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<div>

<p class="MsoNormal"><span style="font-size:11.0pt">Hello,</span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">I'm looking for some advice on tuning our existing lustre file system to achieve better metadata performance.  This file system is getting fairly old – it's been in production for almost 4 years now.  The hardware and our existing tuning efforts can be found here. </span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt"><a href="http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-April/014390.html">http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2017-April/014390.html</a></span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">The hardware is the same but we have upgraded the software stack a few times – now on CentOS 7.6, ZFS 0.7.9 and lustre 2.10.8.  We do plan to upgrade to the latest CentOS 7.x and either lustre 2.12 or 2.13

 soon.  The MDS hardware isn't well-described in that thread so here are more details:</span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">Chassis: Supermicro 2U Twin Server</span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">Processor: 4 x QuadCore Xeon Processor E5-2637 v2 3.50GHz (2 sockets/8 cores per node)</span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">Memory: 16 x 16GB PC3-14900 1866MHz DDR3 ECC Registered DIMM (128GB per node)</span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">External JBOD:</span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">Chassis: 24x HotSwap 2.5" SAS  12Gb/s SAS Dual Expander</span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">Drives: 12 x 600GB SAS 3.0 12.0Gb/s 15000RPM  2.5"  Seagate Enterprise Performance 15K HDD (512n)</span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">Controller Card: LSI SAS 9300-8e SAS 12Gb/s PCIe 3.0 8-Port Host Bus Adapter</span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">The above hardware and tuning served us well for a long time but the lab has grown, both in the number of lustre clients (now up to ~200 ethernet clients and ~500 IB clients) and in the number of users.  With the extra users have come different types of workloads.  Previously, the file system was mostly used for workloads with a fairly small number of large files.  We now see workloads that include hundreds of concurrent processes all doing mixed small and large file IO on a lot of files (e.g. each process clones a repo, compiles a code and runs a serial sim that writes a lot of data). 

</span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">I recently ran the io500 tests and our LFS stats for MDEasy and MDHard are pretty bad, even when compared to the lowest MD stats on the current io500 list.  Our standard NFS server handily beats our LFS with respect to MD performance.  So I'm hopeful that we can squeeze more MD performance out of our LFS.  Obviously, software tuning on the existing hardware would be preferred but we are open to hardware additions/upgrades if that would help (e.g. adding more MDSs).  There are a lot of tuning options in both ZFS and lustre so I'm hoping someone can point me in the right direction.  Are DNE and/or DoM expected to help?  I attended the SC20 Lustre BoF and it sounds like 2.13 has some metadata performance improvements, so just an upgrade might help.  We have dual MDSs now but for HA, not performance.  I'd hate to lose the HA aspect as we utilize it for failover quite a bit (maintenance, etc.) but it would probably be worth it if MD performance was significantly improved.  If I understand correctly, DNE adds some overhead, so performance suffers with just two MDSs and the benefit only appears with 4 or more, correct?  So that wouldn't be a good option for us unless we add MDSs?  Would an upgrade to SSD or NVMe in our MDTs help? 

</span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">I would greatly appreciate thoughts on the best path forward for making improvements. 

</span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt"> </span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">Thanks,</span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">Darby </span><o:p></o:p></p>

<p class="MsoNormal"><span style="font-size:11.0pt">_______________________________________________<br>

lustre-discuss mailing list<br>

lustre-discuss@lists.lustre.org<br>

http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org</span><o:p></o:p></p>

</div>

</blockquote>

</div>

</div>

</body>

</html>