<font face="arial,helvetica,sans-serif"><br>Hi Andreas,<br><br>Thanks for your update! Some comments below.<br><br>JF<br><br></font><br><div class="gmail_quote">

On Mon, Oct 15, 2012 at 7:04 PM, Dilger, Andreas <span dir="ltr"><<a href="mailto:andreas.dilger@intel.com" target="_blank">andreas.dilger@intel.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">On Oct 15, 2012, at 1:01 PM, Jean-Francois Le Fillatre wrote:<br>

> Yes this is one strange formula... There are two ways of reading it:<br>

><br>

> - "one thread per 128MB of RAM, times the number of CPUs in the system"<br>

> On one of our typical OSSes (24 GB, 8 cores), that would give: ((24*1024) / 128) * 8 = 1536<br>

</div>> And that's waaaay out there…<br>

<br>

This formula was first created when there was perhaps 2GB of RAM and 2 cores in the system, intended to get some rough correspondence between server size and thread count.  Note that there is also a default upper limit of 512 for threads created on the system.  However, on some systems in the past with slow/synchronous storage, having 1500-2000 IO threads was still improving performance and could be set manually.  That said, it was always intended as a reasonable heuristic and local performance testing/tuning should pick the optimal number.<br>


<div class="im"><br>

> - "as many threads as you can fit (128MB * numbers of CPUs) in the RAM of your system"<br>

> Which would then give: (24*1024) / (128*8) = 24<br>

<br>

</div>This isn't actually representing what the formula calculates.<br>
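<br>
For reference, the two readings quoted above can be worked through numerically. The following is only a sketch of the arithmetic in this thread, not code taken from Lustre: the function names are made up for illustration, and it assumes that the default upper limit of 512 simply clamps whatever the formula produces.<br>
<pre>
```python
OSS_THREADS_CAP = 512  # default upper limit mentioned above (assumed to act as a clamp)


def reading_one(ram_mb: int, cpus: int) -> int:
    """First reading: one thread per 128 MB of RAM, times the number of CPUs."""
    return (ram_mb // 128) * cpus


def reading_two(ram_mb: int, cpus: int) -> int:
    """Second reading: how many (128 MB * CPUs) chunks fit in RAM."""
    return ram_mb // (128 * cpus)


def capped(threads: int, cap: int = OSS_THREADS_CAP) -> int:
    """Clamp a raw thread count to the upper limit."""
    return min(threads, cap)


ram_mb, cpus = 24 * 1024, 8  # the 24 GB, 8-core OSS from this thread
print(reading_one(ram_mb, cpus))          # 1536
print(reading_two(ram_mb, cpus))          # 24
print(capped(reading_one(ram_mb, cpus)))  # 512
```
</pre>
On a 2 GB, 2-core server of the kind the formula was first written for, the first reading gives 32 threads, which is far more modest than the 1536 computed for a 24 GB, 8-core OSS.<br>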

<div class="im"><br>

> For a whole system, that's really low. But for one single OST, it almost makes sense, in which case you'd want to multiply that by the number of OSTs connected to your OSS.<br>

<br>

</div>The rule of thumb that I've seen in the past, based on benchmarks at many sites is 32 threads/OST, which will keep the low-level elevators busy, but not make the queue depth too high.<br>

<div class="im"><br>

> Here, we identified the software RAID as the major limiting factor, both in terms of bandwidth and CPU use. So I ran some tests on a spare machine to get load and performance figures for one array, using sgpdd-survey. Then, taking into account the number of OSTs per OSS (4) and the overhead of Lustre, I figured out that the best thread count would be around 96 (which is 24*4, spot on).<br>


><br>

> One major limitation in Lustre 1.8.x (I don't know if it has changed in 2.x) is that only the global thread count for the OSS can be specified. We have cases where all OSS threads are used on a single OST, and that completely trashes the bandwidth and latency. We would really need a max thread count per OST too, so that no single OST would get hit that way. On our systems, I'd put the max OST thread count at 32 (to stay in the software RAID performance sweet spot) and the max OSS thread count at 96 (to limit CPU load).<br>
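<br>
The effect of the two limits proposed above can be illustrated with a toy admission model. This is a hypothetical sketch, not Lustre code: Lustre 1.8 has no per-OST limit as described in this thread, and the function name, OST names and request lists below are invented for the example. Only the numbers 96 and 32 come from the message above.<br>
<pre>
```python
from collections import Counter

OSS_MAX_THREADS = 96  # global cap per OSS suggested above
OST_MAX_THREADS = 32  # per-OST cap suggested above (hypothetical feature)


def admit(requests, oss_max=OSS_MAX_THREADS, ost_max=OST_MAX_THREADS):
    """Return how many requests per OST get a thread under both caps.

    `requests` is an iterable of OST names, one entry per incoming request.
    Requests that hit a cap are left waiting and are not counted.
    """
    busy = Counter()
    for ost in requests:
        if sum(busy.values()) >= oss_max:
            break  # every thread on the OSS is in use
        if busy[ost] >= ost_max:
            continue  # this OST is at its cap, so the request waits
        busy[ost] += 1
    return dict(busy)


# Worst case from this thread: every request targets the same OST.
print(admit(["OST0000"] * 200))                # {'OST0000': 32}
# Same load with only the global cap, as in 1.8.x.
print(admit(["OST0000"] * 200, ost_max=96))    # {'OST0000': 96}
```
</pre>
With the per-OST cap in place, a burst aimed at one target leaves 64 of the 96 threads free for the other OSTs. Without it, a single OST can absorb all 96.<br>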


<br>

</div>Right.  This is improved in Lustre 2.3, which binds the threads to specific cores.  I believe it is also possible to bind OSTs to specific cores for PCI/HBA/HCA affinity, though I'm not 100% sure whether the OST/CPU binding was included.<br>

</blockquote><div> </div><div>Even if I could bind both OSTs and threads to a given CPU, that is only a topological optimization for bandwidth and latency. What would prevent a thread from answering a request for a target that is bound to another CPU? I mean, this is a very nice feature, and with proper configuration it can bring some notable improvements in performance, but I fail to see how it would solve the issue of having all threads on an OSS hammering a single OST.<br>

<br>I am aware that this is a corner case: in general use the load is spread over multiple targets and there's no problem. But we've hit it here a few times, and I know of some other sites that have had the issue too. If you combine that with RAID issues (like slow disk / read errors / disk failure / rebuild or resync), you have a machine that locks up so badly that a cold reset is the only way to get it back under control.<br>

<br>Worst case? Yes. But because the consequences of such a situation can be so nasty, I would be very happy to be able to control thread allocation per OST more finely.<br><br> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div><div class="h5">

> Thanks!<br>

> JF<br>

><br>

><br>

><br>

> On Mon, Oct 15, 2012 at 2:20 PM, David Noriega <<a href="mailto:tsk133@my.utsa.edu">tsk133@my.utsa.edu</a>> wrote:<br>

> How does one estimate a good number of service threads? I'm not sure I<br>

> understand the following: 1 thread / 128MB * number of cpus<br>

><br>

> On Wed, Oct 10, 2012 at 9:17 AM, Jean-Francois Le Fillatre<br>

> <<a href="mailto:jean-francois.lefillatre@clumeq.ca">jean-francois.lefillatre@clumeq.ca</a>> wrote:<br>

> ><br>

> > Hi David,<br>

> ><br>

> > It needs to be specified as a module parameter at boot time, in<br>

> > /etc/modprobe.conf. Check the Lustre tuning page:<br>

> > <a href="http://wiki.lustre.org/manual/LustreManual18_HTML/LustreTuning.html" target="_blank">http://wiki.lustre.org/manual/LustreManual18_HTML/LustreTuning.html</a><br>

> > <a href="http://wiki.lustre.org/manual/LustreManual20_HTML/LustreTuning.html" target="_blank">http://wiki.lustre.org/manual/LustreManual20_HTML/LustreTuning.html</a><br>

> ><br>

> > Note that once created, the threads won't be destroyed, so if you want to<br>

> > lower your thread count you'll need to reboot your system.<br>

> ><br>

> > Thanks,<br>

> > JF<br>

> ><br>

> ><br>

> > On Tue, Oct 9, 2012 at 6:00 PM, David Noriega <<a href="mailto:tsk133@my.utsa.edu">tsk133@my.utsa.edu</a>> wrote:<br>

> >><br>

> >> Is ost.OSS.ost_io.threads_max a parameter that, when set via lctl<br>

> >> conf_param, will persist between reboots/remounts?<br>

> >> _______________________________________________<br>

> >> Lustre-discuss mailing list<br>

> >> <a href="mailto:Lustre-discuss@lists.lustre.org">Lustre-discuss@lists.lustre.org</a><br>

> >> <a href="http://lists.lustre.org/mailman/listinfo/lustre-discuss" target="_blank">http://lists.lustre.org/mailman/listinfo/lustre-discuss</a><br>

> ><br>

> ><br>

> ><br>

> ><br>

> > --<br>

> > Jean-François Le Fillâtre<br>

> > Calcul Québec / Université Laval, Québec, Canada<br>

> > <a href="mailto:jean-francois.lefillatre@clumeq.ca">jean-francois.lefillatre@clumeq.ca</a><br>

> ><br>

><br>

><br>

><br>

> --<br>

> David Noriega<br>

> CSBC/CBI System Administrator<br>

> University of Texas at San Antonio<br>

> One UTSA Circle<br>

> San Antonio, TX 78249<br>

> Office: BSE 3.114<br>

> Phone: <a href="tel:210-458-7100" value="+12104587100">210-458-7100</a><br>

> <a href="http://www.cbi.utsa.edu" target="_blank">http://www.cbi.utsa.edu</a><br>

><br>

> Please remember to acknowledge the RCMI grant, wording should be as<br>

> stated below: This project was supported by a grant from the National<br>

> Institute on Minority Health and Health Disparities (G12MD007591) from<br>

> the National Institutes of Health. Also, remember to register all<br>

> publications with PubMed Central.<br>


><br>

><br>

><br>


><br>


<br>

</div></div>Cheers, Andreas<br>

--<br>

Andreas Dilger<br>

Lustre Software Architect<br>

Intel Corporation<br>


</blockquote></div><br><br clear="all"><br>-- <br>Jean-François Le Fillâtre<br>Calcul Québec / Université Laval, Québec, Canada<br><a href="mailto:jean-francois.lefillatre@clumeq.ca" target="_blank">jean-francois.lefillatre@clumeq.ca</a><br>

<br>