[Lustre-discuss] Question about setting max service threads

Mon Aug 15 19:56:36 PDT 2011

Andreas answered the question asked, and did an excellent job.

But to answer the unasked question of whether reducing the thread count
will really fix the problem:

These "slow" and "overloaded" messages are often NOT caused by mere disk
overload from too many service threads.  For example, one recent issue
was tracked down to very long free space allocation times, because the
free space bitmaps had to be read from disk.  Memory allocation has also
commonly been the major time sink, since with Lustre 1.8 the service
threads no longer reuse the buffer and have to allocate new memory on
every request.  NUMA zoned allocations were especially problematic:
apparently the "best" pages to free tend to be found on the "wrong" NUMA
node, so it took a lot of time and work to free up space on the local
NUMA node before the allocation could succeed.

Bug 23826 had patches to track service times better, which will help  
you see how much of an issue this really is.
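
Even without those patches, the console messages already carry some
timing data.  As a rough sketch (not part of Lustre and not what the
Bug 23826 patches do), a small script can pull the durations out of the
"Service thread pid ... completed after ...s" and "slow ..." messages
quoted below and summarize them.  The function name, the regular
expressions, and the summary keys are my own assumptions, based only on
the message text in Mike's post; other Lustre versions may word these
messages differently.

```python
import re
from statistics import mean

# Patterns written against the two console messages quoted in this thread.
# They are assumptions about the message wording, not a Lustre-defined format.
SERVICE_RE = re.compile(r"Service thread pid (\d+) completed after (\d+(?:\.\d+)?)s")
SLOW_RE = re.compile(r"(\S+): slow (\S+) (\S+) (\d+)s")


def summarize(log_text):
    """Summarize slow-request messages found in a chunk of log text."""
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = " ".join(log_text.split())

    # Durations, in seconds, reported for completed service threads.
    service_times = [float(sec) for _pid, sec in SERVICE_RE.findall(text)]

    # Count "slow ..." messages per target (e.g. per OST).
    slow_by_target = {}
    for target, _op, _phase, _sec in SLOW_RE.findall(text):
        slow_by_target[target] = slow_by_target.get(target, 0) + 1

    return {
        "service_count": len(service_times),
        "service_max": max(service_times) if service_times else None,
        "service_mean": mean(service_times) if service_times else None,
        "slow_by_target": slow_by_target,
    }


# The messages from the original post, unwrapped to one message per line.
SAMPLE = """
Lustre: lustre-OST0004: slow commitrw commit 191s due to heavy IO load
Aug 15 13:00:38 lustre-oss-0-2 kernel: Lustre: Service thread pid 17651 completed after 236.04s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0004: slow commitrw commit 191s due to heavy IO load
Lustre: Service thread pid 16436 completed after 210.17s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Running the same summary before and after a change to threads_max gives
a crude way to see whether the thread count is really what is driving
the long service times.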

See also Bug 22516, which strives to normalize server threads per OST,  
rather than per server.

Bug 22886 discusses issues with the elevator taking 1MB IOs and
converting them into "odd" sizes, which, depending on the array, could
also have an impact on IO.

Bug 23805 has some additional rambling along this line as well.

Kevin

On Aug 15, 2011, at 6:36 PM, Andreas Dilger <adilger at whamcloud.com>  
wrote:

> On 2011-08-15, at 3:58 PM, Mike Hanby wrote:
>> Our OSS servers are logging quite a few "heavy IO load" messages,
>> combined with system load (via 'uptime') being reported in the 100's
>> to several 100's range.
>>
>> Lustre: lustre-OST0004: slow commitrw commit 191s due to heavy IO  
>> load
>> Aug 15 13:00:38 lustre-oss-0-2 kernel: Lustre: Service thread pid  
>> 17651 completed after 236.04s. This indicates the system was  
>> overloaded (too many service threads, or there were not enough  
>> hardware resources).
>> Lustre: Skipped 1 previous similar message
>> Lustre: lustre-OST0004: slow commitrw commit 191s due to heavy IO  
>> load
>> Lustre: Service thread pid 16436 completed after 210.17s. This  
>> indicates the system was overloaded (too many service threads, or  
>> there were not enough hardware resources).
>>
>> I'd like to test setting the ost_io.threads_max to values lower  
>> than 512.
>>
>> Question 1: Will this command survive a reboot "lctl set_param  
>> ost.OSS.ost_io.threads_max=256"
>
> This is only a temporary setting.
>
>> or do I need to also run "lctl conf_param  
>> ost.OSS.ost_io.threads_max=256"?
>
> The conf_param syntax is (unfortunately) slightly different than the
> set_param syntax.  You can also set this in
> /etc/modprobe.d/lustre.conf:
>
> options ost oss_num_threads=256
> options mds mds_num_threads=256
>
>> Question 2: Since Lustre "does not reduce the number of service  
>> threads in use", is there any way I can force the extra running  
>> service threads to exit, or is a reboot of the OSS servers the only  
>> clean way?
>
> I had written a patch to do this, but it hasn't landed yet.
> Currently the only way to limit the thread count is to set this
> before the number of running threads has exceeded the maximum thread
> count.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Engineer
> Whamcloud, Inc.