[Lustre-discuss] Thread might be hung, Heavy IO Load messages

David Noriega tsk133 at my.utsa.edu
Wed Feb 1 13:11:38 PST 2012


zone_reclaim_mode is 0 on all clients/servers

When changing the number of service threads or the lru_size, can these
be changed on the fly, or do they require a reboot of the client or
server?
For my two OSSes, cat /proc/fs/lustre/ost/OSS/ost_io/threads_started
gives about 300 (300 and 359), so I'm thinking of trying half of that
and seeing how it goes?

Also, checking lru_size, I get different numbers from the clients with
cat /proc/fs/lustre/ldlm/namespaces/*/lru_size:

Client: MDT0 OST0 OST1 OST2 OST3 MGC
head node: 0 22 22 22 22 400 (only a few users logged in)
busy node: 1 501 504 503 505 400 (Fully loaded with jobs)
samba/nfs server: 4 440070 44370 44348 26282 1600

So my understanding is that lru_size is set to auto by default, hence
the varying values, but setting it manually effectively sets a max
value? Also, what does it mean to have a lower value (especially in
the case of the samba/nfs server)?
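To make the comparison between clients concrete, here is a sketch that sums lru_size across all namespaces for one client (the awk sum is the only part that runs here; the input is the busy-node row from the table above rather than live /proc data, and the lctl line reflects my understanding that a manual value disables the dynamic sizing rather than merely capping it):

```shell
# Sum lru_size across namespaces to get total locks held by one client.
# On a real client the input would come from:
#   cat /proc/fs/lustre/ldlm/namespaces/*/lru_size
# Here we feed in the busy-node numbers from the table above instead.
printf '%s\n' 1 501 504 503 505 400 |
    awk '{ total += $1 } END { print "locks held:", total }'

# Setting a manual per-namespace limit on a client would be something like:
#   lctl set_param ldlm.namespaces.*.lru_size=600
```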

On Wed, Feb 1, 2012 at 1:27 PM, Charles Taylor <taylor at hpc.ufl.edu> wrote:
>
> You may also want to check and, if necessary, limit the lru_size on your clients.   I believe there are guidelines in the ops manual.      We have ~750 clients and limit ours to 600 per OST.   That, combined with setting zone_reclaim_mode=0, should make a big difference.
>
> Regards,
>
> Charlie Taylor
> UF HPC Center
>
>
> On Feb 1, 2012, at 2:04 PM, Carlos Thomaz wrote:
>
>> Hi David,
>>
>> You may be facing the same issue discussed in previous threads,
>> namely the zone_reclaim_mode issue.
>>
>> Take a look at the previous thread where Kevin and I replied to
>> Vijesh Ek.
>>
>> If you don't have access to the previous emails, look at your kernel
>> settings for the zone reclaim:
>>
>> cat /proc/sys/vm/zone_reclaim_mode
>>
>> It should be set to 0.
>>
>> Also, look at the number of Lustre OSS service threads. It may be set
>> too high...
>>
>> Rgds.
>> Carlos.
>>
>>
>> --
>> Carlos Thomaz | HPC Systems Architect
>> Mobile: +1 (303) 519-0578
>> cthomaz at ddn.com | Skype ID: carlosthomaz
>> DataDirect Networks, Inc.
>> 9960 Federal Dr., Ste 100 Colorado Springs, CO 80921
>> ddn.com <http://www.ddn.com/> | Twitter: @ddn_limitless
>> <http://twitter.com/ddn_limitless> | 1.800.TERABYTE
>>
>>
>>
>>
>>
>> On 2/1/12 11:57 AM, "David Noriega" <tsk133 at my.utsa.edu> wrote:
>>
>>> indicates the system was overloaded (too many service threads, or
>>>
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
> Charles A. Taylor, Ph.D.
> Associate Director,
> UF HPC Center
> (352) 392-4036
>
>
>



-- 
David Noriega
System Administrator
Computational Biology Initiative
High Performance Computing Center
University of Texas at San Antonio
One UTSA Circle
San Antonio, TX 78249
Office: BSE 3.112
Phone: 210-458-7100
http://www.cbi.utsa.edu
