[Lustre-discuss] poor ost write performance.
Kevin Van Maren
kevin.van.maren at oracle.com
Wed Apr 20 05:57:47 PDT 2011
Yes, difficulty finding free disk space can also be a problem, but I
don't recall big changes in how that worked since 1.6, other than
memory pressure from the read cache pushing out the bitmaps.
See http://jira.whamcloud.com/browse/LU-15
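
For reference, the 1.8 OSS read cache is controlled per obdfilter
device with lctl; checking it, and capping how large a file it will
cache (the 32M below is only an example value), looks like:

  lctl get_param obdfilter.*.read_cache_enable
  lctl set_param obdfilter.*.readcache_max_filesize=32M
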
Kevin
James Rose wrote:
> Hi Kevin,
>
> Thanks for the suggestion. I will try this out.
>
> For the moment it seems that it may be disk-space related. I have
> removed some data from the file system. Performance returned to where I
> would expect it to be as space freed up (currently at 83% full). Since
> freeing space I have seen two messages on an OSS where the number of
> threads is tuned to the amount of RAM in the host, and six on an OSS
> that has the number of threads set higher than it should be. This is a
> much better situation than the steady stream I was experiencing last
> night. Maybe disabling the read cache will remove the last few.
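>
> For reference, the thread count I am referring to is the one set at
> module load time in /etc/modprobe.conf, along these lines (the value
> is only an example):
>
>   options ost oss_num_threads=256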
>
> I am still very curious what the rapid small reads seen when writing
> are, as these showed up while mounted as ldiskfs, so not doing regular
> Lustre operations at all.
>
> Thanks again for your help,
>
> James.
>
>
> On Wed, 2011-04-20 at 08:48 -0300, Kevin Van Maren wrote:
>
>> First guess is the increased memory pressure caused by the Lustre 1.8
>> read cache. Many times "slow" messages are caused by memory
>> allocations taking a long time.
>>
>> You could try disabling the read cache and see if that clears up the
>> slow messages.
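>>
>> Something along these lines on each OSS should do it (1.8 tunables;
>> the second line also disables the writethrough cache, which is
>> optional):
>>
>>   lctl set_param obdfilter.*.read_cache_enable=0
>>   lctl set_param obdfilter.*.writethrough_cache_enable=0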
>>
>> Kevin
>>
>>
>> On Apr 20, 2011, at 4:29 AM, James Rose <James.Rose at framestore.com>
>> wrote:
>>
>>
>>> Hi
>>>
>>> We have been experiencing degraded performance for a few days on a
>>> fresh install of Lustre 1.8.5 (on RHEL5 using the Sun ext4 RPMs). The
>>> initial bulk load of the data is fine, but once in use for a while,
>>> writes become very slow to individual OSTs. This blocks IO for a few
>>> minutes and then carries on as normal. The slow writes then move to
>>> another OST. This can be seen in iostat, and many slow-IO messages
>>> appear in the logs (example included).
>>>
>>> The OSTs are between 87-90% full. Not ideal, but this did not cause
>>> any issues when running 1.6.7.2 on the same hardware.
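>>>
>>> For reference, per-OST usage can be checked from a client with
>>> something like:
>>>
>>>   lfs df -h | grep OST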
>>>
>>> The OSTs are RAID6 on an external RAID chassis (Infortrend). Each OST
>>> is 5.4T (small). The server is dual AMD (4 cores), 16GB RAM, QLogic
>>> FC HBA.
>>>
>>> I mounted the OSTs as ldiskfs and tried a few write tests. These
>>> also show the same behaviour.
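>>>
>>> The tests were along these lines (device and mount point are only
>>> examples; oflag=direct bypasses the page cache):
>>>
>>>   mount -t ldiskfs /dev/sdb /mnt/ost-test
>>>   dd if=/dev/zero of=/mnt/ost-test/testfile bs=1M count=4096 oflag=direct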
>>>
>>> While a write operation is blocked there are hundreds of read tps
>>> and a very small kB/s read from the RAID, but no writes. As soon as
>>> this completes, writes go through at a more expected speed.
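>>>
>>> For anyone reproducing this, watching the device with something like:
>>>
>>>   iostat -xm 2 /dev/sdb
>>>
>>> shows hundreds of r/s with near-zero rMB/s and no w/s during the
>>> stall, after which writes resume.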
>>>
>>> Any idea what is going on?
>>>
>>> Many thanks
>>>
>>> James.
>>>
>>> Example error messages:
>>>
>>> Apr 20 04:53:04 oss5r-mgmt kernel: LustreError: dumping log to /tmp/lustre-log.1303271584.3935
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow quota init 286s due to heavy IO load
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal start 39s due to heavy IO load
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 39 previous similar messages
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow brw_start 39s due to heavy IO load
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 38 previous similar messages
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal start 133s due to heavy IO load
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 44 previous similar messages
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow brw_start 133s due to heavy IO load
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 44 previous similar messages
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal start 236s due to heavy IO load
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow i_mutex 40s due to heavy IO load
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 2 previous similar messages
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 6 previous similar messages
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow i_mutex 277s due to heavy IO load
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow direct_io 286s due to heavy IO load
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 3 previous similar messages
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal start 285s due to heavy IO load
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 1 previous similar message
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow commitrw commit 285s due to heavy IO load
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 1 previous similar message
>>> Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow parent lock 236s due to heavy IO load