[Lustre-discuss] high OSS load - readcache_max_filesize

Sat May 7 00:08:13 PDT 2011

On 2011-05-05, at 11:39 AM, Thomas Roth <t.roth at gsi.de> wrote:
> a recent posting here (which I can't find atm) has pointed me to 
> http://jira.whamcloud.com/browse/LU-15, where an issue is discussed that 
> we seem to see as well: some OSS really get overloaded, and the log says
> 
> slow journal start 36s due to heavy IO load
> slow commitrw commit 36s due to heavy IO load
> slow start_page_read 169s due to heavy IO load
> slow direct_io 34s due to heavy IO load
> ...
> 
> The bugzilla discussion seems to propose a number of steps to go on each 
> OSS as a workaround, among them setting
> readcache_max_filesize=32M  or  readcache_max_filesize=0
> 
> I have checked the current value of this parameter and found
> readcache_max_filesize=18446744073709551615
> which translates to 16 EB (if I counted the powers of 1024 correctly).

Right - this is 2^64 - 1.
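
As a quick sanity check of that arithmetic, here is a minimal Python sketch. The helper name to_eib is only illustrative and is not part of any Lustre tool; the value is the one reported above.

```python
# Value reported for readcache_max_filesize on the OSS.
value = 18446744073709551615


def to_eib(num_bytes):
    """Convert a byte count to exbibytes (1 EiB = 1024**6 bytes)."""
    return num_bytes / 1024**6


# The value is the largest unsigned 64-bit integer, i.e. 2^64 - 1,
# which in practice means "no limit".
print(value == 2**64 - 1)
print(f"{to_eib(value):.1f} EiB")
```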

> Am I correct assuming that this is the default value, and that this 
> default is meant to read "unlimited"?

Correct. 

> Or is our OSS configuration just 
> badly messed up?
> 
> 
> Also, people recommend pinning the bitmaps to memory - how do you do that?

There is no mechanism to do this today. It is possible to preload the bitmaps at mount time, or I guess it may be possible to write a program that maps the bitmaps from the disk and then mlock()s that memory, but that would pin 32 MB of RAM per TB of filesystem. If an OSS has 8x 8TB OSTs, that is 2 GB of RAM. I think there are more efficient solutions than this.
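
The memory estimate can be reproduced with a short Python sketch. It assumes a 4 KiB block size and one bitmap bit per block, and it counts only block bitmaps; the function name bitmap_bytes is hypothetical and used just for this illustration.

```python
BLOCK_SIZE = 4096   # assumed filesystem block size in bytes
TIB = 1024**4       # one tebibyte in bytes


def bitmap_bytes(fs_bytes, block_size=BLOCK_SIZE):
    """Bytes of block bitmap needed, at one bit per filesystem block."""
    blocks = fs_bytes // block_size
    return blocks // 8


# Bitmap memory for 1 TiB of filesystem.
per_tib = bitmap_bytes(TIB)

# Bitmap memory for an OSS serving 8 OSTs of 8 TiB each.
oss_total = 8 * bitmap_bytes(8 * TIB)

print(per_tib // 1024**2, "MiB per TiB")
print(oss_total // 1024**3, "GiB for 8 x 8 TiB OSTs")
```

Under those assumptions this gives 32 MiB per TiB and 2 GiB for the whole OSS, matching the figures above.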

> Preallocation tables all seem to contain "256 512 1024", so no shrinking 
> of prealloc_table is necessary.
> The OSTs in question have just reached the 85% level. We have a number 
> of older OSS which are closer to 95% - I guess the problem doesn't show 
> up there, because there is no room for further files anyhow...

This is considered very full for the filesystem, so it isn't very surprising that you are seeing such messages. In the future, the flex_bg option will be usable for new filesystems, but that won't help existing filesystems today. 

Cheers, Andreas