[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8

Fri Jan 28 09:45:10 PST 2011

On Jan 27, 2011, at 11:34 PM, Robin Humble wrote:

> On Thu, Jan 13, 2011 at 05:28:23PM -0500, Kit Westneat wrote:
>>> It would probably be better to set:
>>> 
>>> lctl conf_param fsname-OST00XX.ost.readcache_max_filesize=32M
>>> 
>>> or similar, to limit the read cache to files 32MB in size or less (or whatever you consider "small" files at your site). That allows the read cache for config files and such, while not thrashing the cache while accessing large files.
>>> 
>>> We should probably change this to be the default, but at the time the read cache was introduced, we didn't know what should be considered a small vs. large file, and the amount of RAM, the number of OSTs on an OSS, and the uses vary so much that it is difficult to pick a single correct value for this.
> 
> limiting the total amount of OSS cache used in order to leave room for
> inodes/dentries might be more useful. the data cache will always fill
> up and push out inodes otherwise.

The inode and dentry objects in the slab cache aren't as much of an issue as keeping the disk blocks they are generated from available in the buffer cache. Constructing the in-memory inode and dentry objects is cheap as long as the corresponding disk blocks are available. Doing the disk reads, depending on your hardware and some other factors, is not.

> Nathan's approach of turning off the caches entirely is extreme, but if
> it gives us back some metadata performance then it might be worth it.

We went to the extreme and disabled the OSS read cache (+ writethrough cache). In addition, on the OSSes we pre-read all of the inode blocks that contain at least one used inode, along with all of the directory blocks.

The results have been promising so far. Firing off a du on an entire filesystem, 3000-6000 stats/second is typical. I've noted a few causes of slowdowns so far; there may be more.

First, no attempt has been made to pre-read metadata from the MDT. The need to read in inode and directory blocks may slow things down quite a bit. I can't find the numbers in my notes at the moment, but I recall seeing 200-500 stats/second when the MDS needed to do I/O.

When memory runs low on a client, kswapd kicks in to try and free up pages. On the client I'm currently testing on, almost all of the memory used is in the slab. It looks like kswapd has a difficult time clearing things up, and the client can go several seconds before the current stat call is completed. Dropping caches will (temporarily) get the performance back to expected rates. I haven't dug into this one too much yet.

Sometimes the performance drop is worse, and we see just tens of stats/second (or fewer!). This is because filter_{fid2dentry,precreate,destroy} all need to take a lock on the parent directory of the object on the OST. Unlink or precreate operations whose critical sections, protected by this lock, take a long time to complete will slow down stat requests. I'm working on tracking down the cause of this; it may be journal related. BZ 22107 is probably relevant as well.
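
For anyone who wants to get comparable stats/second figures on their own client, here is a minimal sketch in Python. It is not the tooling used for the numbers above; the function name `measure_stat_rate` is made up for illustration, and it only approximates what du does by walking a tree and calling lstat on every entry.

```python
import os
import time

def measure_stat_rate(root):
    """Walk root, lstat every entry, return (count, seconds, stats_per_second)."""
    count = 0
    start = time.monotonic()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            count += 1
    elapsed = time.monotonic() - start
    # Guard against a zero-length interval on very small trees.
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, elapsed, rate

if __name__ == "__main__":
    import sys
    n, secs, rate = measure_stat_rate(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(f"{n} stats in {secs:.1f}s ({rate:.0f} stats/second)")
```

Client-side caching will inflate the rate on a second run, so results are only meaningful against a cold client cache.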

> or is there a Lustre or VM setting to limit overall OSS cache size?

No, but I think that would be really useful in this situation.

> I presume that Lustre's OSS caches are subject to normal Linux VM
> pagecache tweakables, but I don't think such a knob exists in Linux at
> the moment...

Correct on both counts. A patch was proposed to do this, but I don't see any evidence of it making it into the kernel:

http://lwn.net/Articles/218890/

I have a small set of perl, bash, and SystemTap scripts to read the inode and directory blocks from disk and monitor the performance of the relevant Lustre calls on the servers. I'll clean them up and send them to the list next week. A more elegant solution would be to get e2scan to do the job, but I haven't taken a hack at that yet.
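
Until those scripts are posted, the core of the pre-read idea can be sketched in a few lines of Python. This is an illustration only, not the perl/bash scripts mentioned above: `preread_blocks` is a hypothetical name, the 4096-byte block size is an assumption, and the list of block numbers is assumed to come from elsewhere (for example from the block group descriptors). The reads go through the normal buffered path, so the kernel keeps the blocks cached; the data itself is thrown away.

```python
import os

def preread_blocks(path, block_numbers, block_size=4096):
    """Read each listed block once so the kernel caches it. Returns bytes read."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Sorting keeps the reads in ascending disk order.
        for blk in sorted(block_numbers):
            data = os.pread(fd, block_size, blk * block_size)
            total += len(data)  # short or empty reads past EOF count as-is
    finally:
        os.close(fd)
    return total
```

Usage would be along the lines of `preread_blocks("/dev/sdX", blocks)`, run as a user that can read the OST block device, where `/dev/sdX` is a placeholder.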

Our largest filesystem, in terms of inodes, has about 1.8M inodes per OST, and 15 OSTs per OSS. Of the 470400 inode blocks on disk (58800 block groups * 8 inode blocks/group), ~36% have at least one inode used. We pre-read those and ignore the empty inode blocks. Looking at the OSTs on one OSS, we have an average of 3891 directory blocks per OST.
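
As a rough check on how much memory the pre-read blocks occupy, the arithmetic can be worked through as below. The 4096-byte block size is an assumption (it is not stated above), and the sketch takes the inode block counts as given without assuming whether they are per OST or for some larger unit.

```python
BLOCK_SIZE = 4096  # assumption: 4 KiB filesystem blocks

block_groups = 58800
inode_blocks_per_group = 8
used_fraction = 0.36          # the "~36%" quoted above
dir_blocks_per_ost = 3891
osts_per_oss = 15

inode_blocks = block_groups * inode_blocks_per_group
used_inode_blocks = round(inode_blocks * used_fraction)
dir_blocks_per_oss = dir_blocks_per_ost * osts_per_oss

def mib(blocks, block_size=BLOCK_SIZE):
    """Size of a number of blocks in MiB."""
    return blocks * block_size / 2**20

print(inode_blocks)                        # 470400
print(used_inode_blocks)                   # 169344
print(mib(used_inode_blocks))              # 661.5
print(round(mib(dir_blocks_per_oss), 1))   # 228.0
```

Under that block size assumption, the directory blocks are a small fraction of the total; the inode blocks dominate.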

In the absence of controls on the size of the page cache, or enough RAM to cache all of the inode and directory blocks in memory, another potential solution is to place the metadata on an SSD. One can generate a dm linear target table that carves up an ext3/ext4 filesystem such that the inode blocks go on one device, and the data blocks go on another. Ideally the inode blocks would be placed on an SSD. 
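
To make the dm table idea concrete, here is a minimal sketch of generating such a table in Python. It is not the generator used for the tests described here. The function name, the device paths, and the layout policy (inode ranges packed back to back on the SSD, everything else mapped 1:1 onto the original device) are all assumptions for illustration. The only part taken as fixed is the dm-linear line format: logical start sector, length in sectors, `linear`, device, offset on that device, all in 512-byte sectors.

```python
def build_linear_table(total_sectors, ssd_extents, hdd_dev, ssd_dev):
    """Return dm-linear table lines covering [0, total_sectors).

    ssd_extents: sorted, non-overlapping (start_sector, num_sectors) ranges
    (e.g. inode tables) that should live on ssd_dev; the rest stays on hdd_dev.
    """
    lines = []
    cursor = 0        # next logical sector not yet mapped
    ssd_offset = 0    # SSD ranges are packed back to back (assumed policy)
    for start, length in ssd_extents:
        if start < cursor or start + length > total_sectors:
            raise ValueError("extents must be sorted, non-overlapping, in range")
        if start > cursor:
            # Gap before this extent: data blocks, mapped 1:1 onto the HDD.
            lines.append(f"{cursor} {start - cursor} linear {hdd_dev} {cursor}")
        lines.append(f"{start} {length} linear {ssd_dev} {ssd_offset}")
        ssd_offset += length
        cursor = start + length
    if cursor < total_sectors:
        # Tail of the device after the last SSD extent.
        lines.append(f"{cursor} {total_sectors - cursor} linear {hdd_dev} {cursor}")
    return lines

if __name__ == "__main__":
    # Hypothetical layout: two small inode-table ranges on a 1000-sector device.
    for line in build_linear_table(1000, [(100, 50), (500, 50)],
                                   "/dev/hdd", "/dev/ssd"):
        print(line)
```

The printed lines are in the form that `dmsetup create <name>` reads from stdin. Every block group that contributes a separate extent adds two lines to the table, which is why fewer, larger contiguous inode ranges keep it small.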

I've tried this with both ext3 and ext4, the latter using flex_bg to reduce the size of the dm table. IIRC the overhead is acceptable in both cases: 1us, on average.

Placing the inodes on separate storage is not sufficient, though. Slow directory block reads contribute to poor stat performance as well. Adding a feature to ext4 to reserve a number of fixed block groups for directory blocks, and always allocating them there, would help. Those block groups could then be placed on an SSD as well.

Even with the inode and directory blocks on fast storage, stat performance will still suffer when other operations that require a lock on the object's parent directory are running slowly.

I've left out a few details and actual performance numbers from our production systems. I'll do a more detailed writeup after I take care of some other things at work, and finish recovering from 13.5 time zones' worth of jet lag :-)

Jason

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035