[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8

Fri Jan 28 10:04:24 PST 2011

On 2011-01-28, at 10:45, Jason Rappleye wrote:
> Sometimes the performance drop is worse, and we see just tens of stats/second (or fewer!). This is because filter_{fid2dentry,precreate,destroy} all need to take a lock on the parent directory of the object on the OST. Unlink or precreate operations that spend a long time inside the critical section protected by this lock will slow down stat requests. I'm working on tracking down the cause of this; it may be journal related. BZ 22107 is probably relevant as well.

There is work underway to allow the locking of the ldiskfs directories to be multi-threaded.  This should significantly improve performance in such cases.

> Our largest filesystem, in terms of inodes, has about 1.8M inodes per OST, and 15 OSTs per OSS. Of the 470400 inode blocks on disk (58800 block groups * 8 inode blocks/group), ~36% have at least one inode used. We pre-read those and ignore the empty inode blocks. Looking at the OSTs on one OSS, we have an average of 3891 directory blocks per OST.
> 
> In the absence of controls on the size of the page cache, or enough RAM to cache all of the inode and directory blocks in memory, another potential solution is to place the metadata on an SSD. One can generate a dm linear target table that carves up an ext3/ext4 filesystem such that the inode blocks go on one device, and the data blocks go on another. Ideally the inode blocks would be placed on an SSD. 
> 
> I've tried this with both ext3 and ext4, the latter using flex_bg to reduce the size of the dm table. IIRC the overhead is acceptable in both cases: about 1us on average.
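
As a rough sizing check on the figures quoted above, the sketch below works out how much RAM (or SSD) the in-use inode blocks and directory blocks would occupy. The 4 KiB block size is an assumption; the thread does not state it. Under that assumption the result is roughly 677 MiB per OST and just under 10 GiB per OSS.

```python
# Figures quoted in the thread.
BLOCK_GROUPS = 58_800
INODE_BLOCKS_PER_GROUP = 8
USED_PERCENT = 36            # "~36%" of inode blocks hold at least one inode
DIR_BLOCKS_PER_OST = 3_891   # average over the OSTs on one OSS
OSTS_PER_OSS = 15

# Assumption, not stated in the thread: 4 KiB filesystem blocks.
BLOCK_SIZE = 4096


def footprint():
    """Return (inode_blocks, used_inode_blocks, bytes_per_ost, bytes_per_oss)."""
    inode_blocks = BLOCK_GROUPS * INODE_BLOCKS_PER_GROUP
    used_inode_blocks = inode_blocks * USED_PERCENT // 100
    bytes_per_ost = (used_inode_blocks + DIR_BLOCKS_PER_OST) * BLOCK_SIZE
    return inode_blocks, used_inode_blocks, bytes_per_ost, bytes_per_ost * OSTS_PER_OSS


if __name__ == "__main__":
    total, used, per_ost, per_oss = footprint()
    print(f"inode blocks per OST:      {total}")
    print(f"in-use inode blocks:       {used}")
    print(f"metadata to cache per OST: {per_ost / 2**20:.1f} MiB")
    print(f"metadata to cache per OSS: {per_oss / 2**30:.2f} GiB")
```

The directory blocks are a small fraction of the total here; the in-use inode blocks dominate.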

I'd be quite interested to see the results of such testing.
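
For anyone wanting to try the dm linear split described in the quoted text, here is a minimal sketch of generating such a table. It is not Jason's script. It assumes the inode table extents are already known as (start_block, block_count) pairs, for example parsed from dumpe2fs output. The function name, the device paths /dev/sdX and /dev/sdY, and the identity mapping of data blocks onto the data device are all hypothetical choices made for illustration. Each output line uses the standard dm linear format: logical start sector, length in sectors, "linear", device, offset on that device.

```python
SECTOR_SIZE = 512  # device-mapper tables are expressed in 512-byte sectors


def dm_linear_table(total_blocks, inode_extents, block_size=4096,
                    data_dev="/dev/sdX", meta_dev="/dev/sdY"):
    """Build dm 'linear' table lines that send inode blocks to meta_dev.

    inode_extents: sorted, non-overlapping (start_block, block_count) pairs.
    data_dev / meta_dev are placeholder device paths (hypothetical).
    """
    spb = block_size // SECTOR_SIZE   # sectors per filesystem block
    lines = []
    pos = 0        # next logical block not yet mapped
    meta_off = 0   # next free sector on the metadata device

    def data_line(first, last):
        # Data blocks keep the same offset on the data device (identity map).
        return f"{first * spb} {(last - first) * spb} linear {data_dev} {first * spb}"

    for start, count in inode_extents:
        if start < pos or start + count > total_blocks:
            raise ValueError("extents must be sorted, disjoint and in range")
        if start > pos:
            lines.append(data_line(pos, start))
        # Inode blocks are packed back to back on the metadata device.
        lines.append(f"{start * spb} {count * spb} linear {meta_dev} {meta_off}")
        meta_off += count * spb
        pos = start + count

    if pos < total_blocks:
        lines.append(data_line(pos, total_blocks))
    return lines


if __name__ == "__main__":
    # Toy layout: 32 blocks, two 8-block inode tables.
    for line in dm_linear_table(32, [(2, 8), (18, 8)]):
        print(line)
```

The output could be fed to dmsetup create. The identity mapping leaves unused holes on the data device where the inode blocks would have been, which keeps the table simple at the cost of a little space. The table has roughly two lines per inode extent, which is why fewer, larger extents (as with flex_bg) keep it small.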

> Placing the inodes on separate storage is not sufficient, though. Slow directory block reads contribute to poor stat performance as well. Adding a feature to ext4 to reserve a number of fixed block groups for directory blocks, and always allocating them there, would help. Those block groups could then be placed on an SSD as well.

I believe there is a heuristic that allocates directory blocks in the first group of a flex_bg, so if that entire group is on SSD it would potentially avoid this problem.

Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.