[Lustre-discuss] question about size on MDS (MDT) for lustre-1.8

Tue Feb 1 10:38:15 PST 2011

On Jan 28, 2011, at 10:04 AM, Andreas Dilger wrote:

>> In the absence of controls on the size of the page cache, or enough RAM to cache all of the inode and directory blocks in memory, another potential solution is to place the metadata on an SSD. One can generate a dm linear target table that carves up an ext3/ext4 filesystem such that the inode blocks go on one device, and the data blocks go on another. Ideally the inode blocks would be placed on an SSD. 
>> 
>> I've tried this with both ext3 and with ext4, using flex_bg on ext4 to reduce the size of the dm table. IIRC the added overhead is acceptable in both cases: 1 µs on average.
> 
> I'd be quite interested to see the results of such testing.

I'm waiting for more hardware to show up so I can restart my testing. Hope to have some results to share in another 3-4 weeks. 
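
For anyone following along, here is a minimal sketch of how the dm linear table described in the quoted text could be generated. It is an illustration under stated assumptions, not the script used in the testing: the device names are placeholders, the inode table extents are passed in by hand (in practice they could be taken from dumpe2fs output), and the HDD keeps the same offsets as the logical device, so the space shadowed by the SSD is simply left unused.

```python
SECTOR_SIZE = 512  # device-mapper tables are expressed in 512-byte sectors


def build_linear_table(total_blocks, inode_extents, block_size=4096,
                       ssd_dev="/dev/ssd_example", hdd_dev="/dev/hdd_example"):
    """Return dm table lines of the form
    "<start> <length> linear <device> <offset>".

    inode_extents is a list of (start_block, block_count) pairs covering
    the inode tables. Device names are hypothetical placeholders.
    """
    spb = block_size // SECTOR_SIZE  # sectors per filesystem block
    lines = []
    cursor = 0       # next logical block not yet mapped
    ssd_offset = 0   # next free sector on the SSD (packed contiguously)

    for start, length in sorted(inode_extents):
        if start < cursor or start + length > total_blocks:
            raise ValueError("extents must not overlap or exceed the filesystem")
        if start > cursor:
            # Data blocks before this inode table stay on the HDD,
            # at the same offset as on the logical device.
            lines.append(f"{cursor * spb} {(start - cursor) * spb} "
                         f"linear {hdd_dev} {cursor * spb}")
        # Inode table blocks are redirected to the SSD.
        lines.append(f"{start * spb} {length * spb} "
                     f"linear {ssd_dev} {ssd_offset}")
        ssd_offset += length * spb
        cursor = start + length

    if cursor < total_blocks:
        # Remaining data blocks after the last inode table.
        lines.append(f"{cursor * spb} {(total_blocks - cursor) * spb} "
                     f"linear {hdd_dev} {cursor * spb}")
    return lines


if __name__ == "__main__":
    # Two made-up block groups of 32768 blocks, each with a 512-block inode table.
    for line in build_linear_table(65536, [(1000, 512), (33768, 512)]):
        print(line)
```

Each inode table extent costs up to two table entries (one for the SSD segment and one for the HDD segment before it), which is why packing the inode tables together with flex_bg shrinks the table.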

>> Placing the inodes on separate storage is not sufficient, though. Slow directory block reads contribute to poor stat performance as well. Adding a feature to ext4 to reserve a number of fixed block groups for directory blocks, and always allocating them there, would help. Those block groups could then be placed on an SSD as well.
> 
> I believe there is a heuristic that allocates directory blocks in the first group of a flex_bg, so if that entire group is on SSD it would potentially avoid this problem.

There is, though I haven't tested it yet. However, you'd need to have a relatively small number of flex_bgs for this to be cost-effective. I heard through the grapevine that you suggest not using "too few" flex_bgs on an ext4 filesystem. Can you elaborate on what might be a reasonable number, and why?
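
To put rough numbers on the cost-effectiveness point, here is a back-of-the-envelope sketch. It assumes, as a simplification not confirmed in this thread, that the entire first block group of every flex_bg goes on SSD and that each block group holds 8 * block_size blocks (one bitmap block's worth). The function name and parameters are made up for illustration.

```python
def ssd_bytes_needed(fs_bytes, groups_per_flex_bg, block_size=4096):
    """Estimate SSD capacity if the first group of each flex_bg is on SSD.

    Assumes a block group spans 8 * block_size blocks, i.e. the number of
    blocks a single block bitmap can describe.
    """
    group_bytes = 8 * block_size * block_size
    total_groups = fs_bytes // group_bytes
    # Ceiling division: a partial flex_bg at the end still needs its first group.
    flex_bgs = -(-total_groups // groups_per_flex_bg)
    return flex_bgs * group_bytes


if __name__ == "__main__":
    one_tib = 2 ** 40
    for size in (16, 64, 256):
        gib = ssd_bytes_needed(one_tib, size) / 2 ** 30
        print(f"{size} groups per flex_bg: {gib:.0f} GiB of SSD per TiB")
```

Under these assumptions the SSD share is simply 1 / (groups per flex_bg), so fewer, larger flex_bgs need proportionally less SSD.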

Thanks,

Jason

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035