[Lustre-devel] 64-bit ext4 inodes

Thu Apr 21 15:56:27 PDT 2011

On Apr 20, 2011, at 11:07 AM, Nathan Rutman wrote:
> We're about to start looking into 64-bit inodes on ext4 -- anybody else working on this at the moment?

I don't know of anyone other than Fujitsu who is looking at this.  I haven't had any technical discussion with them about it, nor looked at their code; I have only inferred from their slides that this is what they did.

One important consideration to note with 64-bit inode numbers is that they cannot be used safely on 1.8 MDS filesystems.  That is because the IGIF FID namespace only has room for 2^32 inode numbers (mapped into the 2.x FID SEQ field, see ), so upgrading a 1.8 MDS with 64-bit inode numbers to 2.x would cause a huge world of hurt.  If this is limited to 2.x filesystems that only identify MDS inodes via FIDs to the clients then this is not a concern.
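To make the IGIF limit concrete, here is a minimal sketch of the 1.8-to-2.x mapping as described above: the inode number goes into the FID SEQ field and must fit in the IGIF range.  The constant values and the placement of the generation in the OID field are my recollection of the 2.x FID layout and should be treated as assumptions; the function name is made up for illustration.

```python
# Assumed bounds of the IGIF sequence range in the 2.x FID namespace.
FID_SEQ_IGIF = 12
FID_SEQ_IGIF_MAX = 0xFFFFFFFF


def igif_from_inode(ino, generation):
    """Return a (seq, oid, ver) tuple for a 1.8 MDS inode (hypothetical helper).

    The inode number becomes the SEQ field, so anything above 2^32 - 1
    cannot be represented and is rejected here.
    """
    if not FID_SEQ_IGIF <= ino <= FID_SEQ_IGIF_MAX:
        raise ValueError("inode %d does not fit in the IGIF range" % ino)
    return (ino, generation & 0xFFFFFFFF, 0)
```

A 32-bit inode number maps cleanly, while a 64-bit one such as 2**32 has no IGIF representation at all, which is why the upgrade path breaks.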

If you are concerned with being able to upgrade 32-bit inode filesystems into 64-bit inode filesystems, you should look at the data-in-dirent patch that is currently in the 2.x ldiskfs patchset.  It was developed to allow storing the 128-bit Lustre FID in the directory entry in a compatible manner, but was also designed to allow storing the high 32 bits of a 64-bit inode number into the directory entry.  That allows compatibility for upgrades by avoiding the need to atomically upgrade a whole directory full of dirents to have 64-bit inode number fields.  It also avoids the need to store 64-bit inode numbers when most of them are 32-bit values.
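As a rough illustration of the idea (not the actual ldiskfs on-disk format), the sketch below splits a 64-bit inode number into the low 32 bits, which would stay in the normal dirent inode field, and the high 32 bits, which would only be stored as extra dirent data when they are non-zero.  The function names are hypothetical.

```python
def split_ino(ino64):
    """Split an inode number into (low 32 bits, high 32 bits or None).

    The high half is returned as None when it is zero, modelling a dirent
    that carries no extra data and stays in the existing 32-bit format.
    """
    lo = ino64 & 0xFFFFFFFF
    hi = ino64 >> 32
    return lo, (hi if hi else None)


def join_ino(lo, hi=None):
    """Rebuild the inode number; an old-style dirent has no high half."""
    return lo if hi is None else (hi << 32) | lo
```

Existing dirents never need rewriting under this scheme, since a missing high half simply reads back as a 32-bit inode number.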

I'm not against the idea of exploring this approach, but I am concerned that e2fsck time will continue to grow with the number of inodes in a single filesystem, and scalability of a single MDT will increasingly be an issue.  

AFAIK, e2fsck times are currently on the order of 1h/100M inodes, and no Lustre filesystems that I know of are above 500M inodes today just due to the average file size being so large and the long e2fsck times.  With improvements in ext4 to reduce e2fsck time like flex_bg, and SSDs, it may be possible to reduce the e2fsck time/inode ratio noticeably, but I think it would take more effort on the e2fsck side than the ext4 side to make many billions of inodes in a single MDT a practical approach.
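For a sense of scale, here is a back-of-the-envelope estimate using the 1h/100M inodes figure above.  It assumes e2fsck time scales linearly with inode count, which is a simplification and not something I have measured at these sizes.

```python
# Rough rate quoted above: about one hour of e2fsck per 100M inodes.
HOURS_PER_100M_INODES = 1.0


def est_e2fsck_hours(num_inodes):
    """Estimate e2fsck wall-clock hours, assuming linear scaling."""
    return num_inodes / 100e6 * HOURS_PER_100M_INODES


for n in (500_000_000, 4_000_000_000):
    print("%d inodes: about %.0f hours" % (n, est_e2fsck_hours(n)))
```

At that rate a 500M inode MDT is about 5 hours, and 4 billion inodes (roughly the 32-bit limit) is already around 40 hours, before going anywhere near 64-bit inode counts.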

Things that would help on the e2fsck side include metadata prefetching like Dave Dillow was doing for e2scan, with event-driven completion handlers that process the blocks whenever they arrive from the disk, and multi-threading some of the e2fsck passes with an understanding of the underlying disk layout (using the s_raid_stride and s_raid_stripe_width superblock fields).

Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.