[Lustre-devel] 64-bit ext4 inodes
Nathan_Rutman at xyratex.com
Thu Apr 21 16:36:11 PDT 2011
On Apr 21, 2011, at 3:56 PM, Andreas Dilger wrote:
> On Apr 20, 2011, at 11:07 AM, Nathan Rutman wrote:
>> We're about to start looking into 64-bit inodes on ext4 -- anybody else working on this at the moment?
> I don't know of anyone other than Fujitsu who is looking at this. I haven't had any technical discussion with them about it, nor have I looked at their code; I only inferred from their slides that this is what they did.
Ah right, thanks.
> One important consideration to note with 64-bit inode numbers is that they cannot be used safely on 1.8 MDS filesystems. That is because the IGIF FID namespace only has room for 2^32 inode numbers (mapped into the 2.x FID SEQ field, see ), so upgrading a 1.8 MDS with 64-bit inode numbers to 2.x would cause a huge world of hurt. If this is limited to 2.x filesystems that only identify MDS inodes via FIDs to the clients then this is not a concern.
This would be for new systems; upgrading is not a concern.
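To make the IGIF constraint concrete, here is a minimal sketch of the range check it implies. It assumes the inode number is carried in the FID SEQ field, as the quoted text says, and (an assumption not stated in this thread) that the inode generation goes in the OID field. The function name `igif_from_inode` and the tuple layout are hypothetical and are not Lustre code.

```python
IGIF_INO_LIMIT = 2 ** 32  # "room for 2^32 inode numbers", per the quoted text


def igif_from_inode(ino, generation):
    """Return a hypothetical (seq, oid, ver) tuple for a 1.8-style inode.

    Raises ValueError when the inode number does not fit in the
    32-bit space that the IGIF namespace provides.
    """
    if not 0 < ino < IGIF_INO_LIMIT:
        raise ValueError(f"inode {ino} does not fit in the IGIF namespace")
    # Assumed layout: SEQ carries the inode number, OID the generation.
    return (ino, generation & 0xFFFFFFFF, 0)
```

Under these assumptions an inode number of 2^32 or above has no IGIF representation, which is why an upgraded 1.8 MDS with 64-bit inode numbers would be a problem.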
> If you are concerned with being able to upgrade 32-bit inode filesystems into 64-bit inode filesystems, you should look at the data-in-dirent patch that is currently in the 2.x ldiskfs patchset. It was developed to allow storing the 128-bit Lustre FID in the directory entry in a compatible manner, but was also designed to allow storing the high 32 bits of a 64-bit inode number into the directory entry. That allows compatibility for upgrades by avoiding the need to atomically upgrade a whole directory full of dirents to have 64-bit inode number fields. It also avoids the need to store 64-bit inode numbers when most of them are 32-bit values.
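The idea of keeping the 32-bit inode field and storing only the high 32 bits as optional extra dirent data can be sketched roughly as below. This is an illustration of the splitting scheme described above, not the ldiskfs on-disk format; `split_ino` and `join_ino` are hypothetical helper names.

```python
def split_ino(ino):
    """Split a 64-bit inode number into (low32, high32_or_None).

    The high half is returned as None when it is zero, modelling a
    dirent that needs no extra data for a 32-bit inode number.
    """
    low = ino & 0xFFFFFFFF
    high = ino >> 32
    return low, (high if high else None)


def join_ino(low, high=None):
    """Rebuild the inode number; a missing high half means zero."""
    return low | ((high or 0) << 32)
```

Existing dirents with 32-bit inode numbers round-trip unchanged under this scheme, so a directory would not need to be rewritten in one atomic step.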
> I'm not against the idea of exploring this approach, but I am concerned that e2fsck time will continue to grow with the number of inodes in a single filesystem, and scalability of a single MDT will increasingly be an issue.
> AFAIK, e2fsck times are currently on the order of 1h/100M inodes, and no Lustre filesystems that I know of are above 500M inodes today just due to the average file size being so large and the long e2fsck times. With improvements in ext4 to reduce e2fsck time like flex_bg, and SSDs, it may be possible to reduce the e2fsck time/inode ratio noticeably, but I think it would take more effort on the e2fsck side than the ext4 side to make many billions of inodes in a single MDT a practical approach.
With an on-line, continuous filesystem check (or other solutions) this should no longer be limiting :)
In any case, fsck time is a problem that needs to be solved, for 32-bit or 64-bit inodes.
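For a rough sense of scale, the 1h/100M inodes figure quoted above can be extrapolated as follows. This is a back-of-the-envelope sketch that assumes e2fsck time grows linearly with inode count, which the thread does not claim; `estimated_fsck_hours` is a hypothetical helper.

```python
def estimated_fsck_hours(inodes, hours_per_100m=1.0):
    """Linear extrapolation of e2fsck time from the quoted 1h/100M ratio."""
    return inodes / 100_000_000 * hours_per_100m


if __name__ == "__main__":
    for count in (500_000_000, 2 ** 32):
        print(f"{count} inodes: ~{estimated_fsck_hours(count):.1f} h")
```

Under that assumption, 500M inodes comes to about 5 hours, and a full 32-bit inode space (2^32 inodes) to about 43 hours, before any 64-bit growth is considered.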
> Things that would help here include the metadata prefetching that Dave Dillow was doing for e2scan, with event-driven completion handlers that process blocks whenever they arrive from disk, and multi-threading some of the e2fsck passes with an understanding of the underlying disk layout (using s_raid_stride and s_raid_stripe_width).
> Cheers, Andreas
> Andreas Dilger
> Principal Engineer
> Whamcloud, Inc.