[lustre-discuss] billions of 50k files

Wed Nov 29 18:08:07 PST 2017

On Nov 29, 2017, at 15:31, Brian Andrus <toomuchit at gmail.com> wrote:
> 
> All,
> 
> I have always seen lustre as a good solution for large files and not the best for many small files.
> Recently, I have seen a request for a small lustre system (2 OSSes, 1 MDS) that would be for billions of files that average 50k-100k.

Taking ~75KB as the average file size, this is about 75TB of usable capacity per billion files.  Are you looking at HDD or SSD storage?  RAID or mirror?  What kind of client load, and how much does this system need to scale in the future?
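As a rough sanity check on that figure, here is a minimal sketch of the arithmetic.  The function name is made up for illustration, and it assumes decimal units (1TB = 10^12 bytes) and ignores all filesystem and RAID overhead:

```python
def usable_capacity_tb(num_files, avg_file_bytes):
    """Raw data volume in decimal terabytes, ignoring all overhead."""
    return num_files * avg_file_bytes / 1e12

BILLION = 1_000_000_000

# Low end, midpoint, and high end of the 50k-100k average quoted above.
low_tb = usable_capacity_tb(BILLION, 50_000)
mid_tb = usable_capacity_tb(BILLION, 75_000)
high_tb = usable_capacity_tb(BILLION, 100_000)

print(f"{low_tb:.0f}TB to {high_tb:.0f}TB per billion files, ~{mid_tb:.0f}TB at the midpoint")
```

So the range is 50TB to 100TB per billion files, depending on where the real average falls.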

> It seems to me that for this to be 'of worth', the block sizes on disks need to be small, but even then, with TCP overhead and inode limitations, it may still not perform all that well (compared to larger files).

Even though Lustre does 1MB or 4MB RPCs, it only allocates as much space on the OSTs as needed for the file data.  This means 4KB blocks with ldiskfs, and variable (power-of-two) blocksize on ZFS (64KB or 128KB blocks by default). You could constrain ZFS to smaller blocks if needed (e.g. recordsize=32k), or enable ZFS compression to try to fit the data into smaller blocks (this depends on whether your data is compressible).
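To illustrate why the block size matters for files of this size, here is a minimal sketch that rounds a file up to a whole number of fixed-size blocks.  This is a deliberately simplified worst-case model, not how either backend actually lays out data: it ignores metadata, compression, and ZFS's variable block sizing for small files.  The function name is hypothetical.

```python
def allocated_bytes(file_size, block_size):
    """Space used if the file occupies whole blocks of block_size bytes."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size

file_size = 50_000  # low end of the average quoted above

for label, block_size in [("4KB", 4096), ("32KB", 32 * 1024), ("128KB", 128 * 1024)]:
    used = allocated_bytes(file_size, block_size)
    waste_pct = 100 * (used - file_size) / file_size
    print(f"{label:>5} blocks: {used} bytes allocated, {waste_pct:.1f}% over file size")
```

Under this model a 50,000-byte file fits in 13 blocks of 4KB with only a few percent of slack, while a single fixed 128KB block would more than double the space used.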

The drawback is that every Lustre file currently needs an MDT inode (1KB+) and an OST inode, so Lustre isn't the most efficient for small files.
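To put a number on the MDT side of that overhead, here is a minimal sketch that takes the 1KB figure above as a lower bound per MDT inode.  The function name is illustrative, and the real per-file cost will be higher once directory entries, logs, and other MDT metadata are included:

```python
def mdt_inode_space_tb(num_files, inode_bytes=1024):
    """Lower-bound MDT space for inodes alone, in decimal terabytes."""
    return num_files * inode_bytes / 1e12

per_billion_tb = mdt_inode_space_tb(1_000_000_000)
print(f"at least {per_billion_tb:.2f}TB of MDT space per billion files")
```

That is, at least about 1TB of MDT space per billion files before counting anything else stored on the MDT.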

> Am I off here? Have there been some developments in lustre that help this scenario (beyond small files being stored on the MDT directly)?

The Data-on-MDT feature (DoM) has landed for 2.11, which seems like it would suit your workload well, since it only needs a single MDT inode for small files, and reduces the overhead when accessing the file.  It will still be a couple of months before 2.11 is released, though you could start testing now if you were interested.  Currently DoM is intended to be used together with OSTs, but if there is demand we could look into what is needed to run an MDT-only filesystem configuration (some checks in the code that prevent the filesystem from becoming available before at least one OST is mounted would need to be removed).

That said, you could also just set up a single NFS server with ZFS to handle the 75TB * N of storage, unless you need highly concurrent access to the files.  This would probably be acceptable if you don't need to scale too much (in capacity or performance), and don't have a large number of clients connecting.

One of the other features we're currently investigating (not sure how much interest there is yet) is to be able to "import" an existing ext4 or ZFS filesystem into Lustre as MDT0000 (with DoM), and be able to grow horizontally by adding more MDTs or OSTs.  Some work is already being done that will facilitate this in 2.11 (DoM, and OI Scrub for ZFS), but more would be needed for this to work.  That would potentially allow you to start with a ZFS or ext4 NFS server, and then migrate to Lustre if you need to scale it up.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation