[Lustre-devel] Quota enforcement

Tue Apr 19 11:00:23 PDT 2011

On 2011-04-19, at 7:39 AM, "Eric Barton" <eeb at whamcloud.com> wrote:
> I'd like to take a fresh look at quota enforcement.  I think the
> current approach of trying to implement quota purely through POSIX
> APIs is flawed, and I'd like to open up a debate on alternatives.
> 
> If we go back to first premises, quota enforcement is about resource
> management - tracking and enforcing limits on consumption to ensure
> some measure of insulation between different users.  In general, when
> we have 'n' resources which are all consumed independently we should
> also track and enforce limits on each of these independently.

Agreed. In ldiskfs this means tracking inodes and blocks separately, because they have very different constraints. It also means that the quota for MDT and OST space needs to be tracked separately. 

> In conventional filesystems the relevant resources are inodes and
> blocks - which POSIX quota matches nicely.  Although it may seem to
> simplify quota management to equate the POSIX quota inode count with
> the MDS's inode count, and the POSIX quota block count with the sum of
> all blocks on the OSTs, it ignores the following issues...
> 
> 1. Block storage on the MDS must be sized to ensure it is not
>   exhausted before inodes on the MDS run out.  This requires
>   assumptions about the average size of Lustre directories and
>   utilisation of extended attributes.

Agreed. This is definitely the case with current MDTs - the amount of free space is currently always far in excess of what is needed for the number of inodes. For HDDs this is fine because MDT space is "free" when adding spindles for the IOPS.

In any case, the current quota does account for space usage on the MDT. This is important for future data-on-MDT cases. The inode consumption is also a useful metric because it allows the admin to track the fixed ldiskfs inode resource, and also to compute average file size on a per user basis. 
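
To illustrate the per-user average mentioned above, here is a minimal Python sketch. The usage figures and the helper name are made up for illustration; real numbers would come from whatever the quota accounting reports for blocks and inodes.

```python
def average_file_size(block_bytes_used, inodes_used):
    """Return mean bytes per file, or 0.0 for a user with no files."""
    if inodes_used == 0:
        return 0.0
    return block_bytes_used / inodes_used

# Hypothetical per-user usage: (bytes of block usage, inodes used)
usage = {
    "alice": (4 * 1024**4, 2_000_000),
    "bob": (10 * 1024**3, 5_000_000),
}

averages = {user: average_file_size(b, i) for user, (b, i) in usage.items()}
for user, avg in sorted(averages.items()):
    print(f"{user}: {avg / 1024:.1f} KiB per file")
```

A user with a very small average (like "bob" here) is consuming inodes much faster than blocks, which is exactly the case where the inode limit matters more than the block limit.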

> 2. Sufficient inodes must be reserved on the OSTs to ensure they are
>   not exhausted before block storage.  This requires assumptions
>   about the average Lustre file size and number of stripes.

That is true, but inevitable with the structure of ldiskfs. In the past we have FAR over-provisioned the inodes on ldiskfs OSTs by default because we cannot know in advance what the user is going to be doing with their filesystem.  There is guidance in the manual for tuning this at setup time, but not everyone reads the manual...

I consider the use of inodes/objects on OSTs to be a side-effect of the Lustre implementation, so as long as we have enough I don't think they should be accounted separately, and definitely not in the same bucket as MDT inodes. This is because the performance characteristics of a single file vary with the number of objects == OSTs, and the user may not even be controlling this mapping in the future with dynamic layouts. 

With newer filesystems that have dynamic inode allocation, the concept of inodes being a constrained resource on the OSTs is gone, and essentially this is just a small overhead of space on the OST, like an indirect block is.

> 3. Imbalanced OST utilization causes allocation failures while
>   resources are still available on other OSTs.

That is true whether the allocation failure is due to the OST filling up with a dynamic quota, or if the quota is running out on an OST with a static quota. Both problems would ideally be avoided by perfectly even usage, but that is not realistic. 
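
A toy Python model of why both cases fail the same way. This is only a sketch under simplified assumptions (even striping, invented numbers, hypothetical function name), not how the OSTs or the quota code actually account for space. The per-OST number can be read either as free space or as the remaining static quota on that OST.

```python
def try_extend(free_by_ost, stripe_osts, amount):
    """Extend a file by `amount` units spread evenly over its stripes.

    Returns True on success. Returns False (think ENOSPC or EDQUOT) without
    changing anything if any one OST in the stripe set cannot hold its share.
    """
    share = -(-amount // len(stripe_osts))  # ceiling division
    if any(free_by_ost[ost] < share for ost in stripe_osts):
        return False
    for ost in stripe_osts:
        free_by_ost[ost] -= share
    return True

# Hypothetical remaining space (or remaining per-OST quota) in GiB
free = {0: 1, 1: 500, 2: 500, 3: 500}
ok = try_extend(free, stripe_osts=[0, 1], amount=10)
total_free = sum(free.values())
print(ok, total_free)
```

The write is refused because OST 0 cannot take its share, even though the aggregate that 'df' would report is still large.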

> (3) is the most glaringly obvious issue.  It gives you ENOSPACE when
> you extend a file if one of the OSTs it's striped over is full.  Very
> irritating if 'df' reports that plenty of space is still available and
> it's not something the quota system itself can help you avoid.  
> 
> In fact quota enforcement currently takes pains to allow quota
> utilisation to become imbalanced across OSTs by dynamically
> distributing the user's quota to where it's being used.  This comes at
> a performance cost as quota nears exhaustion.  Provided the user
> operates well within her quota, quota is distributed in large units
> with low overhead.  However as she nears her limit, protocol overhead
> increases as quota is distributed in successively smaller units to
> ensure it is not all consumed prematurely on one OST.
> 
> An alternative approach to (3) is to move the usage to where the
> resources are - i.e. implement complex/dynamic file layouts that
> effectively allow files to grow dynamically into available free space.
> This works not just for quota enforcement but for all free space.
> However it also comes at the cost of increasing overhead as
> space/quota is exhausted.  It's also much harder to implement -
> especially for overwriting "holes" rather than simply extending files.

This doesn't seem like a huge win to me, and comes at a cost in implementation complexity. I agree that dynamic layouts are something that we've discussed for a long time already, and we are moving in that direction with the layout lock from HSM, but we are still some distance away from this today, IMHO.

We also have to consider that dynamic layout for large files may not always be a magic bullet. If there are many processes writing what the MDT considers "large" files, then always increasing the stripe count would only serve to increase contention on the OSTs, and not improve space balance at all.

> I'd dearly like some surveys of real-world systems to discover exactly
> how imbalanced utilisation can really become, both for individual
> users and also in aggregate to provide guidance on how to proceed.

The common case of severe imbalance that I'm aware of is a user creating a huge tar file of their output, but not specifying a wide striping, so thousands of files currently spread over many OSTs are copied onto one or two OSTs where the tar file is located.

The only other real-world case I'm aware of is applications/users specifying striping with an index == 0 instead of -1, which causes the first OST(s) to fill unevenly.  In some sense, this is an artifact of the first problem, that users need to specify striping in the first place.

With balanced round-robin allocation, the majority of "general" imbalance will be avoided, and dynamic layouts would help those extreme cases of large imbalanced files. 
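
The difference between a pinned starting index and a rotating one can be seen in a small Python simulation. This is a deliberately simplified sketch, not the real object allocator: the function name and parameters are hypothetical, and `start_index=None` stands in for "let the allocator choose" (index -1).

```python
def allocate(num_osts, num_files, stripe_count, start_index=None):
    """Count objects per OST after creating num_files files.

    start_index=None rotates the starting OST per file (a simplified
    round-robin); a fixed start_index pins every file to the same OSTs.
    """
    objects = [0] * num_osts
    next_start = 0
    for _ in range(num_files):
        first = next_start if start_index is None else start_index
        for k in range(stripe_count):
            objects[(first + k) % num_osts] += 1
        if start_index is None:
            next_start = (next_start + stripe_count) % num_osts
    return objects

balanced = allocate(num_osts=8, num_files=1000, stripe_count=2)
pinned = allocate(num_osts=8, num_files=1000, stripe_count=2, start_index=0)
print("round-robin:", balanced)
print("index 0:    ", pinned)
```

With the rotating start every OST ends up with the same number of objects, while pinning index 0 puts all of them on the first two OSTs.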

> I'm leaning towards static quota distribution since that matches the
> physical constraints, but it requires much better tools (e.g.  for
> rebalancing files and reporting not just utilization totals but also
> median/min etc).
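
As a rough idea of the kind of reporting meant here, a Python sketch that summarises per-OST utilisation rather than just the total. The per-OST numbers and the function name are invented for illustration; a real tool would read them from the servers.

```python
import statistics

def utilisation_summary(used_by_ost, size_by_ost):
    """Summarise per-OST fill fractions alongside the aggregate figure."""
    fractions = [used_by_ost[ost] / size_by_ost[ost] for ost in used_by_ost]
    return {
        "min": min(fractions),
        "median": statistics.median(fractions),
        "max": max(fractions),
        "total": sum(used_by_ost.values()) / sum(size_by_ost.values()),
    }

# Hypothetical usage and capacity per OST, in GiB
used = {0: 90, 1: 10, 2: 20, 3: 40}
size = {0: 100, 1: 100, 2: 100, 3: 100}
summary = utilisation_summary(used, size)
for name, value in summary.items():
    print(f"{name}: {value:.0%}")
```

The total alone hides the fact that one OST is nearly full; the min/median/max spread is what would tell an admin that rebalancing is needed.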

Cheers, Andreas