[Lustre-devel] Moving forward on Quotas

Nikita Danilov Nikita.Danilov at Sun.COM
Sat May 31 09:19:25 PDT 2008


Ricardo M. Correia writes:
 > Hi Nikita,

Hello,

(I reordered some of the comments below.)

 > Currently, the space reported by st_blocks is calculated from
 > dnode_phys_t->dn_used, which in recent SPA versions tracks the number of
 > allocated bytes (not blocks) of a DMU object, which is accurate up to
 > the last committed txg.
 > Is this what you mean by "space usage"?

I meant a counter of bytes or blocks that this object occupies for quota
purposes. I specifically don't want to identify `space usage' with
st_blocks, because for modern file systems there is no single right way
to define what to include in the quota: users want quotas to be
consistent with both df(1) and du(1), and in the presence of features
like snapshots this is not generally possible.

 > What do you mean by mount? Do you mean when starting an OST?

Yes, OST or MDT.

 > First of all, I think you would need to keep track of objects changed in
 > the last 2 synced transaction groups, not just the last one. The reason

Indeed, I omitted this for the sake of clarity.

 > group N+1 may already be quiescing. This presents a challenge because if
 > the machines crashes, you may lose data in 2 transaction groups, not
 > just 1, which I think would make things harder to recover..

Won't it be enough to record in the pending list the objects from the
last two transaction groups, if necessary?

 > 
 > Another problem it this: let's say the DMU is syncing a transaction
 > group, and starts calling ->space_usage() for objects. Now the machine
 > crashes, and comes up again.
 > Now how do you distinguish between which objects were called
 > ->space_usage() in the transaction group that was syncing and which
 > weren't (or how would you differentiate between ->space_usage() calls of

But we don't have to, if we make ->space_usage() idempotent, i.e., have
it take the absolute space usage as its last argument, rather than a
delta. In that case the DMU is free to call it multiple times, and the
client has to cope with this. (Hmm... I am pretty sure this is what I
was thinking about when composing the previous message, but a confusing
signed __s64 delta somehow got in, sorry.)
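To make the idempotence point concrete, here is a minimal model (not
actual DMU code; the names QuotaClient and space_usage are invented for
illustration) showing why an absolute value is safe to replay while a
delta is not:

```python
# Hypothetical model of an idempotent ->space_usage() callback: the DMU
# passes the *absolute* number of bytes the object occupies, not a delta.
class QuotaClient:
    def __init__(self):
        self.usage = {}  # object id -> absolute bytes used

    def space_usage(self, objid, nbytes):
        # Absolute value: replaying this call after a crash is harmless,
        # whereas applying a delta twice would double-count.
        self.usage[objid] = nbytes

    def total(self):
        return sum(self.usage.values())
```

Calling space_usage(42, 4096) once or ten times leaves the counter at
4096 either way, so the client need not remember which callbacks already
ran in the transaction group that was syncing when the machine crashed.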

 > >     - dmu internally updates space usage information in the context of
 > >       transaction being synced.
 > 
 > 
 > This is being done per-object already.

Aha, this simplifies the whole story significantly. If the dmu already
maintains for every object a space usage counter that is suitable for
quotas, then the `pending list' can be maintained by the dmu client,
without any additional support from the dmu:

    - when (as part of an open transaction) the client does an operation
      that can potentially modify space usage, it adds the object
      identifier to the pending list, implemented as a normal dmu object;

    - when disk space is actually allocated (the transaction group is in
      sync mode), the client gets a ->space_usage() call-back as above;

    - on a `mount' the client scans the pending list object, fetches
      space usage from the dmu, updates the client's internal
      data-structures, and prunes the pending list.

Of course, again, the pending list has to keep track of objects modified
in the last 2 transaction groups.
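The three steps above might be sketched roughly like this (a toy model;
QuotaRecovery, dmu_get_usage, and the txg bookkeeping are assumptions
for illustration, not the real interfaces):

```python
# Toy model of a client-side pending list covering the last two txgs.
class QuotaRecovery:
    def __init__(self, dmu_get_usage):
        self.dmu_get_usage = dmu_get_usage  # objid -> committed bytes
        self.pending = {}                   # objid -> txg of last change
        self.usage = {}                     # client's in-memory counters

    def record_change(self, objid, txg):
        # Open-transaction side: remember the object may change size.
        self.pending[objid] = txg

    def space_usage(self, objid, nbytes):
        # Sync-side callback: absolute usage, so it is idempotent.
        self.usage[objid] = nbytes

    def mount(self, last_synced_txg):
        # On mount: re-fetch usage for every pending object from the
        # dmu, then prune entries older than the last two txgs.
        for objid in list(self.pending):
            self.usage[objid] = self.dmu_get_usage(objid)
            if self.pending[objid] <= last_synced_txg - 2:
                del self.pending[objid]
```

If the machine crashes before a ->space_usage() call-back runs, the
mount-time scan recovers the correct counter directly from the dmu, so
no callback is ever lost.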

With the help of a commit call-back even ->space_usage() seems
unnecessary, because at commit time the client can scan the pending list
(in memory). Heh, it seems that quota can be implemented completely
outside of the dmu.
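With commit call-backs the same bookkeeping can stay entirely in memory.
Again a sketch, with invented names; on_commit stands in for whatever
commit-callback interface the dmu ends up exposing:

```python
# Sketch: quota maintained purely via a commit callback, with no
# ->space_usage() needed.  The client tracks per-txg usage in memory
# and folds it into stable counters when the txg commits.
class CommitCbQuota:
    def __init__(self):
        self.committed = {}  # objid -> stable byte count
        self.open_txgs = {}  # txg -> {objid: bytes after the change}

    def record(self, txg, objid, nbytes):
        # Open-transaction side: in-memory pending list, keyed by txg.
        self.open_txgs.setdefault(txg, {})[objid] = nbytes

    def on_commit(self, txg):
        # Commit callback: scan and fold this txg's pending entries.
        for objid, nbytes in self.open_txgs.pop(txg, {}).items():
            self.committed[objid] = nbytes
```

The trade-off is that a crash discards the in-memory state, so the
persistent pending-list object is still needed for mount-time recovery.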

 > 
 > And furthermore, I think this kind of recovery could be better
 > implemented using commit callbacks, which is an abstraction already
 > designed for recovery purposes and which is backend-agnostic.

Sounds interesting, can you elaborate on this?

 > Perhaps I am concentrating too much on correctness.. maybe going
 > over a quota is not too big of a deal. I remember some conversations
 > between Andreas and the ZFS team which implied that not having 100%
 > correctness is not too big of a problem. However, I am not so sure
 > about grants.. :/

It's my impression too that the agreement was to sacrifice some degree
of correctness to simplify implementation.

 > 
 > Regards,
 > Ricardo

Nikita.
