[Lustre-devel] Moving forward on Quotas
Nikita Danilov
Nikita.Danilov at Sun.COM
Sat May 31 09:19:25 PDT 2008
Ricardo M. Correia writes:
> Hi Nikita,
Hello,
(I reordered some of the comments below.)
> Currently, the space reported by st_blocks is calculated from
> dnode_phys_t->dn_used, which in recent SPA versions tracks the number of
> allocated bytes (not blocks) of a DMU object, which is accurate up to
> the last committed txg.
> Is this what you mean by "space usage"?
I meant a counter of bytes or blocks that this object occupies for
quota purposes. I specifically don't want to identify `space usage'
with st_blocks, because for modern file systems there is no single
canonical way to define what counts against a quota: users want quotas
to be consistent with both df(1) and du(1), and in the presence of
features like snapshots this is not generally possible.
> What do you mean by mount? Do you mean when starting an OST?
Yes, OST or MDT.
> First of all, I think you would need to keep track of objects changed in
> the last 2 synced transaction groups, not just the last one. The reason
Indeed, I omitted this for the sake of clarity.
> group N+1 may already be quiescing. This presents a challenge because if
> the machine crashes, you may lose data in 2 transaction groups, not
> just 1, which I think would make things harder to recover..
Won't it be enough to record in the pending list the objects from the
last two transaction groups, if necessary?
>
> Another problem is this: let's say the DMU is syncing a transaction
> group, and starts calling ->space_usage() for objects. Now the machine
> crashes, and comes up again.
> Now how do you distinguish between which objects were called
> ->space_usage() in the transaction group that was syncing and which
> weren't (or how would you differentiate between ->space_usage() calls of
But we don't have to, if we make ->space_usage() idempotent, i.e.,
have it take the absolute space usage as its last argument rather than
a delta. In that case the DMU is free to call it multiple times, and
the client has to cope with this. (Hmm... I am pretty sure this is
what I was thinking about when composing the previous message, but a
confusing signed __s64 delta somehow got in, sorry.)
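The idempotency argument above can be sketched in C. Note that the
names here (quota_ctx, space_usage, quota_total) are hypothetical, not
the real DMU interface; the point is only that a callback receiving
the absolute byte count of an object can be replayed any number of
times without corrupting the total, whereas a delta-based callback
could not.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QUOTA_MAX_OBJS 16

/* Hypothetical per-client quota state: absolute usage per object. */
struct quota_ctx {
        uint64_t objid[QUOTA_MAX_OBJS];
        uint64_t bytes[QUOTA_MAX_OBJS];
        int      nobjs;
};

/*
 * Idempotent ->space_usage() callback: record the absolute space
 * usage of an object.  The value is overwritten, never accumulated,
 * so repeated calls for the same object in the same txg are harmless.
 */
void
space_usage(struct quota_ctx *qc, uint64_t objid, uint64_t abs_bytes)
{
        int i;

        for (i = 0; i < qc->nobjs; i++) {
                if (qc->objid[i] == objid) {
                        qc->bytes[i] = abs_bytes; /* replay-safe update */
                        return;
                }
        }
        assert(qc->nobjs < QUOTA_MAX_OBJS);
        qc->objid[qc->nobjs] = objid;
        qc->bytes[qc->nobjs] = abs_bytes;
        qc->nobjs++;
}

/* Total usage chargeable to the quota across all tracked objects. */
uint64_t
quota_total(const struct quota_ctx *qc)
{
        uint64_t total = 0;
        int i;

        for (i = 0; i < qc->nobjs; i++)
                total += qc->bytes[i];
        return total;
}
```

With a delta argument, the duplicated call after a crash would double
the charge; with the absolute value it is a no-op.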
> > - dmu internally updates space usage information in the context of
> > transaction being synced.
>
>
> This is being done per-object already.
Aha, this simplifies the whole story significantly. If the DMU already
maintains for every object a space usage counter that is suitable for
quota, then the `pending list' can be maintained by the DMU client,
without any additional support from the DMU:
- when (as part of an open transaction) the client performs an
  operation that can potentially modify space usage, it adds the
  object identifier to the pending list, implemented as a normal
  dmu object;
- when disk space is actually allocated (the transaction group is
  in sync mode), the client gets the ->space_usage() call-back as
  above;
- on a `mount', the client scans the pending list object, fetches
  space usage from the dmu, updates its internal data-structures,
  and prunes the pending list.
Of course, again, the pending list has to keep track of objects
modified in the last two transaction groups.
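A minimal sketch of such a pending list, keeping objects from the last
two transaction groups by bucketing on txg parity (all names here are
hypothetical, not DMU API): a bucket is recycled only when a txg two
generations newer reuses its slot, so at any moment the two most
recent txgs' objects are retained for the mount-time scan.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PL_MAX_OBJS 32

/* Two buckets, one per txg parity, covering the last two txgs. */
struct pending_list {
        uint64_t pl_txg[2];                /* txg owning each bucket */
        uint64_t pl_objs[2][PL_MAX_OBJS];  /* object ids per bucket  */
        int      pl_nobjs[2];
};

/* Called from an open transaction when an operation may change usage. */
void
pending_add(struct pending_list *pl, uint64_t txg, uint64_t objid)
{
        int slot = (int)(txg & 1);      /* alternate between buckets */

        if (pl->pl_txg[slot] != txg) {  /* bucket is two txgs old: reuse */
                pl->pl_txg[slot] = txg;
                pl->pl_nobjs[slot] = 0;
        }
        assert(pl->pl_nobjs[slot] < PL_MAX_OBJS);
        pl->pl_objs[slot][pl->pl_nobjs[slot]++] = objid;
}

/* Mount-time scan: count objects recorded in the last two txgs. */
int
pending_count(const struct pending_list *pl)
{
        return pl->pl_nobjs[0] + pl->pl_nobjs[1];
}
```

On a real OST the scan would re-fetch each object's usage from the DMU
and rebuild the in-memory quota counters before pruning the list.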
With the help of commit call-backs, even ->space_usage() seems
unnecessary, because at commit time the client can scan the (in
memory) pending list. Heh, it seems that quota can be implemented
completely outside of the dmu.
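The commit-call-back variant could look roughly like this (again,
hypothetical names, not the real interface): the client registers a
hook per transaction, and the backend fires all hooks exactly once
after the txg is durably committed, at which point the client applies
and prunes its in-memory pending list.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef void (*commit_cb_t)(void *arg);

#define CB_MAX 8

/* Callbacks registered against one transaction group. */
struct txg_cbs {
        commit_cb_t cb[CB_MAX];
        void       *arg[CB_MAX];
        int         ncb;
};

/* Client side: register a hook while the transaction is open. */
void
register_commit_cb(struct txg_cbs *t, commit_cb_t cb, void *arg)
{
        assert(t->ncb < CB_MAX);
        t->cb[t->ncb] = cb;
        t->arg[t->ncb] = arg;
        t->ncb++;
}

/* Backend side: invoked once the txg is durably committed. */
void
txg_committed(struct txg_cbs *t)
{
        int i;

        for (i = 0; i < t->ncb; i++)
                t->cb[i](t->arg[i]);
        t->ncb = 0;  /* each callback fires exactly once per commit */
}

/* Example callback: account one pending quota update. */
static void
apply_pending(void *arg)
{
        (*(uint64_t *)arg)++;
}
```

Because the hook runs only after commit, no on-disk ->space_usage()
record is needed for recovery: uncommitted updates simply never fire.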
>
> And furthermore, I think this kind of recovery could be better
> implemented using commit callbacks, which is an abstraction already
> designed for recovery purposes and which is backend-agnostic.
Sounds interesting, can you elaborate on this?
> Perhaps I am concentrating too much on correctness.. maybe going
> over a quota is not too big of a deal, I remember some conversations
> between Andreas and the ZFS team which implied that not having 100%
> correctness is not too big of a problem. However, I am not so sure
> about grants.. :/
It's my impression too that the agreement was to sacrifice some degree
of correctness to simplify implementation.
>
> Regards,
> Ricardo
Nikita.