[Lustre-devel] Lustre HSM HLD draft
Nathaniel Rutman
Nathan.Rutman at Sun.COM
Mon Feb 11 12:33:03 PST 2008
Aurelien Degremont wrote:
> Nathaniel Rutman a écrit :
>
>> 5.1 external storage list - is this to be stored on the MGS device or a
>> separate device? If the coordinator lives on the MGS, why not use its
>> storage as well? In any case, it should be possible to co-locate the
>> coordinator on the MGS and use the MGS's storage device, in the same
>> way that the MGS can currently co-locate with the MDT.
>> How does the coordinator request activity from an agent? If the
>> coordinator is the RPC server, then it's up to the agents to make
>> requests; agents aren't listening for RPC requests themselves.
>>
>
> Nothing currently says that the coordinator will live on the MGS.
> The coordinator's constraints are:
> 1 - Must receive various migration requests from OST/MDT.
> 2 - Should be able to communicate with agents and ask them to perform
> migrations.
> 3 - Should store configuration and migration logs.
>
> I think #1 and #2 are two different APIs. The coordinator is clearly an
> RPC server for the first one. How #2 should be implemented is not so
> clear. What would be the "Lustre way" here?
>
With userspace servers, presumably we have some way of passing LNET
messages from kernel to userspace. We should probably still go through
LNET for #2 in order to use the broadest range of network fabrics, so it
could be the same or a similar RPC. There is no "Lustre way" for this
area - we've never done this kind of thing before.
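To make the #2 discussion concrete, here is a sketch of what a coordinator-to-agent migration request payload might look like if it is carried as an RPC like the rest of Lustre traffic. All names and fields here are illustrative assumptions, not taken from the HLD:

```c
/* Hypothetical coordinator-to-agent migration request, assuming
 * requests travel over LNET like other Lustre RPCs.  All names are
 * illustrative, not from the HLD. */
#include <stdint.h>

enum hsm_action {
        HSMA_ARCHIVE = 1,   /* copy file data out to the HSM */
        HSMA_RESTORE = 2,   /* copy file data back from the HSM */
        HSMA_PURGE   = 3,   /* release the local copy of archived data */
};

struct hsm_agent_request {
        uint32_t har_action;     /* one of enum hsm_action */
        uint64_t har_fid_seq;    /* target file FID, sequence part */
        uint32_t har_fid_oid;    /* target file FID, object id part */
        uint64_t har_offset;     /* start of the byte range to act on */
        uint64_t har_length;     /* length of the range */
        uint32_t har_ext_id_len; /* length of the opaque external ID */
        char     har_ext_id[];   /* opaque external ID, owned by the tool */
};
```

The external ID stays a variable-length opaque blob at the end, matching the "totally opaque for Lustre" requirement discussed below.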
> For #3, the few logs that will be backed up here are not huge, and they
> surely could be colocated with another target, but I'm not sure this
> should be mandatory. This device should be available to several servers,
> for failover like the other targets. We could imagine having more than
> one coordinator in the long term. I'm not sure it is a good idea to tie
> it to another target.
>
Not mandatory, but making it possible is nice, to minimize the number of
required partitions.
>
>> 6.3 object ref should include version number. Also include checksum?
>>
>
> For data coherency? Should we add an explicit checksum for those values
> (stored in an EA), or rely on a backend feature (can ZFS and ldiskfs
> detect EA value corruption by themselves)?
>
ZFS can, ldiskfs cannot. Anyhow, it was just a thought. Doesn't hurt
to allow space for it.
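Allowing space for it could look like the following sketch of the EA payload, with a version number as suggested and room reserved for a checksum so a backend without EA self-checking (ldiskfs) can still detect corruption. Field names are assumptions, not from the HLD:

```c
/* Illustrative on-disk layout for the HSM object reference stored in
 * an EA.  The checksum fields are optional (type 0 = none); the
 * external ID remains opaque to Lustre.  Names are assumptions. */
#include <stdint.h>

struct hsm_object_ref {
        uint32_t hor_version;       /* bumped on each (re)archive */
        uint32_t hor_checksum_type; /* 0 = none, else algorithm id */
        uint8_t  hor_checksum[16];  /* room for e.g. an MD5 digest */
        uint32_t hor_ext_id_len;    /* length of the opaque external ID */
        uint8_t  hor_ext_id[];      /* opaque, owned by the archiving tool */
};
```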
>
>> 2.1 Archiving one Lustre file
>> There should not be a cache miss when archiving a Lustre file; perhaps
>> open-by-fid is intended to bypass atime updates
>> so that the file isn't marked as "recently accessed"?
>>
> > Transparent access - should this avoid modification of atime/mtime?
>
> I would say yes.
>
>
>> 2.2 Restoring a file
>> "External ID" presumably contains all information required to retrieve
>> the file - tape #, path name, etc.?
>> Once the file is copied back, we should probably restore the original
>> ctime, mtime, atime - the coordinator is storing this, correct?
>>
>
> External ID is an opaque value managed by the archiving tool. If the HSM
> can store a lot of metadata, only a ref is needed; if not, the tool is
> responsible for storing all the data it needs. Anyway, this is totally
> opaque to Lustre.
> I hope the HSMs will not need much data in this field. HPSS does not;
> it uses its internal DB to store it. I suppose SAM does the same.
>
What about restoring the original ctime, mtime, atime? I think we must
store them in the coordinator, because we must work with all HSMs, and I
think it is important to restore them.
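One caveat worth noting: atime and mtime can be restored directly, but ctime cannot be set from userspace, so restoring it would require a server-side path. A minimal sketch of the userspace part, assuming the agent does the copy-in and then resets the saved timestamps with utimes(2):

```c
/* Minimal sketch: after the copy-in completes, restore the saved
 * atime/mtime with utimes(2).  Note that ctime cannot be set this way;
 * the coordinator can only record it, and restoring it would need a
 * server-side mechanism.  Hypothetical helper, not from the HLD. */
#include <sys/time.h>
#include <stdio.h>

static int restore_times(const char *path, time_t atime, time_t mtime)
{
        struct timeval tv[2] = {
                { .tv_sec = atime, .tv_usec = 0 },  /* new atime */
                { .tv_sec = mtime, .tv_usec = 0 },  /* new mtime */
        };

        if (utimes(path, tv) != 0) {
                perror("utimes");
                return -1;
        }
        return 0;
}
```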
>
>> IV2 - why not multiple purged windows? Seems like if you're going to
>> purge 1 object out of a file, you might want to purge more.
>> Specifically, it will probably be a common case to purge every object of
>> a file from a particular OST. This is not contiguous in a
>> striped file.
>> I don't see any reason to purge anything smaller than an entire object
>> on an OST - is there good reason for this?
>>
>
> Multiple purged windows are subtle. If you permit this feature, you
> could technically have, in the worst case, one purged window per byte,
> and that could be very expensive to store. Do you think you will make
> several holes in the same file? In which cases?
>
Like I said, I don't see any reason to purge anything smaller than a
full object; I would in fact disallow purging of an arbitrary byte
range, and only allow purging on full-object boundaries.
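The alignment rule argued for here can be sketched as follows. For simplicity this works at stripe-unit granularity rather than whole objects; the key point is that the range must shrink, never grow, since purging less than requested is always safe. The helper name and interface are assumptions:

```c
/* Hypothetical sketch: align a requested purge range inward to full
 * stripe-unit boundaries.  Rounding the start up and the end down
 * shrinks the purged range, which is the safe direction.  Returns 1
 * with the aligned range, or 0 if no full unit fits inside. */
#include <stdint.h>

static int purge_align(uint64_t start, uint64_t end, uint64_t stripe_sz,
                       uint64_t *astart, uint64_t *aend)
{
        uint64_t s = (start + stripe_sz - 1) / stripe_sz * stripe_sz;
        uint64_t e = end / stripe_sz * stripe_sz;

        if (s >= e)
                return 0;       /* no full stripe unit inside the range */
        *astart = s;
        *aend = e;
        return 1;
}
```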
> In fact, the more common case is to totally purge a file which has been
> migrated to the HSM; keeping the start and the end of the file on disk
> is only an optimisation, to avoid triggering tons of cache misses with
> commands like "file foo/*" or a tool like Nautilus or Windows Explorer
> browsing the directory.
>
Again, since Lustre is optimized to work with 1MB chunks anyhow, I don't
think it helps much to keep less than that in the beginning/end objects,
so I would say just keep the first and last blocks instead.
> The purged window is stored per object, OST object and MDT object.
> So, if several objects are purged, each object will store its own purged
> window. But the MDT object describing this file will store a special
> purged window which starts at the smallest unavailable byte and ends
> just after the last one. The MDT purged window means "if you do I/O in
> this range, you're not sure the data are there" or, equivalently,
> "outside of this area, I guarantee the data are present."
> Maintaining multiple purged windows would be a headache, with no real
> need I think.
> Moreover, people have asked for an OST-object based migration, even if I
> think whole-file migration will be the most common case.
>
>
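The per-object scheme quoted above implies a simple merge rule for the MDT's window: it must run from the lowest purged byte to the highest, so the "outside this area, data are present" guarantee holds. A sketch, assuming the per-object windows have already been mapped to file offsets (names are illustrative):

```c
/* Merge per-OST-object purged windows (in file-offset terms) into the
 * single conservative MDT window described above.  An empty result
 * ({0, 0}) means nothing is purged.  Hypothetical helper names. */
#include <stdint.h>

struct purged_window {
        uint64_t pw_start;  /* first byte not guaranteed present */
        uint64_t pw_end;    /* first byte after the window */
};

static struct purged_window
mdt_merge_windows(const struct purged_window *obj, int n)
{
        struct purged_window w = { UINT64_MAX, 0 };

        for (int i = 0; i < n; i++) {
                if (obj[i].pw_start >= obj[i].pw_end)
                        continue;               /* empty window, skip */
                if (obj[i].pw_start < w.pw_start)
                        w.pw_start = obj[i].pw_start;
                if (obj[i].pw_end > w.pw_end)
                        w.pw_end = obj[i].pw_end;
        }
        if (w.pw_end == 0)
                w.pw_start = 0;                 /* nothing purged at all */
        return w;
}
```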
>
>> If that's the case, then the OST must keep track of purged objects,
>> not ranges within an existing object.
>>
>
> Objects are not removed, only their data. All metadata are kept.
>
>
>> If the MDT is tracking purged areas also, then there's a good potential
>> synergy here with a missing OST --
>> If the missing OST's objects are marked as purged, then we can
>> potentially recover them automatically from
>> HSM...
>>
>
> What do you call a "missing OST"? A corrupt one? An offline one?
> An unavailable one?
>
Yes. All of the above. Obviously we need to distinguish between
"permanently gone" and "temporarily gone".
> Where will you copy back the object data? Onto another OST object?
>
Yes. Some kind of recovery will take place to generate a new object on a
different OST, and we can restore the data there.
> With the purged window on each OST object and MDT and the file striping
> info, we could easily restore the missing parts.
>
Exactly. This is why I say we should think about this now, to allow for
this possibility.
>
>> 4.2 How is a purge request recovered? For example, the MDT says "purge
>> obj1" to ost1, and ost1 replies "ok" but then dies before it actually
>> does the purge. It reboots knowing nothing about the purge request,
>> but the MDT has marked it as purged.
>>
>
> The OST asynchronously acknowledges the purge when it is done. The MDT
> marks it purged only when it is really done. I will clarify this.
>
>
>> V2.1 How long does OST wait for completion? Is there a timeout? We
>> probably need a "no timeout if progress is being
>> made" kind of function - clients currently do this kind of thing with OSTs.
>>
>
> I'm sure Lustre already has similar mechanisms for optimized timeouts
> in this kind of situation that we could reuse here.
> What you describe is a good approach, I think.
>
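A "no timeout while progress is being made" rule could be as simple as extending an absolute deadline whenever the agent reports more bytes copied. This is only a sketch with assumed names, not a claim about how the existing client/OST timeout code works:

```c
/* Hypothetical progress-aware timeout: the deadline is pushed out by a
 * grace interval each time a progress report shows more bytes done, so
 * a slow-but-moving copy never times out.  Names are illustrative. */
#include <stdint.h>
#include <stdbool.h>

struct copy_watch {
        uint64_t cw_bytes_done; /* bytes reported copied so far */
        uint64_t cw_deadline;   /* absolute expiry time, in seconds */
        uint64_t cw_grace;      /* extension granted per progress report */
};

/* Returns true if the request should now be failed (timed out). */
static bool copy_watch_update(struct copy_watch *cw, uint64_t now,
                              uint64_t bytes_done)
{
        if (bytes_done > cw->cw_bytes_done) {
                cw->cw_bytes_done = bytes_done;
                cw->cw_deadline = now + cw->cw_grace;  /* progress: extend */
        }
        return now > cw->cw_deadline;
}
```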
>
>> V2.2 No need to copy-in purged data on full-object-size writes.
>>
>
> True. We could add such an optimization. But this is only useful for
> small files or very widely striped ones, isn't it?
>
No, we very frequently write entire stripes (objects). Lustre clients
can optimize for this.
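The copy-in skip condition is easy to state precisely: restore is only needed if some purged byte would survive the write. A sketch with assumed names:

```c
/* Hypothetical check for the optimization above: skip copy-in when an
 * incoming write [w_start, w_end) completely covers the object's
 * purged window, since the restored bytes would be overwritten. */
#include <stdint.h>
#include <stdbool.h>

static bool write_needs_copyin(uint64_t w_start, uint64_t w_end,
                               uint64_t purged_start, uint64_t purged_end)
{
        if (purged_start >= purged_end)
                return false;   /* nothing is purged in this object */
        /* Copy-in is needed only if a purged byte survives the write. */
        return !(w_start <= purged_start && w_end >= purged_end);
}
```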
>
> Thanks for your comments.
>
>