[Lustre-devel] Lustre HSM vs ADM HSM

Wed Jul 16 11:10:03 PDT 2008

During the last few months we have received a few questions about our HSM
architecture.  This email explains the relationship and differences between
an Open Source HSM (ADM from Sun) and the Lustre HSM design (which was done
between CEA and some folks from the Lustre team).  Likely this discussion
applies equally well when comparing Lustre with other HSMs.  Many thanks to
Rick Matthews for educating me patiently.

The conclusions at the end of this email may be of particular interest.

References: 

Lustre HSM architecture:
http://arch.lustre.org/index.php?title=HSM_Migration
High level design: https://bugzilla.lustre.org/attachment.cgi?id=16341
CEA slides: attached (please post in a findable place on wiki.lustre.org)
ADM: http://opensolaris.org/os/project/adm/WhatisADM/

Each of the following sections describes how elements of the Lustre HSM
architecture relate to the ADM architecture.

Event management

HSMs capture events in the file system for the following purposes:

1. archiving a file
2. restoring a file
3. policy management, such as purging less used files

A general mechanism to manage such events is provided by DMAPI, which
effectively places the burden of managing the events on a user space
daemon.  Lustre chose not to use DMAPI.

Lustre has multiple servers and events can be detected both on OSS and on
MDS nodes.  Detecting file IO on OSS nodes allows one to know exactly what
region of a file will be read/written and adjust actions accordingly (e.g.
if the first 4K of a purged file was left on the disk and that block is
read, no HSM restore action is needed).
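
To make the OSS-side decision concrete, here is a minimal Python sketch of
the check for reads.  The function name and the idea of tracking a single
"retained prefix" length per purged file are assumptions made for
illustration; they are not taken from the Lustre design documents.

```python
def read_needs_restore(offset: int, length: int, retained_bytes: int) -> bool:
    """Return True if a read touches data beyond the prefix left on disk.

    offset, length  -- the byte range the client wants to read
    retained_bytes  -- hypothetical: how many leading bytes of the purged
                       file were left on the OSS (e.g. 4096)
    """
    if length <= 0:
        return False  # an empty read never needs archived data
    return offset + length > retained_bytes


# A read of the first 4K of a file with a 4K retained prefix needs no restore.
print(read_needs_restore(0, 4096, 4096))
# A read that crosses the end of the retained prefix does.
print(read_needs_restore(4095, 2, 4096))
```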

Events are detected by initiators and logged transactionally, so that in
case of a power outage no scans of the file system are required. Note that
ADM is targeting ZFS DMU data stores, and in these stores searches in the
file tree can dynamically generate event logs similar to Lustre logs (Lustre
on ZFS will use these searches and abandon its own logging system).   A
precise mechanism is required to determine which search results are still
relevant.

A single system call may generate many events; reducing them is called
filtering.  Some filters are always valuable, while others express user
policy (e.g. a decision not to archive mp3 files).   The initiators will
filter out events that are always unnecessary, such as multiple clients
triggering a restore on the same object, and send the rest to coordinators.
The coordinators can be a (failover) collection of load balancing systems,
organized in such a manner that events for one file all reach the same
coordinator.

The coordinator applies further filtering, for example if multiple OSS nodes
request the restore of one file the coordinator will make sure that this
leads to one action.
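
A minimal Python sketch of this kind of coordinator-side filtering follows.
The class and method names are hypothetical and the in-memory dictionary is
only a stand-in; a real coordinator would need persistent, failover-safe
state.

```python
class RestoreCoordinator:
    """Hypothetical sketch: collapse duplicate restore requests per file."""

    def __init__(self):
        # fid -> set of requesters (e.g. OSS names) waiting on the restore
        self.in_flight = {}

    def request_restore(self, fid, requester):
        """Record the request; return True only if a new action is needed."""
        waiters = self.in_flight.get(fid)
        if waiters is not None:
            waiters.add(requester)  # restore already running, just wait
            return False
        self.in_flight[fid] = {requester}
        return True

    def restore_done(self, fid):
        """Return everyone who should be notified that the file is back."""
        return self.in_flight.pop(fid, set())
```

With this sketch, two OSS nodes asking for the same file cause a single
dispatch, and both are returned by restore_done() once the agent reports
completion.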

The coordinator will also implement the optional filters arising from
policy.   The discussions on the lustre-devel mailing list have mentioned
that we would simply extend the policy options by adding them to the
coordinator(s).  This is not sufficient.

Coordinators dispatch events for HSM action to agents.  Lustre allows
multiple agents to collaborate on one event and the coordinator observes
completion by each agent.  Agents in turn invoke archiving tools, which
might run in user space, to move files to and from the file system, and
agents can also abort on-going actions at the request of a coordinator.

Another way in which agents can be used is to deliver events to an event
manager for a system like ADM.  This was not previously considered in the
Lustre architecture, but it seems to be a natural way to couple the two
systems.  If the coordinators handle separate subsets of the file system in
a load balanced manner (e.g. by hashing the FIDs) this might be a good way
to horizontally scale ADM.
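
As an illustration of hashing FIDs to coordinators, here is a small Python
sketch.  It assumes a FID can be treated as a (sequence, object id, version)
triple of integers and that the number of coordinators is fixed; the function
name and the choice of SHA-1 are arbitrary and only serve the example.

```python
import hashlib
import struct


def coordinator_for(fid, num_coordinators):
    """Map a FID to a coordinator index so one file always hits one node.

    fid is assumed to be a (seq, oid, ver) tuple of non-negative integers.
    A stable hash is used rather than Python's built-in hash() so that
    every initiator computes the same answer.
    """
    seq, oid, ver = fid
    packed = struct.pack("<QII", seq, oid, ver)
    digest = hashlib.sha1(packed).digest()
    return int.from_bytes(digest[:8], "little") % num_coordinators


# Every event for this file is routed to the same coordinator.
print(coordinator_for((0x200000400, 7, 0), 4))
```

A plain modulo like this reshuffles most files when the coordinator count
changes, so a real deployment would need some care around reconfiguration
and failover.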

Some events need synchronous handling, principally to restore files.
Lustre has not addressed how its adaptive timeout system works with HSM to
let clients wait until file restoration is complete.  (For another timeout
issue see space management below.)

Omissions in the Lustre event management architecture are that (A) no
mechanism was introduced to deliver the policy to all the coordinators
(through the Lustre management server probably, using the standard
configuration lock callbacks), and that (B) no attempt was made to define a
policy language with a user interface.

Lustre's logging of events seems very desirable, but the cancellation of log
entries when no longer required has not yet been architected (and there are
other consumers of the logs).

Strengths of this architecture are: transactional management of events to
eliminate scanning, kernel based filtering close to the source, and
scalability across initiators, coordinators and agents.

The ADM architecture has a DMAPI event handling mechanism, which is
effectively single node, and has not yet addressed the multi server issues.
ADM has a clear interface for managing policy, but the details for a policy
language remain under discussion.  ADM stores some events in a database in
user space.  Lustre retains them in the kernel generated logs.

HSM metadata

After discussion on this list we have decided to implement minimal metadata
in the file system to locate archived copies and manage copy-in.   This is
not different from ADM in its design now, and allows for flexible management
of storing multiple versions, and for searches among HSM objects to be
performed in a database with suitable indexes (instead of through file
system scans).

The key bits of the attributes are:

1. an indication that there is a copy in the archive
2. an indication that the copy in the archive is current
3. an indication that the file is being restored
4. an offset indicating what extent of the file has already been restored
5. file size and disk usage for use by stat(2) while the file is in the
archive
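
The attribute list above could be modelled roughly as follows.  This is only
a Python sketch: the flag names, field names and the can_purge() rule are
assumptions chosen for illustration and do not reflect an actual Lustre
on-disk format.

```python
from dataclasses import dataclass
from enum import IntFlag


class HsmState(IntFlag):
    EXISTS = 0x1     # a copy exists in the archive
    CURRENT = 0x2    # the archived copy matches the file contents
    RESTORING = 0x4  # a restore is in progress


@dataclass
class HsmAttrs:
    state: HsmState = HsmState(0)
    restored_offset: int = 0  # extent of the file already restored
    size: int = 0             # file size reported by stat(2) while archived
    blocks: int = 0           # disk usage reported by stat(2) while archived


def mark_modified(attrs):
    """A write to the file makes the archived copy stale."""
    attrs.state &= ~HsmState.CURRENT


def can_purge(attrs):
    """Assumed rule: only purge data that is safely and currently archived."""
    wanted = HsmState.EXISTS | HsmState.CURRENT
    return (attrs.state & wanted) == wanted and not (attrs.state & HsmState.RESTORING)
```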

The HSM database will hold:
1. dates, owners etc. associated with the file, for HSM policy (see below)
2. a mapping from a Lustre FID to a primary HSM object
3. a list of other HSM objects associated with the FID, which can be copied
back into a new file in the file system
4. striping metadata for use by restores like in (2) or bare metal restores

This is similar between the two architectures, but Lustre will store this
metadata transactionally on the node that manages the object to which the
metadata is attached.

Policy - small files

The ADM policy manager (in user space) can instruct the work list generator
to build an archive for a collection of small files.  The archive can be
transferred to tape, instead of the individual small files.  In the HSM
database FIDs for many small files will point to one object in the archive.
Each of the FIDs in the file system will get its own EA to indicate that the
file is archived; this is undesirable as it involves a secondary update of
the small files.

Lustre currently has no design to build a collection of small files.  One
key issue with this is that the list needs to be retained until sufficiently
many small files are available to form a good size archival file; doing
this in-kernel may be problematic.
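
To show what "retaining the list" amounts to, here is a minimal user space
Python sketch that accumulates small files until a target archive size is
reached.  The class name, the size threshold and the in-memory list are all
hypothetical; a real implementation would have to keep the pending list
persistently so it survives a restart.

```python
class SmallFileBatcher:
    """Hypothetical sketch: group small files into one archival unit."""

    def __init__(self, target_bytes):
        self.target_bytes = target_bytes  # desired size of one archive file
        self.pending = []                 # (fid, size) pairs not yet archived
        self.pending_bytes = 0

    def add(self, fid, size):
        """Queue a file; return the list of FIDs to archive once full."""
        self.pending.append((fid, size))
        self.pending_bytes += size
        if self.pending_bytes < self.target_bytes:
            return None  # keep waiting for more small files
        batch, self.pending = self.pending, []
        self.pending_bytes = 0
        return [f for f, _ in batch]
```

A real policy would probably also flush on a timer, so that a trickle of
small files does not stay unarchived indefinitely.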

Archiving many small files can be an Achilles heel for HSM systems.  Lustre
can distribute archival associated with rapid small file creation over many
coordinating nodes.   A single HSM database cannot scale in performance to
keep up with multiple metadata servers creating new archival events, but
multiple independent databases could easily be used and clustered as the
coordinators do themselves.

If Lustre is coupled to ADM by delivering events to the ADM event manager
from a Lustre agent, then small files can be managed by ADM.

Kernel based mechanisms to form filesets of small files to be archived might
prove much more efficient.  For Lustre's file set architecture see
http://arch.lustre.org/index.php?title=Fileset

Policy - space management

Lustre plans ad-hoc policy for space management.  Based on a scan or on
least-recently-used kernel log files, files can be selected for purging.
Note that such scans rarely need refreshing; a single scan should yield
candidates for purging for a long time.  The log is efficient to maintain
and ZFS can likely search through its object tree to produce such lists.

A better implementation would allow more flexible policies for space
management to be expressed in the policy language.

Both Lustre and ADM will typically migrate files before they need to be
purged; this eliminates most performance issues during space management.
Both systems have low and high water marks.  However, when the file system
is really full the space manager may have to invoke archiving.  This should
be done by asking the coordinator to archive certain files.
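
The interplay of water marks, purging and forced archiving can be sketched in
Python as below.  The function name, the byte-based thresholds and the shape
of the least-recently-used list are assumptions for illustration only.

```python
def plan_space_reclaim(used, high_water, low_water, lru_files):
    """Decide what to purge, and what must be archived first.

    used, high_water, low_water -- byte counts (hypothetical units)
    lru_files -- iterable of (fid, size, archived) tuples, least recently
                 used first; archived means a current archive copy exists
    Returns (to_purge, to_archive) lists of FIDs.
    """
    if used < high_water:
        return [], []  # nothing to do until the high water mark is hit
    to_purge, to_archive = [], []
    for fid, size, archived in lru_files:
        if used <= low_water:
            break  # enough space will be freed
        if archived:
            to_purge.append(fid)
            used -= size
        else:
            # Cannot purge yet: ask the coordinator to archive this file.
            to_archive.append(fid)
    return to_purge, to_archive
```

In the common case every candidate is already archived and to_archive stays
empty; it only fills up when the file system is really full and migration
has fallen behind.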

Policy - HSM side

A system like ADM will have interfaces to request action on the archived
objects, e.g. remove objects archived before 2002.  Lustre did not consider
such policies, as it intentionally does not want a tight coupling to any
particular HSM.  This decision remains fine.

An administrative interface is present in both systems to pre-stage files
based on a work list.  This is to populate the file system with restored
copies of all files required by certain jobs.  A language to express
pre-stage lists is desirable (including user friendly syntax to state
"restore all files in this directory").

Conclusions

1. There are a few issues with both architectures.  I will not speak for the
ADM project here. 
2. There is an excellent opportunity to couple Lustre's HSM with ADM.
3. There is an interim, much simpler HSM architecture for Lustre that can
work with ADM; see a separate future email to this list.
4. Lustre should address small file handling when not coupled to the ADM
policy manager. 
5. Lustre should define a policy language covering filtering and space
management.
6. Lustre should define a central way to dispatch policy (from the MGS)
7. Lustre should enable adaptive timeouts to assist with "restore on demand"
and "space management requires archiving" events.
8. Lustre should define release events for log entries and also manage log
entry cancellation in the presence of many consumers.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Lustre HSM v6.ppt
Type: application/octet-stream
Size: 965632 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20080716/4c884b27/attachment.obj>