[Lustre-devel] SAM-QFS, ADM, and Lustre HSM

Colin Ngam Colin.Ngam at Sun.COM
Mon Feb 2 06:56:15 PST 2009

Harriet G. Coverston wrote:

> Nathan,
> On Jan 30, 2009, at 6:21 PM, Nathaniel Rutman wrote:
>> LEIBOVICI Thomas wrote:
>>> At CEA, we are using our own copytool that directly uses HPSS API. 
>>> This already exists and has been in production for years.
>>> I think it will need only a few modifications to adapt it for 
>>> Lustre-HSM
>>> (basically, add fid <-> HSM id mapping and backup of attributes, 
>>> path, stripe...)
>> So then the QFS copytool will indeed be a new tool, and should be 
>> scheduled accordingly.
>> Features:
>> 1. "cp --preserve" like functionality (include metadata attributes in 
>> cp)
>> 2. add EAs (create mini-tarball)
>> 3. implement FID hash to subdivide namespace
>> 4. periodic status reporting (via ioctl on file)
>> Harriet G. Coverston wrote:
>>>> There is a mechanism to get the current full pathname for a given 
>>>> fid from userspace, so an HSM-specific copytool could find it out, 
>>>> but a central tenet of the design here is that as far as the HSM is 
>>>> concerned, the entire Lustre FS is a flat namespace of FIDs.
>>> Be careful here. We are a file system. We don't have a limit on # of 
>>> files in one directory, but we don't recommend more than 500,000 
>>> files in one single directory or you will start to see some 
>>> performance problems. You will have to create a tree, not use a flat 
>>> namespace.
>> Yes, a tree based on a hash of the fid.
>> The other option is to use the actual filename for storage, but from 
>> Lustre's point of view this gets extremely tricky.  For example:
>> Send /foo/bar to archive.  Client A opens /foo/bar.  Client B renames 
>> /foo/bar to /abc/xyz, but this change hasn't propagated to the 
>> archive yet.  Client A now tries to read its open file handle, which 
>> tells Lustre to read the offline file FID 123, which it translates to 
>> /abc/xyz currently, which the archive doesn't know about yet.  Not 
>> just xyz, but renames on any ancestor path element cause similar 
>> misses.  Since the FID remains constant throughout the life of a 
>> file, we don't have to worry about any namespace changes (file or 
>> parents).  If there was an alternate way of bypassing the archive's 
>> namespace to directly access a file, we could conceivably store e.g. 
>> an archive-specific identifier within the Lustre stripe EA, and pass 
>> this down to the copytool when reading an offline file, but this 
>> presupposes that such a thing exists, is of reasonable size, has a 
>> userspace method to access it, etc.
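
A minimal sketch of the "tree based on a hash of the fid" idea above, 
assuming the FID is handled in its printed string form and that two 
levels of two hex digits give enough fan-out. The function name, the 
SHA-1 choice and the directory layout are illustrative assumptions, not 
the Lustre-HSM or QFS copytool design:

```python
import hashlib
from pathlib import PurePosixPath


def fid_to_archive_path(fid: str, levels: int = 2, width: int = 2) -> PurePosixPath:
    """Map a FID string to a nested archive path (hypothetical layout).

    The directory components come from a hash of the FID, so files are
    spread over 16**(levels * width) leaf directories instead of one
    flat directory. The leaf name is the FID itself.
    """
    digest = hashlib.sha1(fid.encode("ascii")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(*parts, fid)


# The path depends only on the FID, never on the Lustre pathname.
example = fid_to_archive_path("0x200000400:0x1:0x0")
print(example)
```

Because the archive path is derived from the FID alone, renaming 
/foo/bar to /abc/xyz does not change it. With the defaults there are 
65,536 leaf directories, so each one stays far below the 500,000-file 
guideline until the archive holds tens of billions of files.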
> Yes, we have a FID-like concept in SAM-QFS. It is called the file ID. 
> It is 64 bits and consists of the inode/generation number. It is 
> unique. You can store it. You can issue an ioctl to open the ID. You
> can issue an ioctl to do an ID stat, etc. It is much more efficient 
> than using the filename (expensive lookup). This means if you store 
> and use the ID, you can cover the rename window and still be 
> guaranteed that you will get the right file. Note, we don't rearchive 
> on a rename.
I believe this facility exists only on the metadata server node and not 
on the Linux/Solaris clients.  Am I correct?
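
For reference, a rough sketch of how a 64-bit inode/generation ID of 
the kind Harriet describes could be carried as an opaque 8-byte value, 
for example inside an EA. The even 32/32 split, the byte order and all 
names here are assumptions for illustration. This is not the SAM-QFS 
API, and it does not show the open-by-ID or stat-by-ID ioctls:

```python
import struct
from typing import NamedTuple


class FileId(NamedTuple):
    """Hypothetical 64-bit file ID: inode number plus generation."""
    ino: int
    gen: int


def pack_file_id(fid: FileId) -> bytes:
    # Assumed layout: two unsigned 32-bit big-endian fields, 8 bytes total.
    return struct.pack(">II", fid.ino, fid.gen)


def unpack_file_id(raw: bytes) -> FileId:
    ino, gen = struct.unpack(">II", raw)
    return FileId(ino, gen)


original = FileId(ino=1234, gen=7)
stored = pack_file_id(original)           # value a copytool could keep
assert unpack_file_id(stored) == original

# A reused inode number gets a new generation, so the stored ID no
# longer matches the new file.
reused = FileId(ino=1234, gen=8)
assert pack_file_id(reused) != stored
```

The point of the generation field is that a stored ID stays tied to one 
file for its whole life, which is what covers the rename window.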


> I really think a replicated namespace will be much more intuitive and 
> will solve restore. If you prefer to build a tar container, that is 
> OK, too. The tar file can have a suffix so that you know it is tar 
> and you can untar it.
>>>> You can get a full pathname if you want to for catastrophe 
>>>> recovery, but Lustre itself will only speak to the HSM with FIDs.
>>>> As I said in the other email, although SAM-QFS can do name-based 
>>>> policies, the "name" as far as QFS is concerned is just the FID, 
>>>> so name-based policies at the copytool level are worthless, 
>>>> unless we a.) add the path/filename back to the file (EA, or use a 
>>>> tarball wrapper), and b.) modify the SAM policy engine to use the 
>>>> "real" path/filename instead of the FID.
>>> Currently, we don't support policy using EA (extended attributes are 
>>> in 5.0). We have had lots of requests for this, especially from our 
>>> digital preservation customers.
>> Ah, policy based on EAs would be the general case, yes.
> Yes, this would be a nice feature for us.
>    - Harriet
> Harriet G. Coverston
> Solaris, Storage Software             |  Email: harriet.coverston at sun.com
> Sun Microsystems, Inc.                          |  AT&T:  651-554-1515
> 1270 Eagan Industrial Rd., Suite 160       |  Fax:   651-554-1540
> Eagan, MN 55121-1231
