[Lustre-devel] Replication for NRL/NGA

Tue May 20 16:47:11 PDT 2008

Peter Braam wrote:
>> ps: Nathan, to build the changelog ZFS exploits a tree structure for
>> objects, directories and blocks; the tree structure allows the file
>> system to be searched fast for changes (log based). A missing element
>> is a fast object to path lookup. To get an approximation of the
>> metadata changelog, ZFS would use the difference on changed
>> directories at the beginning and ending snapshot (the tree structure
>> will help you to find pages that have seen insertions and removals;
>> this function would be called zapdiff).
>>     
>
> Hi Nathan -
>
>   
>> At first glance I am interpreting this very similar to the "zfs send"
>> output stream, but the format of the stream would be
>> 1. a fixed user API
>>     
>
> Hmm, don't understand this part.
>   
I meant a stream that a user can read/interpret, as opposed to a closed 
proprietary form that only "zfs receive" can understand.
>   
>> 2. include full path names (or enough info to generate full path names)
>> The stream would then be passed to a userland replicator (our current
>> replication plan, and not "zfs recv")
>>     
>
> Yes, including policy processing, like only syncing certain subtrees.
>
>   
>> Is that about right? So we're just moving the MDT changelog generating
>> part into ZFS
>>     
>
> Yup, but careful: this is a changeset (not an ordered log). With
> snapshots, though, you can turn it into some kind of log that performs the
> same changes.
>   
Right, it is a set of deltas between two snapshots, not a series of 
steps from A to B.  Once again, this makes things easier for us, because 
we don't care about intermediary states; we can just look up "original 
filename" and "final filename" for all changed objects.
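
As a rough sketch of that lookup (not anything ZFS provides today): assume
each snapshot can somehow be reduced to a map from object id to full path.
The function name, the dict layout and the change labels below are all
made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

# (kind, object id, original path, final path); all names are hypothetical.
Change = Tuple[str, int, Optional[str], Optional[str]]


def diff_snapshots(before: Dict[int, str], after: Dict[int, str]) -> List[Change]:
    """Return namespace deltas between two snapshots.

    Assumes each snapshot is already available as {object_id: full_path}.
    Only the start and end states are compared, so intermediate renames
    (a -> b -> c) collapse into a single delta (a -> c).
    """
    changes: List[Change] = []
    for obj_id in sorted(before.keys() | after.keys()):
        old = before.get(obj_id)
        new = after.get(obj_id)
        if old == new:
            continue  # path unchanged; data changes would need an mtime check
        if old is None:
            kind = "create"
        elif new is None:
            kind = "delete"
        else:
            kind = "rename"
        changes.append((kind, obj_id, old, new))
    return changes
```

Note this only covers namespace changes; a file whose data changed but whose
path did not would have to be picked up separately, e.g. by comparing mtimes.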
>> , and assuming data changes are reflected in mtime updates
>   
>> on the MDT's znodes (i.e. we still are only paying attention to the
>> MDTs, and not the OSTs).
>>     
>
> We use the same mechanism to make an OST change set.
>   
I'm not sure we ever got this straight between us: I was (am) planning 
on using the SOM feature to give me solid mtime data on the MDT, for any 
OST writes.  Thus I see no need to involve changelogs on the OSTs at 
all.  I just do an efficient copy (rsync) of my modified files list 
(from the MDT), and all is good.  (Yes, we could do a more efficient 
copy of only changed data blocks with the OST data, but is this worth 
the extra synchronization effort?)
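
A minimal sketch of the rsync step, assuming the modified files list has
already been extracted from the MDT as absolute paths within the filesystem.
The helper name and its arguments are hypothetical; the only rsync options
used are -a and --files-from.

```python
from typing import Iterable, List


def build_rsync_command(paths: Iterable[str], list_file: str,
                        src_root: str, dest: str) -> List[str]:
    """Write the modified-file list to list_file and return an rsync argv.

    rsync reads --files-from entries relative to the source directory, so
    leading slashes are stripped. Assumes no file name contains a newline.
    """
    with open(list_file, "w") as f:
        for path in paths:
            f.write(path.lstrip("/") + "\n")
    return ["rsync", "-a", f"--files-from={list_file}", src_root, dest]
```

The returned list could be handed to subprocess.run() as is; nothing here
actually runs rsync.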
>   
>> And for the efficient pathname generation, the plan would still be a
>> (fid,name,parent list) database on the MDT, or something new / ZFS
>> specific? I haven't really dug into ZFS much, but I assume we could go
>> back to the "store parent znode in file EAs, store dirname in dir EAs" idea.
>> The snapshots give us a way to avoid the dynamic "current path" issue,
>> so this would be a little easier.
>>     
>
> Jeff Bonwick has extremely clear ideas about how he wants to do this (email
> him and cc me, he'll explain, should he miss this line here).
>   
Looking forward to it.
>
>   
>> But a big question is are we delivering zfs-based Lustre this fall? Not
>> that I know anything about it, but aren't there licence problems with
>> zfs and Linux?
>>     
>
> My proposal is that we demo ZFS replication first and then put it in Lustre
> (and pNFS etc).
>   
I'm going to let Bryon sell that bridge.
> BTW, we discussed other exciting things, namely that ZFS can just do the
> rollback for CMD and that it can do metadata only snapshots to avoid
> consuming lots of free space with the snapshotting of data, 
although presumably we're doing small incremental snapshots and erasing 
them when done; shouldn't be too big in general.  I suppose we can 
always come up with a pathological case.
> and Jeff even
> came up with an idea to not snapshot at all but retain a few transactions to
> roll back to.
>