[Lustre-devel] WBC HLD outline

Alexander Zarochentsev Alexander.Zarochentsev at Sun.COM
Mon Mar 23 14:58:30 PDT 2009


Hello,

Here is a WBC HLD outline.
Please take a look.

===============================================
WBC HLD OUTLINE

* Definitions

WBC (MD WBC): (Meta Data) Write Back Cache.

MD operation: whole MD operation over an object:
 rename/create/open/close/setattr/getattr/link/unlink/mkdir/rmdir +
 readdir.

Reintegration: The process of applying accumulated MD operations to the
MD servers.

MDS/RAW: an MDS API extension for "raw" fs operations, such as inserting
a dir entry w/o creating an inode.

MD update: a part of MD operation to be executed on one server,
contains one or more MDS/RAW operations.

MD batch: a collection of per-server MD updates.

MDTR: MD translator: translates MD operations into MDS/RAW ones.
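
The terms above nest inside each other, which can be pictured as plain
records. This is only a minimal Python sketch to illustrate the
definitions; the class names, field names and operation kinds (RawOp,
MDUpdate, MDBatch, "insert_entry", "create_inode") are hypothetical and
not part of the design.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RawOp:
    """One MDS/RAW operation, e.g. inserting a dir entry w/o creating an inode."""
    kind: str                      # hypothetical kinds: "insert_entry", "create_inode"
    target: str                    # object the operation applies to
    args: dict = field(default_factory=dict)


@dataclass
class MDUpdate:
    """The part of one MD operation that executes on a single server."""
    server: str
    raw_ops: List[RawOp]


@dataclass
class MDBatch:
    """A collection of per-server MD updates."""
    updates: Dict[str, List[MDUpdate]] = field(default_factory=dict)

    def add(self, update: MDUpdate) -> None:
        self.updates.setdefault(update.server, []).append(update)


# One "create" MD operation split into two updates on two servers.
batch = MDBatch()
batch.add(MDUpdate("MDS1", [RawOp("insert_entry", "dir", {"name": "f"})]))
batch.add(MDUpdate("MDS2", [RawOp("create_inode", "f")]))
```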
 
* Requirements

 Client application is able to create 64k files/second.

 Reintegration moves the fs from one consistent state to another
 consistent state.

 Non-WBC client support w/o visible overhead.

 Avoid MDS code rewrite if possible.

* Design outline

** Overall picture

[Application]
    |
=syscalls=
    |
    V
  [VFS]
    |
=vfs hooks=
    |
    V
[LLITE/MDC]
    |
 =MD (non-WBC) proto=
    |
    V
[MD CACHE MANAGER] ---> [LDLM]
    |
    V
 [MDTR]
   +-----------+----------+
   |           |          |
  =======WBC proto==========
   |           |          |
   V           V          V
[MDS1/RAW] [MDS2/RAW] [MDS3/RAW]

** WBC 

A WBC client has an MDTR running on the client side.
It can also be a proxy server, acting as a server for
non-WBC clients and as a client for MD servers.

*** WBC vs non-WBC

When processing an MD operation request (lock enqueue + op intent, as
Alex suggested), the MD server may decide to execute it itself, or to
grant only a lock (a subtree one) and allow the client to continue in
WBC mode.
 
*** Locks

The needed LDLM locks are taken before the operation starts and held
until the corresponding batch is re-integrated.

*** Local cache management

WBC client executes operations locally, modifying local in-memory
objects. WBC client has a (redo-)log of all operations.

The cache manager controls the process of MD cache re-integration.
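
As an illustration of "execute locally and log", here is a minimal
Python sketch. It assumes a cache that is just a dict of directories and
a redo-log that is a list of tuples; LocalCache and its methods are
hypothetical names, not an existing Lustre interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LocalCache:
    """In-memory directory state plus a redo-log of the operations applied to it."""
    dirs: Dict[str, Dict[str, str]] = field(default_factory=dict)
    redo_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def create(self, parent: str, name: str) -> None:
        entries = self.dirs.setdefault(parent, {})
        if name in entries:
            raise FileExistsError(name)
        entries[name] = "file"                           # modify the local object
        self.redo_log.append(("create", parent, name))   # then record the operation

    def unlink(self, parent: str, name: str) -> None:
        del self.dirs[parent][name]                      # KeyError if the entry is missing
        self.redo_log.append(("unlink", parent, name))


cache = LocalCache()
cache.create("/d", "a")
cache.create("/d", "b")
cache.unlink("/d", "a")
```

A failed operation raises before anything is appended, so the log only
holds operations that succeeded against the local state.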

*** MDS/RAW operations

Managing directory entries and inodes, without maintaining
fs consistency automatically.

create/update/delete methods for directory entries and inodes.

*** MDTR

MDTR is responsible for converting MD operations into a set of
per-server MDS/RAW operations.
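
A possible shape for this translation, as a hedged Python sketch: the
raw operation names and the server_of placement function are assumptions
made for the example, and only create and unlink are covered.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

RawOp = Tuple[str, str, str]   # (kind, object, name); kinds are hypothetical


def translate(op: Tuple[str, str, str],
              server_of: Callable[[str], str]) -> Dict[str, List[RawOp]]:
    """Split one MD operation into per-server lists of raw operations."""
    kind, parent, name = op
    path = parent.rstrip("/") + "/" + name
    out: Dict[str, List[RawOp]] = defaultdict(list)
    if kind == "create":
        out[server_of(path)].append(("create_inode", path, name))
        out[server_of(parent)].append(("insert_entry", parent, name))
    elif kind == "unlink":
        out[server_of(parent)].append(("delete_entry", parent, name))
        out[server_of(path)].append(("delete_inode", path, name))
    else:
        raise ValueError("unsupported MD operation: " + kind)
    return dict(out)


# Assumed placement: the directory lives on MDS1, the new inode on MDS2.
placement = {"/d": "MDS1", "/d/f": "MDS2"}.get
updates = translate(("create", "/d", "f"), placement)
```

If both objects live on the same server, the two raw operations end up
in a single per-server list.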

*** Client re-integration

Periodically, or because of (sub-)lock release, dirty memory flushing
or similar, the WBC client submits batches to all MD servers involved
in the operations.

The process of re-integration is protected by LDLM locks. MD servers
are updated using the WBC protocol.

*** WBC protocol

A WBC request contains a set of MDS/RAW operations, tagged with one
epoch number.  Bulk transfers are used.
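
To make the "one epoch per request" idea concrete, here is a small
Python sketch. WBCRequest and build_requests are hypothetical names; the
wire format and the bulk transfer itself are not modelled.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

RawOp = Tuple[str, str, str]   # (kind, object, name); hypothetical layout


@dataclass(frozen=True)
class WBCRequest:
    server: str
    epoch: int
    ops: Tuple[RawOp, ...]


def build_requests(batch: Dict[str, List[RawOp]], epoch: int) -> List[WBCRequest]:
    """One request per server; every request of the batch carries the same epoch."""
    return [WBCRequest(server, epoch, tuple(ops))
            for server, ops in sorted(batch.items())]


requests = build_requests(
    {"MDS2": [("create_inode", "/d/f", "f")],
     "MDS1": [("insert_entry", "/d", "f")]},
    epoch=7,
)
```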

*** File data

Flushing file data to the OST servers is delayed until file creation
is re-integrated.

*** Recovery

The redo-log is preserved until it is no longer needed for recovery
(i.e. until its epoch becomes stable).

The client replays the log and re-executes all operations from it,
repeating MDTR processing (dispatching the operations between MD
servers).
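
The log lifetime rule can be sketched as follows. This Python example
assumes each log record carries the epoch it was submitted in and that
the client learns the last stable epoch from the servers; both are
assumptions for the illustration, as are the function names.

```python
from typing import List, Tuple

LogRecord = Tuple[int, str]   # (epoch, operation description)


def prune(log: List[LogRecord], stable_epoch: int) -> List[LogRecord]:
    """Drop records whose epoch is already stable; recovery no longer needs them."""
    return [rec for rec in log if rec[0] > stable_epoch]


def replay(log: List[LogRecord], stable_epoch: int) -> List[str]:
    """Return the operations to re-execute through MDTR, in their original order."""
    return [op for _epoch, op in prune(log, stable_epoch)]


log = [(1, "create a"), (2, "create b"), (3, "unlink a")]
to_redo = replay(log, stable_epoch=1)
```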

**** WBC client eviction, uncompleted updates

If a client dies before re-integration is completed, there are three
choices:

a) Cluster-wide rollback: all servers roll back to the last globally
stable epoch, then clients replay their redo-logs.

This scenario should be avoided because a single client failure may
stop the whole cluster for recovery.

b) All servers participating in re-integration coordinate to undo
uncompleted updates.

c) The servers have all information needed to complete re-integration
w/o client.

The recovery strategy is the subject of the CMD Recovery Design
document, but option (c) would need support in the WBC protocol.

** non-WBC

*** MD protocol

MD (non-WBC) protocol remains the same as now.

** Use cases

*** WBC / non-WBC decision

1. Check whether server and client can operate in WBC-mode through
connect flags.

2. If they can, a lock enqueue request may contain a request for
WBC-mode, and the server may respond by granting WBC-mode and an STL or
PW lock on the directory. The MD server accepts or rejects the WBC-mode
request depending on server rules and per-object access statistics.
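
The two steps above might look like this in a Python sketch. The flag
CONNECT_WBC is a made-up constant (not a real Lustre connect flag), and
"number of other clients that recently used the object" is just one
assumed example of a per-object access statistic.

```python
CONNECT_WBC = 0x1   # hypothetical connect flag, not a real Lustre constant


def wbc_negotiated(client_flags: int, server_flags: int) -> bool:
    """Step 1: both sides must advertise WBC support at connect time."""
    return bool(client_flags & server_flags & CONNECT_WBC)


def handle_enqueue(wants_wbc: bool, negotiated: bool,
                   recent_other_clients: int, max_sharers: int = 0) -> str:
    """Step 2: return "wbc" to grant the lock with WBC-mode,
    or "server" to execute the operation on the MD server."""
    if wants_wbc and negotiated and recent_other_clients <= max_sharers:
        return "wbc"
    return "server"


ok = wbc_negotiated(CONNECT_WBC, CONNECT_WBC)
private_dir = handle_enqueue(True, ok, recent_other_clients=0)
shared_dir = handle_enqueue(True, ok, recent_other_clients=3)
```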

*** File creation

client gets a PW lock on directory.

client fetches directory content.

client does file creation locally, in cache, the operation record is
added to the client redo-log.

Another client wants to read the directory; the lock conflict triggers
re-integration.

MD Cache manager processes the redo-log, prepares batches with MDS/RAW
operations and submits them to the MD servers.

The MD servers integrate the batches.

MD Cache manager frees local cache content and cancels the directory
lock.
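
The sequence above can be walked through with a toy single-server model.
This is a rough Python sketch under simplifying assumptions: one MD
server, one directory, a boolean standing in for the PW lock, and
hypothetical class and method names throughout.

```python
from typing import List, Set, Tuple


class Server:
    def __init__(self) -> None:
        self.entries: Set[str] = set()

    def integrate(self, ops: List[Tuple[str, str]]) -> None:
        for kind, name in ops:
            if kind == "insert_entry":
                self.entries.add(name)


class WBCClient:
    def __init__(self, server: Server) -> None:
        self.server = server
        self.has_lock = False
        self.cache: Set[str] = set()
        self.redo_log: List[Tuple[str, str]] = []

    def lock_and_fetch(self) -> None:
        self.has_lock = True                      # PW lock on the directory
        self.cache = set(self.server.entries)     # fetch directory content

    def create(self, name: str) -> None:
        if not self.has_lock:
            raise RuntimeError("PW lock must be held")
        self.cache.add(name)                      # local creation, in cache
        self.redo_log.append(("insert_entry", name))

    def on_lock_conflict(self) -> None:
        self.server.integrate(self.redo_log)      # submit the batch
        self.redo_log.clear()
        self.cache.clear()                        # free local cache content
        self.has_lock = False                     # cancel the directory lock


def readdir(server: Server, holder: WBCClient) -> List[str]:
    """Another client reads the directory; a held lock forces re-integration."""
    if holder.has_lock:
        holder.on_lock_conflict()
    return sorted(server.entries)


server = Server()
client = WBCClient(server)
client.lock_and_fetch()
client.create("a")
client.create("b")
before = set(server.entries)       # nothing has reached the server yet
listing = readdir(server, client)  # conflict triggers re-integration
```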

** Questions

Q: Can several WBC clients work in one directory simultaneously?
A: If extent locks for directories are implemented, each WBC client
   can take a lock on a hash interval.

Q: Can WBC clients do massive file creation in one directory
   efficiently?
A: One idea that may help: if we can guess that the file names created
   by a client are lexicographically ordered, a special hash function
   may reduce lock conflicts between clients holding locks on
   directory extents.
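
One possible reading of the "special hash function" idea, as a Python
sketch: an order-preserving hash keeps a client's lexicographically
close names inside one hash extent, while an ordinary scattering hash
spreads them over many. The 32-bit hash space, the 16 equal extents and
both hash functions are assumptions chosen for the illustration.

```python
import zlib

HASH_BITS = 32


def ordered_hash(name: str) -> int:
    """Order-preserving (non-decreasing in the name): the first 4 bytes
    of the name, read as a big-endian integer."""
    return int.from_bytes(name.encode()[:4].ljust(4, b"\0"), "big")


def scattering_hash(name: str) -> int:
    """An ordinary hash with no ordering property (CRC32 here)."""
    return zlib.crc32(name.encode())


def extents_touched(names, hash_fn, n_extents: int = 16) -> set:
    """Indices of the equal-width hash extents that the names fall into."""
    width = (1 << HASH_BITS) // n_extents
    return {hash_fn(n) // width for n in names}


names = [f"a{i:03d}" for i in range(100)]           # "a000" .. "a099"
with_ordered = extents_touched(names, ordered_hash)
with_crc = extents_touched(names, scattering_hash)
```

With the order-preserving hash this client needs a lock on a single
extent; with the scattering hash it needs several, which is where
conflicts with other clients would come from.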

Thanks,
-- 
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems


