[Lustre-devel] layout lock / extent lock interaction

Fri Mar 6 14:59:58 PST 2009

Andreas Dilger wrote:
> On Mar 06, 2009  10:01 -0800, Nathaniel Rutman wrote:
>   
>> I think we need to explicitly list the extent / layout lock interactions  
>> so we don't miss anything in the implementation:
>> 1. Create
>>
>>    * MDT generates new layout lock at open
>>    * client gets Concurrent Read (CR) layout lock
>>    * client can get new extent read/write locks as long as it holds
>>      the CR layout lock
>>
>> 2. Layout change
>>
>>    * MDT takes PW layout lock, revoking all client CR locks
>>    * in parallel, MDT takes PW lock on all extents on all OSTs for this
>>      file
>>    * Clients drop layout lock and requeue
>>    * Clients flush cache and drop their extent locks
>>    * MDT changes layout
>>    * MDT releases layout lock and extent locks
>>    * Clients get CR layout lock and can now requeue their extent locks
>>
>> 3. Client / MDT network partition
>>
>>    * client can continue reading/writing to currently held extents
>>    * when client determines it has been disconnected from MDT it drops
>>      layout lock
>>    * client can't get new extent locks, but can continue writing to
>>      currently held extents
>>    * if MDT changes layout, it first PW locks all extents, causing OSTs
>>      to revoke client's extent locks
>>    * Client must requeue layout lock before requeueing extent locks
>>
>>    What if the client hasn't noticed it's been disconnected from the MDT
>>    by the time it tries to requeue extent locks?  It doesn't know that
>>    the layout lock it's holding is invalid...
>>     
>
> That is a thorny problem.  I'll go through several partial solutions
> and see why they do not work, then hopefully arrive at a safe solution
> at the end.
>
> One possibility is that the AST sent to the clients during the extent lock
> revocation would contain a flag that indicates "the layout is changing"
> (similar to the truncate/discard data flag), so the clients get notified
> even if disconnected from the MDS.  It still isn't enough, however,
> as the clients will only get this AST if they currently hold an extent
> lock, which isn't always the case.
>   
How about if we introduce the concept of a layout generation?  The 
generation is stored in the layout and also with each OST object.  When 
the MDT takes the extent locks it sends the new generation to the OSTs.  
Clients send the layout generation along with any extent lock enqueue.  
The OSTs only grant extents to clients that match the current 
generation.  Maybe "match or exceed", in case an OST dies before the new 
gen can be recorded.  The OST would also raise its gen to the latest seen 
whenever any (MDT or client) extent lock is enqueued.
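A minimal sketch of the OST-side check, in Python for brevity.  OstObject, 
layout_gen and enqueue_extent_lock are made-up names for illustration, not 
existing Lustre code, and the "match or exceed" rule is assumed:

```python
from dataclasses import dataclass


@dataclass
class OstObject:
    """Hypothetical per-object state an OST would keep."""
    layout_gen: int = 0


def enqueue_extent_lock(obj: OstObject, request_gen: int,
                        from_mdt: bool = False) -> bool:
    """Return True if the extent lock may be granted."""
    if from_mdt:
        # The MDT carries the new generation when it takes its PW
        # extent locks ahead of a layout change.
        obj.layout_gen = max(obj.layout_gen, request_gen)
        return True
    if request_gen < obj.layout_gen:
        # Stale layout: the client must re-fetch the layout lock first.
        return False
    # "Match or exceed": the OST may have missed a generation update
    # (e.g. it died before recording it), so catch up to the client.
    obj.layout_gen = max(obj.layout_gen, request_gen)
    return True
```

A client that missed the MDT eviction still presents the old generation, 
so its enqueue is refused without the OST having to talk to the MDT.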
> A second option is that, if a client holding a layout lock is evicted AND
> the layout is being changed, the MDS can't release the extent locks
> until at least one ping interval has passed (assuming any still-alive
> client would have detected this and tried reconnecting).  This is also not
> 100% safe because the client might have been evicted moments earlier due
> to some other lock, and the "wait for one ping interval" heuristic would
> no longer apply.
>
> We cannot depend on the layout change being so drastic that the objects
> no longer exist to be written to (CROW issues aside).  If we are changing
> the layout to add a mirror, that wouldn't help, and we would now have
> inconsistent data on each half of the mirror.
>
> Another option is something like "imperative eviction" so that clients
> being evicted are actively told they are being evicted, but that has
> the issue that the "you are evicted" RPC will normally be sent to a node
> which is already dead, slowing down the MDS and/or blocking all of its
> LNET credits, so it isn't really a usable option.
>
>
> A safe option (AFAICS) is to have MDS eviction force OST eviction (via
> obd_set_info_async(EVICT_BY_NID)).  That would also resolve some other
> recovery problems, but might be overly drastic if e.g. the client is
> being evicted from the MDS due to router failure or simple network
> partition.  Having a proper health network and also server-side RPC
> resending would help avoid such problems.
>   
This is drastic, but on the other hand we only need to do this if the 
layout is being changed.  Of course, since eviction would happen before 
layout change we would need to remember who was evicted and hasn't 
reconnected...
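Roughly, the bookkeeping could look like the sketch below (Python for 
brevity).  EvictionTracker and its methods are hypothetical names, and 
evict_on_osts stands in for whatever would actually push the eviction to 
the OSTs:

```python
class EvictionTracker:
    """Hypothetical MDT-side record of evicted, not-yet-reconnected clients."""

    def __init__(self):
        self._pending = set()

    def on_evict(self, nid):
        # Remember the client at MDT eviction time.
        self._pending.add(nid)

    def on_reconnect(self, nid):
        # A reconnecting client knows its old layout lock is gone.
        self._pending.discard(nid)

    def before_layout_change(self, evict_on_osts):
        """Force OST eviction for every client still pending."""
        victims = sorted(self._pending)
        for nid in victims:
            evict_on_osts(nid)
        # Assumption: entries stay until the client reconnects to the MDT.
        return victims
```

This keeps OST eviction off the normal path: it only happens for clients 
that are still unaccounted for when a layout change actually starts.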
> This is one of the main reasons why having DLM servers on one node
> controlling resources on another node is a bad idea.  We had similar
> issues in the past when we locked all objects via the OST only on
> stripe index 0, and we might have similar problems with subtree locks
> in the future with CMD or any SNS RAID that is only locking a subset
> of objects.
>