[Lustre-discuss] Extent Based Locking Implementation

Andreas Dilger adilger at sun.com
Wed Nov 25 14:16:37 PST 2009


On 2009-11-19, at 14:49, Arifa Nisar wrote:
> I have a question regarding the implementation of server-based extent
> locking in Lustre. I have a situation where two processes are
> concurrently accessing one I/O server, writing one stripe at a
> time. Both processes are writing alternate stripes stored on
> that server. I want to understand how the extent-based locking
> protocol will work in this situation.
>
> I understand the first process will be given locks on all the
> stripes. What will happen when the second process sends a lock
> request? Will the I/O server revoke all of the (unused/unrequested)
> locks from process P0, or will it revoke only the locks on the
> required stripe(s)?

Partly it depends on how large the regions S1 and S2 are, and whether  
they reside on the same OST or not.

> Please explain what happens if P0 and P1 request locks on stripes
> S0 … S8 in this order.
>
> P0    S0
> P1    S1
> P0    S2
> P1    S3
> P0    S4
> P1    S5
> P0    S6
> P1    S7
> P0    S8

For example, if the stripe size = 1MB (so the S_even stripes are on
OST0 and the S_odd stripes are on OST1), and the IO size is also 1MB
from each client, then P0 will get an exclusive lock on OST0's object
and P1 will get an exclusive lock on OST1's object, and there is no
contention.
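
To make that concrete, here is a small sketch (plain Python, not
Lustre code; the names like ost_for_stripe are mine) of the
round-robin RAID-0 stripe-to-OST mapping assumed in this example:

# Illustrative sketch (not Lustre code): map the 1MB writes from
# P0/P1 onto OST objects, assuming stripe_size = 1MB and
# stripe_count = 2 as in the example above.

STRIPE_SIZE = 1 << 20   # 1MB
STRIPE_COUNT = 2        # one object each on OST0 and OST1

def ost_for_stripe(stripe_index):
    """RAID-0 round-robin: even stripes -> OST0, odd stripes -> OST1."""
    return stripe_index % STRIPE_COUNT

# P0 writes the even stripes, P1 the odd ones (per the table above).
for stripe in range(9):
    writer = "P0" if stripe % 2 == 0 else "P1"
    print(f"S{stripe}: {writer} -> OST{ost_for_stripe(stripe)}")

# Every P0 write lands on OST0's object and every P1 write on OST1's
# object, so each client can hold one exclusive whole-object lock.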

Note that the Lustre DLM locks are held by nodes and not processes.   
If P0 and P1 are on the same node, then that node will get all of the  
locks and there is also no contention (writes are serialized by the  
local kernel inode->i_mutex).

Now, with those cases aside, an interesting situation arises when
only a single stripe is involved (or there are more processes than
stripes, or the IO is not stripe-aligned) and the processes are on
two different client nodes.

In that case, the extent lock will only be grown to cover the largest
uncontended extent on the object.  Unfortunately, with 2 nodes
contending, the lock holder will only hold a "lower" extent, which
still means that the next lock requester will get the "higher"
extent, all the way up to ~0ULL.
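
Here is a rough sketch of that growth behaviour under my own
simplified model of the DLM (this is not the actual ldlm code;
grow_extent and its exact semantics are illustrative only):

# Rough illustration (my simplification) of growing an extent-lock
# request to the largest uncontended extent on the object.
# Extents are inclusive [start, end] byte ranges.

END_OF_OBJECT = 2**64 - 1  # ~0ULL

def grow_extent(requested, granted_extents):
    """Grow `requested` until it touches a conflicting granted extent."""
    start, end = requested
    lo, hi = 0, END_OF_OBJECT
    for g_start, g_end in granted_extents:
        if g_end < start:
            lo = max(lo, g_end + 1)     # conflicting lock below us
        elif g_start > end:
            hi = min(hi, g_start - 1)   # conflicting lock above us
    return (lo, hi)

# Node A asks for [0, 1MB) on an unlocked object: grown to the whole object.
a = grow_extent((0, (1 << 20) - 1), [])
print(a)  # (0, END_OF_OBJECT)

# Node A's lock is called back and shrunk to its "lower" 1MB extent;
# node B's request for the next MB is then grown upward to ~0ULL,
# which is what sets up the ping-pong.
a = (0, (1 << 20) - 1)
b = grow_extent(((1 << 20), (2 << 20) - 1), [a])
print(b)  # (1048576, END_OF_OBJECT) -> the "higher" extent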

We've discussed changing this behaviour at times, so that the server
accumulates the number of conflicts over some short window, allowing
it to detect the 2-node ping-pong case and not bounce the lock back
and forth.

> Does the algorithm remain the same if the number of processes
> increases beyond two?

No, if there are more clients contending for the lock, the heuristic
also changes.  In the case of > 4 clients contending for locks, the
lock will not be grown downward, only upward.  With > 32 clients
contending for the lock, locks will not be grown to more than 32MB
in size (if the lock request is smaller than 32MB).
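
As a sketch of how those thresholds might combine with the extent
growth above (again my simplification, not the real server code; the
function name and shape are assumptions):

# Sketch (my reading of the heuristic described above): with > 4
# contending clients the lock is not grown downward, and with > 32
# contending clients the grown lock is capped at 32MB.

MB = 1 << 20

def grow_with_contention(requested, num_contending, lo, hi):
    """Apply the contention heuristics to an already-grown (lo, hi)."""
    start, end = requested
    if num_contending > 4:
        lo = start                          # no downward growth
    if num_contending > 32 and (end - start + 1) < 32 * MB:
        hi = min(hi, start + 32 * MB - 1)   # cap grown lock at 32MB
    return (lo, hi)

print(grow_with_contention((10 * MB, 11 * MB - 1), 5, 0, 2**64 - 1))
# -> lower bound pinned at the request start; upper bound still grows
print(grow_with_contention((10 * MB, 11 * MB - 1), 40, 0, 2**64 - 1))
# -> (10MB, 42MB - 1): grown extent capped at 32MB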

Also, if a lock is highly contended it is possible to force the
clients into "nolock" mode, so that the OST does the locking on
behalf of the clients, in order to avoid lock ping-pong.  The
tunables for this are:

/proc/fs/lustre/ldlm/{OST}/contended_locks
- number of lock conflicts before a lock is considered contended (default 4)
/proc/fs/lustre/ldlm/{OST}/contention_seconds
- seconds to stay in the contended state before returning to normal locking (default 2s)
/proc/fs/lustre/ldlm/{OST}/max_nolock_bytes
- largest lock enqueue to return conflicts on (default = 0 = off)
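
For example, a minimal (hypothetical) snippet to inspect and set
these tunables through the /proc files above; the OST namespace name
"OST0000" is a placeholder for whatever directory actually appears
under /proc/fs/lustre/ldlm/ on your OSS, and writing requires root:

# Minimal sketch: read and set the contention tunables via the /proc
# files listed above.  "OST0000" below is a placeholder name.
import glob

# Show the current values for every OST namespace present.
for path in glob.glob("/proc/fs/lustre/ldlm/*/contended_locks"):
    with open(path) as f:
        print(path, "=", f.read().strip())

# Enable server-side ("nolock") handling for enqueues up to 1MB.
with open("/proc/fs/lustre/ldlm/OST0000/max_nolock_bytes", "w") as f:
    f.write(str(1 << 20))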

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



