[Lustre-devel] Completion callbacks

Peter Braam Peter.Braam at Sun.COM
Tue Aug 12 20:24:33 PDT 2008


...snip....
> 2. EQ callbacks.
> 
> EQ callbacks are currently serialised by the single global LNET lock.
> Holding the lock for the duration of the callback greatly simplifies
> the LNET event handling code, but unfortunately is open to abuse by
> the handler itself (see below).
> 
> The change we're considering is to use one lock per EQ so that we
> get better concurrency by using many EQs.  This avoids complicating
> the existing EQ locking code, but it does require Lustre changes.
> However, making Lustre use a pool of EQs (say 1 per CPU) should be a
> very simple and self-contained change.
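
To make the proposal above concrete, here is a minimal user-space sketch
of one lock per EQ with a small pool of EQs indexed by CPU. Every name in
it (sketch_eq, eq_pool_init, eq_for_cpu, eq_deliver, EQ_POOL_SIZE) is
hypothetical and is not the LNET API; pthread mutexes stand in for
whatever locking primitive the real code would use, and the pool size of
4 is an arbitrary stand-in for "one per CPU".

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define EQ_POOL_SIZE 4   /* assumption: stand-in for "one EQ per CPU" */

/* Hypothetical event type; not the LNET event structure. */
struct sketch_event {
    int type;
};

/* Hypothetical event queue carrying its own lock. */
struct sketch_eq {
    pthread_mutex_t lock;       /* serialises callbacks on this EQ only */
    void (*callback)(struct sketch_event *ev, void *arg);
    void *arg;
    unsigned long delivered;    /* events delivered through this EQ */
};

static struct sketch_eq eq_pool[EQ_POOL_SIZE];

/* Initialise every EQ in the pool with the same handler. */
static void eq_pool_init(void (*cb)(struct sketch_event *, void *), void *arg)
{
    for (size_t i = 0; i < EQ_POOL_SIZE; i++) {
        pthread_mutex_init(&eq_pool[i].lock, NULL);
        eq_pool[i].callback = cb;
        eq_pool[i].arg = arg;
        eq_pool[i].delivered = 0;
    }
}

/* Map a CPU index onto an EQ in the pool. */
static struct sketch_eq *eq_for_cpu(unsigned int cpu)
{
    return &eq_pool[cpu % EQ_POOL_SIZE];
}

/*
 * Deliver one event. Only the chosen EQ's lock is held while the
 * callback runs, so deliveries on different EQs do not serialise
 * against each other.
 */
static void eq_deliver(struct sketch_eq *eq, struct sketch_event *ev)
{
    assert(eq != NULL && ev != NULL);
    pthread_mutex_lock(&eq->lock);
    eq->delivered++;
    if (eq->callback != NULL)
        eq->callback(ev, eq->arg);
    pthread_mutex_unlock(&eq->lock);
}
```

In this sketch the caller picks the EQ explicitly through eq_for_cpu().
The same selection could equally be made inside the delivery path, so
that the pool is not visible to the caller at all.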

This doesn't sound so attractive.  Isn't it possible to hide this under the
LNET API?

Peter


> I'd appreciate any feedback on these suggested changes.
> 
> --
> 
> While I'm talking about EQ callbacks, I _do_ think there is still a
> need to restructure some of Lustre's event handlers.  EQ callbacks are
> meant to provide notification and nothing else.  Originally they could
> even be called in interrupt context, so all you are supposed to do in
> them is update status and schedule anything significant for later.
> Nowadays, EQ callbacks are only called in thread context, but the
> general guidance on doing very little apart from notification remains.
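
The "update status and schedule anything significant for later" rule
could look roughly like the sketch below. It is only an illustration:
sketch_req, work_queue, on_send_complete and work_drain are hypothetical
names, not ptlrpc or LNET code, and the queue is deliberately
single-threaded. A real version would need a lock around the queue and a
way to wake the thread that drains it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request state; not the ptlrpc structures. */
struct sketch_req {
    int sent;                    /* status flag set by the callback */
    int cleaned_up;              /* set later by the deferred work */
    struct sketch_req *next;     /* link in the deferred-work list */
};

/* Singly linked FIFO of requests awaiting heavy processing. */
struct work_queue {
    struct sketch_req *head;
    struct sketch_req *tail;
};

/* Append a request to the tail of the queue. */
static void work_enqueue(struct work_queue *q, struct sketch_req *req)
{
    req->next = NULL;
    if (q->tail != NULL)
        q->tail->next = req;
    else
        q->head = req;
    q->tail = req;
}

/* Event callback: record status, hand the request off, return. */
static void on_send_complete(struct sketch_req *req, struct work_queue *q)
{
    assert(req != NULL && q != NULL);
    req->sent = 1;
    work_enqueue(q, req);
}

/*
 * Meant to run later, outside the callback. Does the expensive part
 * for every queued request and returns how many were processed.
 */
static int work_drain(struct work_queue *q)
{
    int n = 0;

    while (q->head != NULL) {
        struct sketch_req *req = q->head;

        q->head = req->next;
        if (q->head == NULL)
            q->tail = NULL;
        req->next = NULL;
        req->cleaned_up = 1;     /* stand-in for expensive cleanup */
        n++;
    }
    return n;
}
```

The callback's cost is a flag update and a list append regardless of how
expensive the cleanup is, which is the property the guidance is after.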
> 
> Except things are never quite as black-and-white as that.  For
> example, request_out_callback() has always called
> ptlrpc_req_finished(), and whereas this usually just decrements a
> refcount and maybe frees the request, during shutdown it can
> actually recurse into a complete import cleanup.
> 
> This blatant flouting of the EQ callback rules has never been fixed
> since it didn't actually break anything and didn't hurt performance
> during normal operation.  However, problems like this can be compounded
> as new features are developed (e.g. additional code in these callbacks
> to support secure ptlrpc).  So I think it's time to review what can
> happen inside the lustre event handlers and consider what might need
> to be restructured.  Even with improved EQ handler concurrency, you're
> still tying down an LNET thread for the duration of the EQ callback
> with possible unforeseen consequences.
> 
> For example, some LNDs use a pool of worker threads - one thread per
> CPU.  If the LND assigns particular connections to particular worker
> threads (e.g. socklnd), none of these connections can make progress
> while the worker thread is executing the event callback handler.
> 
>     Cheers,
>               Eric
> 
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel




