[Lustre-devel] NRS HLD

Thu Jul 7 07:34:02 PDT 2011

Hi Liang,

Thank you for sharing these great ideas; I think the task lends itself
well to adding different schemes that performance can benefit from. To
answer an earlier question, we unfortunately don't have any numbers or
have performed any tests related to the NRS task. We'll obviously
publish any such data as soon as we get our first policy developed, and
under test.

Both Nathan and I think you raise a very good point by highlighting the
need for a scalable data structure that can sort elements in libcfs, as
different policies could probably make use of such a facility, and may
actually require it if they need to address scalability-related issues
(there's an older, related topic
http://lists.lustre.org/pipermail/lustre-devel/2008-January/002230.html
which I'm sure you're already aware of).

I think there are some holes in my understanding of the example you gave
on using the binary heap in conjunction with GRN/CRN/RRN to implement a
CBRR policy. Is the GRN incremented with every incoming request? And how
can two requests end up with the same RRN? I'm sure the logic makes
sense, this is probably me just being daft :-).

Can I ask if there is a binary heap implementation living in a branch
somewhere, or you plan on landing one on master at some point? There is
some code attached on the thread I have pasted above, but I have not had
a look through it; going through the thread posts, it seems to be in
working and well-performing state.

For the first policy we aim to develop ("obj_extents" in the HLD)
although OBRR-like, we are not looking at using any data structures that
are out of the norm, at present. The idea is that we 
pick up requests from the service in-queue and add requests in
number-limited (or perhaps size-limited) per-object groupings, sorted by
offset, which are either populated in-place in the normal service
request queue, or use the normal service request queue as an
intermediate queue (just to avoid much processing in what is now the
pre-processing stage for RPCs), before they are added to the sorted
queue, from where they are eventually dispatched. We are looking at
using a libcfs hash for resolving incoming requests to the appropriate
object grouping, much like the OBRR implementation in the prototype NRS
patch does; I think this method will naturally avoid any request
starvation issues, but I am not sure it will perform as well as an OBRR
implementation using a heap. Hope the description above makes some
sense; if you see any problems with it, then please do let us know.

The NRS architectural specification at
http://wiki.lustre.org/index.php/Architecture_-_Network_Request_Scheduler has references to additional aspects to the task. Are you looking at providing some form of POP capability, or reproducing offset stream ordering, at present? These both sound like features that perhaps the framework itself could benefit from, at least in part irrespective of policy, I think. We seem to have at least one use case where we could do with clients passing additional information with each RPC, in QoS, so it seems like resource control, and catering for offset stream consistency may add additional ones.

Regards,
Nikitas

On Tue, 2011-07-05 at 17:20 +0800, Liang Zhen wrote: 
> Hi Nathan,
> 
> We also have some ideas about infrastructure of NRS and would like to share
> them at here for discussion:
> 
> - a common library to support inserting/removing prioritized requests, which 
>   should be scalable even for millions of requests
> 
> - each policy only needs to provide:
>   . private data ("heap", more details below)
>   . a sorting callback to prioritize requests
> 
> By this way, most common code can be put into libcfs and ptlrpc modules,
> instead of having a whole bunch of policy functions to sort requests in
> each policy (like the patch on BZ 13634).
> 
> Here is the concept of "heap" mentioned above:
> 
> (it's quoted from the following link)
> A heap is a specialized tree-based data structure that satisfies the heap
> property: if B is a child node of A, then key(A) ≥ key(B). This implies that an
> element with the greatest key is always in the root node, and so such a heap is
> sometimes called a max-heap. Alternatively, if the comparison is reversed, the
> smallest element is always in the root node, which results in a min-heap.
> 
> Please see more detail at:
> http://en.wikipedia.org/wiki/Heap_%28data_structure%29
> 
> Heap can give (log N) for both inserting element and removing element with
> greatest/lowest key, so it's very scalable, here are some sample heap APIs
> from a simple prototype:
> 
>     /* create a heap, @compare is callback for sorting elements */
>     cfs_binheap_t *cfs_binheap_create(cfs_binheap_cmp_t compare,
>                                       unsigned int flags);
> 
>     /* destroy a heap */
>     void cfs_binheap_destroy(cfs_binheap_t *h);
> 
>     /* insert @element into heap */
>     int cfs_binheap_insert(cfs_binheap_t *h, void *elelment);
> 
>     /* pop root for the @heap, which has greatest/lowest key in @heap */
>     void *cfs_binheap_pop_root(cfs_binheap_t *heap);
> 
> and an example of using "heap" to implement Client-Based Round Robin:
> 
>   - allocating a "heap" for the service which needs CBRR
> 
>   - the service also have a "global round-number" which is always incremental
>     . GRN: Global Round Number
>     . CRN: Client Round Number
>     . RRN: Request Round Number
> 
>   Process of inserting a request
>   - when there is new request:
>     . if the incoming request is the first request for a client, assign
>       the GRN to both the RRN of the request and CNR of the client
> 
>     . if the incoming request isn't the first request for a client,
>       increase the CNR on the client then assign it to RRN of the request.
> 
>   - insert the request into the heap, the heap can sort requests based
>     on callback like this:
>     . primary key is RRN of request
> 
>     . secondary key is client ID (i.e: nid of client), secondary keys are
>       compared only if primary keys are equal
> 
>   Process of dequeue a request:
> 
>   - pop the root element from the heap, which should always has the lowest RRN
> 
>   - if the RRN of the request is larger than GRN of the service,
>     which means the service has already handled all requests with <= GRN and
>     should set the GRN of service to the RRN of request
> 
> Again, this just an example to demonstrate how to use "heap" to implement CBRR
> and show some preliminary ideas, it could be helpful to raise some discussion
> at here and to see whether it's possible to improve and apply this to NRS and
> make it to be a common library for future implementation of OBRR or any
> policy needs to prioritize requests, or for QoS.
> 
> Thanks
> Liang
> 
> On Jul 2, 2011, at 11:39 PM, Nathan Rutman wrote:
> 
> > I think there is a whole slew of interesting possibilities to try out once an infrastructure for implementing different policies is in place. We're trying to get that infrastructure established first; we will be able to switch between policies dynamically to see their effects and optimize performance. People will be able to easily add new policies to test. 
> > You raise a good point about the SMP scaling patches; we will have to take a close look to make sure we don't undo any of the work you've done there. 
> > 
> > On Jul 2, 2011, at 3:49 AM, "Liang Zhen" <liang at whamcloud.com> wrote:
> > 
> >> Hi Nikitas,
> >> 
> >> It's a very interesting document, thanks for sharing these great ideas.
> >> May I ask whether there are any tests/numbers based on OBRR description
> >> in HLD? We always want to see numbers from these algorithms, unfortunately
> >> there is no reliable testing result from the patch on BZ 13634.
> >> 
> >> We are actually considering about NRS as well, although we don't have a
> >> formal design document like this HLD, but we'd like to share a few
> >> preliminary ideas for discussion:
> >> 
> >> - Target Based Round Robin (TBRR) probably is something worth to have
> >> 
> >> Briefly, it's just load balancing over OSTs to improve overall server
> >> performance.
> >> 
> >> - Fairness for clients and resource control for jobs
> >> 
> >> I think they are similar to CBRR and UBRR described in your document
> >> though I didn't see too much detail about them in the HLD.
> >> Personally, I think they are important and we probably will do some tests
> >> for CBRR survey very soon.
> >> 
> >> . Client-Based Round Robin (CBRR) can guarantee the server responses
> >>   to all clients fairly, and get whole-cluster load balancing, improve
> >>   concurrency of clients and jobs, and get better overall performance.
> >> 
> >> . resource control for jobs, some users complained that a busy job will
> >>   hog all resources on servers, and make the cluster not usable for other
> >>   control command or sysadmin. So it might be helpful to support job
> >>   resource control inside NRS.
> >> 
> >> - Layering NRS polices
> >> 
> >> Of course, OBRR is very important policy for NRS, but it might be
> >> helpful to have multiple polices working at the same time, i.e:
> >> 
> >> . bind OSTs on CPU partitions on NUMA system(please see more detail at
> >>   http://jira.whamcloud.com/browse/LU-56)
> >> 
> >> . Service threads on each CPU partition can do TBRR for bound OSTs. If
> >>   there is no CPU partition (like current Lustre) or OSTs are not bound
> >>   on CPU partitions, service threads just do round robin over all OSTs.
> >> 
> >> . OBRR inside each OST
> >> 
> >> . of course, these layers should be tunable.
> >> 
> >> - Overhead of layerd polices
> >> . there definitely will be some overhead for inserting/removing
> >>   request from these queues (or whatever), so we want some very scalable
> >>   algorithms to implement these polices.
> >> 
> >> Again, these are just some preliminary ideas, so we would appreciate any
> >> comment and suggestion.
> >> 
> >> Thanks
> >> Liang
> >> 
> >> 
> >> On Jun 30, 2011, at 11:49 PM, Nikitas Angelinas wrote:
> >> 
> >>> Hi,
> >>> 
> >>> There is a slightly more up-to-date version of the HLD, which I am
> >>> attaching.
> >>> 
> >>> 
> >>> Thanks,
> >>> Nikitas
> >>> 
> >>> On Wed, 2011-06-29 at 12:55 -0700, Nathan Rutman wrote:
> >>>> Sharing with the community.  All comments welcome.
> >>>> This HLD (high-level design) for a Network Request Scheduler is more
> >>>> about infrastructure than algorithm.
> >>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> We're actually in the DLD (detailed-level design) stage at the moment
> >>>> (sorry it didn't occur to me to 
> >>>> post this earlier).  I'll post the DLD after some minor revision.
> >>>> _______________________________________________
> >>>> Lustre-devel mailing list
> >>>> Lustre-devel at lists.lustre.org
> >>>> http://lists.lustre.org/mailman/listinfo/lustre-devel
> >>> 
> >>> 
> >>> ______________________________________________________________________
> >>> This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
> >>> 
> >>> Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
> >>> 
> >>> Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
> >>> 
> >>> The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
> >>> ______________________________________________________________________
> >>> 
> >>> 
> >>> <HLD_of_Lustre_NRS.pdf>_______________________________________________
> >>> Lustre-devel mailing list
> >>> Lustre-devel at lists.lustre.org
> >>> http://lists.lustre.org/mailman/listinfo/lustre-devel
> >> 
> >> _______________________________________________
> >> Lustre-devel mailing list
> >> Lustre-devel at lists.lustre.org
> >> http://lists.lustre.org/mailman/listinfo/lustre-devel
> > ______________________________________________________________________
> > This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.
> > 
> > Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.
> > 
> > Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.
> > 
> > The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
> > ______________________________________________________________________
> > 
> > 
> 

______________________________________________________________________
This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.

Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.

Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.

The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
______________________________________________________________________