[lustre-devel] [PATCH 1/6] Autoconf option for rate-limiting Quality of Service (RLQOS)

Mon Apr 17 09:46:03 PDT 2017

On 04/17/2017 05:32 AM, Brinkmann, Prof. Dr. André wrote:
> I fully agree that your approach to learn a small rule-set is very interesting to optimize overall 
> Lustre bandwidth. What I have not been able to fully understand from your paper is the cost of 
> adaptation. What is happening in a cluster running many jobs at the same time applying very different 
> access patterns (in very different combinations to different OSSes)?

When there are many jobs, their aggregate I/O pattern can usually be
treated as a mixed random read/write workload. The more jobs you have,
the more uniformly random the I/O pattern is. My experience is that such
workloads are not that hard to optimize. The hardest cases to optimize
are when only one or two I/O jobs are running and they have a very
special I/O pattern.

> We have just started to collect these patterns. Might be interesting to apply different (machine learning)
> algorithms on top of these patterns going into different directions:
> 
> - Optimize overall bandwidth (like ASCAR is doing)

This is similar to what I'm working on. I've been systematically testing
many machine learning algorithms on bandwidth optimization, and some of
them have pretty good results. My problem is that all my workloads so
far are synthetic.

> - Optimize bandwidth while supporting QoS rules for certain
> applications

This is on my radar. I'll look into your design and implementation to
see how we can do something interesting together.

> Will you be at LUG? At least Tim from our team will participate and it might be a good opportunity to discuss
> a joint approach.

I'm not sure yet. Now that I've graduated, I need to find my own funding
source for travel.

--
Yan