[Lustre-discuss] Metadata storage in test script files

Tue May 8 06:51:16 PDT 2012

Hi,
On 05/08/2012 01:34 AM, Chris Gearing wrote:
> 
> 
> On Mon, May 7, 2012 at 7:33 PM, Nathan Rutman <nrutman at gmail.com
> <mailto:nrutman at gmail.com>> wrote:
> 
> 
>     On May 4, 2012, at 7:46 AM, Chris Gearing wrote:
> 
>     > Hi Roman,
>     >
>     > I think we may have rat-holed here and perhaps it's worth just
>     > re-stating what I'm trying to achieve here.
>     >
>     > We have a need to be able to test in a more directed and targeted
>     > manner, to be able to focus on a unit of code like lnet or an
>     > attribute of capability like performance. However since starting
>     > work on the Lustre test infrastructure it has become clear to me
>     > that knowledge about the capability, functionality and purpose of
>     > individual tests is very general and held in the heads of Lustre
>     > engineers. Because we are talking about targeting tests we require
>     > knowledge about the capability, functionality and purpose of the
>     > tests not the outcome of running the tests, or to put it another
>     > way what the tests can do not what they have done.
>     >
>     > One key fact about cataloguing the capabilities of the tests is
>     > that for almost every imaginable case the capability of the test
>     > only changes if the test itself changes, and so the rate of change
>     > of the data in the catalogue is at most the same as, and actually
>     > much less than, the rate of change of the test code itself. The
>     > only exception to this could be that a test suddenly discovers a
>     > new bug which has to have a new ticket attached to it, although
>     > this should be very rare if we manage our development process
>     > properly.
>     >
>     > This requirement leads to the conclusion that we need to catalogue
>     > all of the tests within the current test-framework, and a catalogue
>     > equates to a database; hence we need a database of the capability,
>     > functionality and purpose of the individual tests. With this
>     > requirement in mind it would be easy to create a database using
>     > something like mysql that could be used by applications like the
>     > Lustre test system, but an approach like that would make the
>     > database very difficult to share, and it would be even harder to
>     > attach the knowledge to the Lustre tree, which is where it belongs.
>     >
>     > So the question I want to solve is how to catalogue the
>     > capabilities of the individual tests in a database, store that data
>     > as part of the Lustre source and, as a bonus, make the data readable
>     > and even carefully editable by people as well as machines. Now, to
>     > focus on the last point, I do not think we should constrain
>     > ourselves to something that can be read by machine using just bash;
>     > we do have access to structured languages and should make use of
>     > that fact.
>     >
>     I think we all agree 100% on the above...
> 
>     > The solution to all of this seemed to be to store the catalogue about
>     > the tests as part of the tests themselves
>     ... but not necessarily that conclusion.
> 
>     > , this provides for human and
>     > machine accessibility, implicit version control and certainty that
>     > whatever happens to the Lustre source, the data goes with it. It is
>     > also the case that by keeping the catalogue with the subject, the
>     > maintenance of the catalogue is more likely to occur than if the
>     > two are separate.
> 
>     I agree with all those.  But there are some difficulties with this
>     as well:
>     1. bash isn't a great language to encapsulate this metadata
> 
>  
> The thing to focus on I think is the data captured not the format. The
> parser for yaml encapsulated in the source or anywhere else is a small
> amount of effort compared to capturing the data in the first place. If
> we capture the data and it's machine readable then changing the format
> is easy.
> 
> There are many advantages today to keeping the source and the metadata
> in the same place, one being that when reviewing new or updated tests
> the reviewers can, and by the locality will be encouraged to, ensure
> the metadata matches the new or revised test. If the two are not
> together then they have very little chance of being kept in sync.

I also have more than one concern. You are suggesting embedding in bash
a structure that has its own formal description. Who will check that the
embedded structure is correct, and when? A formal structure must be
checked by tools, not by eye. For example, I use Rx tools with a schema
definition for YAML. Having to extract the YAML data and check it
separately makes those tools less comfortable to use.
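
As an illustration of checking embedded metadata by tool rather than by
eye, here is a minimal sketch in Python. Everything in it is an
assumption made for the example: the "#md:" comment convention, the
field names and the tiny schema are hypothetical, and the checker is
neither Rx nor a real YAML parser.

```python
import re

# Hypothetical convention (not from the thread): metadata is kept in
# "#md: key: value" comment lines inside the bash test script.
SCRIPT = """\
#md: name: test_1a
#md: components: lnet, mdt
#md: purpose: basic create and unlink
test_1a() {
    touch $DIR/f1a
}
"""

# Hypothetical schema: field name -> True if the field is required.
SCHEMA = {"name": True, "components": True, "purpose": True, "tickets": False}


def extract_metadata(script_text):
    """Collect key/value pairs from '#md:' comment lines."""
    fields = {}
    for line in script_text.splitlines():
        match = re.match(r"#md:\s*(\w+):\s*(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def validate(fields, schema):
    """Return a sorted list of schema violations (empty means valid)."""
    errors = [f"unknown field: {key}" for key in fields if key not in schema]
    errors += [
        f"missing required field: {key}"
        for key, required in schema.items()
        if required and not fields.get(key)
    ]
    return sorted(errors)


metadata = extract_metadata(SCRIPT)
print(validate(metadata, SCHEMA))  # an empty list means the block is valid
```

The extraction step is the extra work being objected to here: a
separate metadata file could be handed to a schema tool directly.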

To be honest, I don't see a big difference between using two files and
one file from a developer's point of view. This is more a question of
discipline than of comfort. The same developer could just as easily
ignore a description that is placed nearby. (From my experience with the
test life cycle, descriptions become good after a few cycles of adding,
changing and reviewing them; often this is the result of developer-user
interaction. As a result, the most problematic tests have the best
descriptions.)

> 
>     2. this further locks us in to current test implementation - there's
>     not much possibility to start writing tests in another language if
>     we're parsing through looking for bash-formatted metadata. Sure,
>     multiple parsers could be written...
> 
>  
> I don't think it is a lock in at all, the data is machine readable and
> moving to a new format, if and when we need it, will be easy. Let's
> focus on capturing the data so we increase our knowledge, once we have
> the data we can manipulate it however we want. Keeping the data and the
> metadata together, in my opinion, increases the chance of capturing and
> updating the data given today's methods and tools.
> 
>     3. difficulty changing md of groups of tests en masse - e.g. add
>     "slow" keyword to a set of tests
> 
> 
> The data can be read and written by machine, and the libraries/applications
> to do this would be written. Referring back to the description of the
> metadata we would not be making sweeping changes to test metadata
> because the metadata should only change when the test changes
> [exceptions will always apply but we should not optimize for exceptions].
> 
> Also I don't think 'slow' would be part of the metadata because it
> is not an attribute of the test, it is an attribute of how the test is
> used. We need to be strict and clear here. The metadata describes the
> functionality of the test code, and slow is not a test code function.
> If we want to be able to select 'slow' then we need to understand what
> code functionality of a test causes it to be a 'slow' test and ensure
> those attributes are captured.

Do you suggest having separate metadata about how a test is used? There
is some logical vagueness: test metadata can become test usage metadata
and back. Where is the border? For example, Component from your
suggestion can also be test usage metadata. "SLOW", in general, is a set
of tests with big coverage and small time. If we put into a test info
about its coverage, does that become test metadata?

>  
> 
>     4. no inheritance of characteristics - each test must explicitly
>     list every piece of md.  This not only blows up the amount of md it
>     also is a source for typos, etc. to cause problems.
> 
> 
> I'm not against inheritance, but the inheritance must be explicit, not
> implicit. We want to draw out knowledge about the tests. If we just
> allow people to say 'all 200 tests in this file are X, Y, Z' then that
> is what will happen: no one will check each test to make sure it is
> true, and our data will be corrupted before we start.

Exactly the same behavior, but with copy-paste, is possible when adding
info to every test. And I don't see problems with implicit inheritance
of, e.g., the Components field. In some test suites it really is the
case that all tests have one Components set. Moreover, I think it is
possible to derive some test Components automatically from test
coverage. Maybe we can solve this by enabling implicit inheritance for a
limited list of fields?
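
Implicit inheritance restricted to a limited list of fields could look
roughly like the following Python sketch. The field names, the
INHERITABLE set and the resolve() helper are all hypothetical, chosen
only to illustrate the idea.

```python
# Hypothetical: only these fields may be inherited from suite defaults.
INHERITABLE = {"components"}


def resolve(suite_defaults, test_md, inheritable=INHERITABLE):
    """Fill in missing per-test fields from suite defaults.

    Only fields named in `inheritable` are copied; anything the test
    declares itself always wins over the suite-level value.
    """
    resolved = dict(test_md)
    for field, value in suite_defaults.items():
        if field in inheritable and field not in resolved:
            resolved[field] = value
    return resolved


suite_defaults = {"components": ["lnet"], "purpose": "suite-wide text"}
test_md = {"name": "test_1a", "purpose": "basic create and unlink"}

resolved = resolve(suite_defaults, test_md)
print(resolved)
```

Here "components" is inherited from the suite, while "purpose" is never
inherited and must be written per test.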

> 
> So explicit inheritance might make sense, and please do propose an
> inheritance model for the data, we can discuss the storage format later
> but today let's just understand how inheritance relates to our bash tests.

What is 'explicit inheritance' in the case of your suggestion? And why
is it needed?

>  
> 
>     5. no automatic modification of characteristics.  In particular, one
>     piece of md I would like to see is "maximum allowed test time" for
>     each test.  Ideally, this could be measured and adjusted
>     automatically based on historical and ongoing run data.  But it
>     would be dangerous to allow automatic modification to the script itself.
> 
> 
> I really do not think maximum test time as a measurement is a piece of
> test metadata.

If we want to provide some help or advice to a new user, where should we
store this data? What is the difference between Tickets, Component,
Purposes and 'Assumed Execution time'? All of these fields are not just
precise descriptions but also advice.

> 
> Metadata describes the functionality of the test that is encapsulated
> within the test code itself, if the code said 'run for 60 minutes and no
> more' then maximum time would be an attribute.

It will be a different field. 60 min is 'max time'; 45 min is 'Assumed
Execution time'. No conflict there.
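
A small Python sketch of the two fields living side by side, using the
60 and 45 minute figures above. The class, field and function names are
hypothetical and only illustrate that a hard limit and an advisory
estimate do not conflict.

```python
from dataclasses import dataclass


@dataclass
class TimeMetadata:
    max_time_min: int      # hard limit enforced by the test code itself
    assumed_time_min: int  # advisory estimate for whoever runs the test


def classify_run(md, elapsed_min):
    """Compare one run's duration against both fields."""
    if elapsed_min > md.max_time_min:
        return "over limit"
    if elapsed_min > md.assumed_time_min:
        return "slower than assumed"
    return "as assumed"


md = TimeMetadata(max_time_min=60, assumed_time_min=45)
print(classify_run(md, 50))
```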

> 
> Maybe there are a set of useful attributes like amount of storage used,
> or minimum clients, or minimum osts etc. etc, again these can only be
> metadata if they are implicit in the test code, and for most tests
> they would not be definable, and the variability might be impossible to
> systematically capture, although I do think it's worth having a go.

But this data is 1) helpful and 2) something I already use. Where could
we store this data?

Thanks,
	Roman

>  
>  
> 
>     To address those problems, I think a database-type approach is
>     exactly right, or perhaps a YAML file with hierarchical inheritance.
>     To some degree, this is an "evolution vs revolution" question, and I
>     prefer to come down on the revolution-enabling design, despite the
>     problems you list.  Basically, I believe the separated MD model
>     allows for the replacement of test-framework, and this, to my mind,
>     is the majority driver for adding the MD at all.
> 
> Database is good and I believe metadata in the source fulfils that
> objective whilst being something that we can manage with what we have
> today manually, whilst easily creating tools for some automation. When
> we do begin work on a new test framework approach we will have all the
> data at hand to be manipulated in any way that we want, including, if
> we want, separating it and storing it somewhere else.
> 
> I don't think creating the metadata however is linked with a new
> test-framework, creating the metadata is required because today we do
> not know what we have and we need to know what we have today whatever
> strategy we use for the future.
> 
> Chris
>