[tsf-dev] TSF process feedback - part 1
Hello,
Thank you for all the emails; they have been a good read. This email
responds to excerpts of the thread. I will follow tomorrow or the day
after with a more complete, and more nicely written, set of
thoughts/reflections that ties together a lot of what is written here.
# Part 1
## On Constructing Graphs
Ssam: Another gotcha: our graph of statements has become simply too
big, and has too many different owners. On reflection, it seems like
100 statements is a sensible target for a TSF graph, and we should
perhaps enforce an upper limit of 500 statements.
The size of the graph is a function of both the scope of the argument
and the level of granularity applied at each step (subclaims -> claim).
Given this, I believe setting an arbitrary "upper limit" is not a
particularly good solution, as both variables will vary widely across
projects. If the graph becomes unwieldy, revisit both and tune
accordingly.
Derek: Limiting the size will limit its applicability. Some form of
modularization is needed.
I agree that the right approach is modularisation. The current remote
graph implementation offers only limited capabilities, and is a bit of a
faff to use. Notably, with the current remote graph, you are not able to
segment the argument for a given system into independent arguments for
each subsystem. I have been working on a new implementation tracked at
https://gitlab.eclipse.org/eclipse/tsf/tsf/-/issues/514. Modules are
exposed as units decoupled from the "main" graph; to import a module you
define its source, and it is fetched, versioned, and usable within the
project relatively seamlessly (essentially a minimal package manager).
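To sketch how such an import might look (the names, format, and pinning
scheme here are my own illustration, not the design tracked in the
issue):

```python
from dataclasses import dataclass

@dataclass
class ModuleSource:
    """Hypothetical pinned source for an imported TSF module."""
    url: str       # where the module's graph lives
    revision: str  # pinned version, so imports are reproducible

def import_module(source: ModuleSource) -> None:
    # A real implementation would fetch source.url at source.revision,
    # record the pin (like a package manager's lock file), and expose
    # the module's statements to the importing graph.
    print(f"fetching {source.url} @ {source.revision}")

import_module(ModuleSource(
    url="https://example.org/brakes-subsystem-argument.git",
    revision="v1.2.0",
))
```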
Ssam: One gotcha: people have at times imported existing texts into TSF
as Statements, without taking time to rewrite them. One example was a
step by step process where each step became a statement.
Existing texts and standards are not written with regard to TSF's model;
effort needs to be put into wording each statement carefully so that the
argument also makes sense within the framework.
There is an automatic mode of thinking, when people try to represent an
existing text/standard, that each claim must be captured and satisfied
verbatim; I don't believe this is the optimal approach. For example, I
mapped an argument for an ISO standard where I originally had each
clause as a claim, and found the graph growing unmanageably large.
Instead, I refocussed the graph so that claims were not exact copies
of the standard's clauses and were not decomposed as aggressively, but
still captured the essence of what the standard was looking for. This
approach ended with a better graph.
Ssam: The names can get long. But it's less cognitive load than having
to keep hundreds of numbered codes in your head.
I agree the framework encourages weird item nomenclature, though I think
this is partly not the fault of the framework and pertains to the idea
that "naming things is one of the hardest problems in software". That
being said, I'm not sure how to solve the problem, because we want names
to be both unique and descriptive in a graph of possibly ~500 items.
Hopefully the "trudag shell" command and its autocomplete for items have
helped with the repetitive task of typing item names for each command.
## References as Evidence
Ssam:
* Is this the *right* evidence?
* Is this *enough* evidence?
* What context do you need to *understand* this evidence?
* How often will this piece of evidence change and *require re-review*?
The first two should be reframed:
- Is this *relevant* evidence for the claim it is supporting?
- Is this *sufficient* evidence to convince the audience that the claim
is true?
For 3 we could add a field which allows you to tie the evidence to the
claim it is supporting (I assume this is what you mean).
For 4, you can never be sure before you have data, especially if what is
being referenced is not in your control.
I have toyed around with the idea of an optional parameter on
references, let's call it "interval". You would set this such that
trudag will only re-fetch and rehash the reference content if
(last_fetch_time + interval) < current_time.
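As a minimal sketch of that check (the parameter name and the stored
last_fetch_time are assumptions, not current trudag behaviour):

```python
import time

def should_refetch(last_fetch_time: float, interval: float) -> bool:
    """Re-fetch and rehash the reference content only once `interval`
    seconds have passed since the last fetch."""
    return (last_fetch_time + interval) < time.time()

# A reference fetched an hour ago, with a daily interval, is left alone.
assert not should_refetch(time.time() - 3600, interval=24 * 3600)
```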
Ssam:
Some anti-patterns that we've seen:
* Evidence that relates to the topic of the statement, but doesn't
demonstrate that it's true.
* Evidence that is a small section of a file, but someone referenced
the entire file.
* Evidence that is serialized data in some format, but without any link
to documentation of the data model or the program that reads it.
1. In the case of a reference as evidence, it's ultimately on the SME to
evaluate to what degree a statement is true given the referenced
content; referenced content can directly or indirectly contextualise
this assessment.
2. They shouldn't reference the whole file. I think we can improve the
built-in reference types in the tooling to support more granular
referencing natively, akin to what you have done for custom reference
types (see the sketch after this list).
3. I think this could be a feature (worth an issue, maybe?).
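A minimal sketch of what granular referencing could look like (the
function and line-range scheme are hypothetical, not an existing
reference type): hash only the cited span of a file, so unrelated edits
elsewhere in the file do not mark the evidence as changed.

```python
import hashlib
from pathlib import Path

def hash_file_span(path: str, start: int, end: int) -> str:
    """Hash only lines start..end (1-indexed, inclusive) of a file,
    so the reference only changes when the cited span changes."""
    lines = Path(path).read_text().splitlines()[start - 1 : end]
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()
```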
Ssam: References might be in the same repo as the graph, but in many
cases they are stored outside the Git repo. Each external reference
adds a cost, as Trudag will fetch it on startup to see if it changed.
If this comes in via network, this means that generating your assurance
case now depends on some network infrastructure, and if you don't host
the file yourself, you should think about mirroring it.
There are quite a few annoying issues with remote references:
- Yes, requires network connection; pain but necessary.
- We never cache the content or hash of the "last known" version of the
reference. This means that when it comes to reviewing a hash change,
where the cause is a change in the reference, there is no diff to
provide the reviewer context.
  - To achieve a diff, we need a lock file for references containing
the last commit hash of each reference, so it can be re-fetched and
compared (a sketch follows this list).
- This is another situation where we are constrained by the .dot
format; there is no support for lists or dicts in node/edge attribute
sets.
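A minimal sketch of the lock-file idea (the file name, format, and
fetch helper are assumptions, not an existing trudag feature):

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

LOCK_FILE = Path("references.lock.json")  # hypothetical name

def check_reference(ref_id: str, fetch: Callable[[str], bytes]) -> None:
    """Compare freshly fetched reference content against the locked
    hash, and tell the reviewer when it has changed."""
    lock = json.loads(LOCK_FILE.read_text()) if LOCK_FILE.exists() else {}
    new_hash = hashlib.sha256(fetch(ref_id)).hexdigest()
    old_hash = lock.get(ref_id)
    if old_hash and old_hash != new_hash:
        print(f"{ref_id}: changed ({old_hash[:8]} -> {new_hash[:8]})")
        # Caching the old content alongside the hash would allow a
        # proper textual diff to be shown here.
    lock[ref_id] = new_hash
    LOCK_FILE.write_text(json.dumps(lock, indent=2))
```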
Derek: My experience with cited evidence for statements in software
engineering is that it's often unconnected with what is claimed, or is
circumstantial, or the data is poorly analysed. Bedtime reading:
http://knosof.co.uk/ESEUR/
'evidence' also needs to come with confidence bounds.
The book looks interesting; I will look into it. I need to think about
this specific topic more.
Lesson: keep references to the minimum necessary, and review them
carefully.
Rephrasing: only reference what is absolutely necessary, and review
reference relevance periodically.
# Part 2/3
For individual statements, it can be nice for an expert to record how
close we are to achieving some goal, by putting e.g. 0.5 against the
statement to show that we’re halfway there.
I think this can be reworded as "for an expert to record the current
degree of sufficiency for a claim".
An estimate of a claim's completeness does exist in the scoring
algorithm; there is just no way to express it on the nodes currently. I
see no reason why this shouldn't be a feature, apart from a belief that
humans won't be particularly good at estimating this, so I have opened
an issue: https://gitlab.eclipse.org/eclipse/tsf/tsf/-/issues/515.
Ssam: “All documentation is in Git”, but I know that half of the
documentation lives in Google Docs, how should I score that statement?
> Derek: The answer is 0
The answer should be 0; the claim is not true. This is a situation where
the TSF "score" can be inconsistent.
## On Validators
Ssam:
`If Foo receives invalid input, it raises an error.`
* Does the testing cover all possible invalid inputs?
* When did those tests actually last run? How old is the report you're
reading?
* How do failing tests affect the score? Should it go to zero if a test
failed? (See the related question in the previous section).
* Is the validator script doing what you expect? Maybe it has a bug.
You are able to compute sensitivities, which describe, for each claim,
how much every other claim influences its score. Therefore, you can work
this out for every claim with respect to a specified piece of evidence
(a sketch of the idea follows).
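As an illustration (a numerical toy, not trudag's actual sensitivity
machinery), you can estimate how sensitive an aggregate score is to one
piece of evidence by perturbing that evidence's score:

```python
def root_score(leaf_scores: dict[str, float]) -> float:
    """Toy aggregate: the root's score as the mean of its leaves.
    In practice this would be the full scoring recursion."""
    return sum(leaf_scores.values()) / len(leaf_scores)

def sensitivity(leaf_scores: dict[str, float], leaf: str,
                eps: float = 1e-3) -> float:
    """Finite-difference estimate of d(root)/d(leaf)."""
    perturbed = dict(leaf_scores, **{leaf: leaf_scores[leaf] + eps})
    return (root_score(perturbed) - root_score(leaf_scores)) / eps

scores = {"tests_pass": 0.9, "docs_reviewed": 0.6, "fuzzing_ran": 0.3}
print(sensitivity(scores, "tests_pass"))  # ~1/3 for a three-leaf mean
```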
Validators should be tested. It could be another claim in the argument:
"all validator scripts behave as expected".
In general, Derek is correct that most of these should be answered by
claims and evidence in the graph, though I think scalability may be an
issue if we have to add such statements for every instance of "tests".
Ssam: One TSF feature that would help a lot here is allowing validator
plugins to generate content that appears in the report.
I agree. There is a slight problem, which we also have with references,
in how to render arbitrary validator data. I think this can be solved;
it just needs a generic solution, not one tied intrinsically to mkdocs.
A new validator schema is a solution here, where we support both
versions but gradually move away from the current one.
Ssam: Another thing to be careful of, is the more advanced your testing
the more complex it becomes to generate the report. Similar to external
references, if your validator plugins pull data from external sources
you need to worry about availability and access control.
There is a tradeoff here, for each validator, between:
1. Computing elsewhere, with the validator interpreting the output.
   - Allows dependency chains and parallelism to be handled by another
tool (e.g. CI), but offers weaker proof that the data being interpreted
was created by the same system state (i.e. you could be using bad data).
2. Trudag running the computation and interpreting the output.
   - Allows closer coupling between data creation and interpretation,
but trudag does not offer assurances on how and when validators run
relative to one another.
Luckily, validators are relatively flexible (you can write them however
you please), so 1. and 2. can be applied as required. It is worth
recognising the tradeoff though (both styles are sketched below).
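A minimal sketch of the two styles (the function shapes are
hypothetical; trudag does not mandate a particular validator signature):

```python
import json
import subprocess

def validator_interpret_ci(report_path: str) -> float:
    """Style 1: CI already ran the tests; only interpret its report.
    Assumes a hypothetical JSON report with passed/total counts."""
    with open(report_path) as f:
        report = json.load(f)
    return report["passed"] / report["total"]

def validator_run_and_interpret() -> float:
    """Style 2: run the computation here, then interpret it.
    Couples data creation to interpretation, at the cost of runtime."""
    result = subprocess.run(["pytest", "--tb=no", "-q"])
    return 1.0 if result.returncode == 0 else 0.0
```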
## On Scoring
Ssam: Lesson: Agree what the scores mean within your team!
TSF might need to specify what the score means in more detail.
> Derek: This means that scores are different across teams/projects, which makes them useless for anything other than an indicator of team progress (which is not the purpose of these scores).
> Without an agreed method of scoring it is not possible to combine scores to create a meaningful global value.
The score in current TSF has no "meaning" (I think the docs state it is
confidence/probability in a statement, but in my view it does not
function like this at all). That's not to say it's useless; it functions
as more of an indicator: it goes up if evidence is, in general, good,
and down otherwise. It can help indicate where to direct your attention
during continuous evaluation.
I will expand on this in the following email. But given the definitions
in the documentation:
Docs: A Link from Statement A to Statement B means that Statement A
logically implies Statement B. It can be helpful to remember this is
equivalent to "B is a necessary but not sufficient condition for A". By
convention, we refer to Statement A as the parent and B as the child.
I'm not sufficiently clued up on logic, but this seems wrong. It defines
the logical implication "A -> B_i", which does no argumentative work; we
can't say anything about A given information about B.
TSF evaluates higher-level claims on the basis of evidence (at the
bottom), so we really want to state the converse relationship: "what do
we learn about A from all B_i?".
This is not a solved problem; there is a whole area of research on this.
Note that the proposal at
https://gitlab.eclipse.org/eclipse/tsf/tsf/-/merge_requests/596 does
address this issue by enforcing that the sub-claims *entail* the
claim; that is, if the sub-claims are true, the claim must be true,
switching from "A -> B_i" to "B_1 AND B_2 AND ... AND B_N -> A". In
short, it simplifies this specific problem (illustrated below).
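A small illustration of why the direction matters (a toy truth-table
check, nothing from the TSF tooling): under "A -> B_i", all children
being true tells us nothing about A, whereas under the entailment
reading it forces A to be true.

```python
from itertools import product

worlds = list(product([False, True], repeat=3))  # assignments to (A, B1, B2)
implies = lambda p, q: (not p) or q

# Old reading: A -> B1 and A -> B2. In worlds where both children hold,
# A can still take either value.
old = {a for a, b1, b2 in worlds
       if implies(a, b1) and implies(a, b2) and b1 and b2}
print(old)  # {False, True}: A is undetermined

# Proposed reading: (B1 AND B2) -> A. In worlds where both children
# hold, A is forced.
new = {a for a, b1, b2 in worlds
       if implies(b1 and b2, a) and b1 and b2}
print(new)  # {True}: the children entail A
```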
Ssam: Trudag generates a score based on some maths that I don’t fully
understand
The maths is actually quite simple behind all the symbols: score(claim)
= completeness(claim) * mean([score(child) for child in claim.children])
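As a runnable sketch of that recursion (the node structure and the leaf
case are assumptions on my part):

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Claim:
    completeness: float                # estimate in [0, 1]
    children: list["Claim"] = field(default_factory=list)

def score(claim: Claim) -> float:
    """score(claim) = completeness * mean of the children's scores;
    a leaf's score is taken to be its own completeness."""
    if not claim.children:
        return claim.completeness
    return claim.completeness * mean(score(c) for c in claim.children)

root = Claim(1.0, [Claim(0.9), Claim(0.5, [Claim(0.8), Claim(0.6)])])
print(score(root))  # 1.0 * mean(0.9, 0.5 * mean(0.8, 0.6)) = 0.625
```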
# Part 4
Ssam: All of these mails are building to the key selling point of TSF
and the reason we use it: our assurance case is stored in Git,
alongside the product we are building, and the two evolve together.
I agree; the novelty of TSF comes from how the assurance case is
continually managed, and from the accompanying processes.
Ssam: This is needed because of external references: some statements
may be marked as Suspect through no fault of the MR author.
Perhaps remote and local references should be treated differently by
the tooling: a linting flag, say `--local`, that only considers changes
to local references. It could be run in MR pipelines so that
contamination from unrelated external references is ignored; then,
during merge or release pipelines, the flag is unset, so that the newly
unreviewed items/links can be dealt with separately.
Ssam: The downside is it does slow down development. Sometimes an SME
is unavailable, and you have to decide whether to wait for them to
return, or have someone else take over their score. It can lead to four
reviewers being called in over a one-line change that just removes
whitespace. And large changes that touch many files can be much more
expensive to land.
Ssam: This is a problem that all software projects have, and my only
advice is to be flexible
I will keep in mind the churn for reviewers when implementing changes.
Best Regards,
Nathan