
[tsf-dev] TSF process feedback - part 1

Hello,

Thank you for all the emails; they have been a good read. This email responds to excerpts of the thread. I will follow up tomorrow or the day after with a more complete, and more nicely written, set of thoughts and reflections that ties together much of what is written here.

# Part 1

## On Constructing Graphs

Ssam: Another gotcha: our graph of statements has become simply too big, and has too many different owners. On reflection, it seems like 100 statements is a sensible target for a TSF graph, and we should perhaps enforce an upper limit of 500 statements.

The size of the graph is a function of both the scope of the argument and the level of granularity applied at each step (subclaims -> claim). Given this, I believe setting an arbitrary "upper limit" is not a particularly good solution, as both variables will vary widely between projects. If the graph becomes unwieldy, consider both and do some tuning.

Derek: Limiting the size will limit its applicability. Some form of modularization is needed.

I agree that the right approach is modularisation. The current remote graph implementation offers only limited capabilities and is a bit of a faff to use. Notably, with the current remote graph, you cannot segment the argument for a given system into independent arguments for each subsystem. I have been working on a new implementation, tracked at https://gitlab.eclipse.org/eclipse/tsf/tsf/-/issues/514. Modules are exposed, decoupled from the "main" graph; to import a module you define its source, and it is fetched, versioned, and usable within the project relatively seamlessly (it's basically a minimal package manager).
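To make the intent concrete, here is a minimal sketch of the fetch-and-pin flow I have in mind. Everything here (`ModuleSource`, the pinning scheme, the file layout being hashed) is illustrative, not the implementation tracked in the issue:

```python
import hashlib
import subprocess
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ModuleSource:
    url: str       # e.g. a Git repository holding the subsystem's argument
    revision: str  # a tag or branch, so imports stay reproducible

def fetch_module(source: ModuleSource, dest: str) -> str:
    """Fetch a module at a pinned revision and return a content hash.

    The hash lets the main graph detect when an imported module has
    changed and needs re-review, much like a package manager lock file.
    """
    subprocess.run(
        ["git", "clone", "--depth", "1", "--branch", source.revision,
         source.url, dest],
        check=True,
    )
    digest = hashlib.sha256()
    for path in sorted(Path(dest).rglob("*.md")):  # file layout is illustrative
        digest.update(path.read_bytes())
    return digest.hexdigest()
```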

Ssam: One gotcha: people have at times imported existing texts into TSF as Statements, without taking time to rewrite them. One example was a step by step process where each step became a statement.

Existing texts and standards are not written with regard to TSF's model; effort needs to be put into wording each statement carefully so that the argument also makes sense within the framework.

There is an automatic mode of thinking when people try to represent an existing text/standard, where each claim must be captured and satisfied verbatim; I don't believe this is the optimal approach. For example, I mapped an argument for an ISO standard where I originally had each clause as a claim, and found the graph growing unmanageably large. Instead, I refocussed the graph so that claims were not exact copies of the standard's claims and were not decomposed as aggressively, but still captured the essence of what the standard was looking for. This approach ended with a better graph.

Ssam: The names can get long. But it's less cognitive load than having to keep hundreds of numbered codes in your head.

I agree the framework encourages weird item nomenclature, though I think this is only partly the fault of the framework; it pertains to the idea that "naming things is one of the hardest problems in software". That being said, I'm not sure how to solve the problem, because we want names to be both unique and descriptive in a graph of possibly ~500 items. Hopefully the "trudag shell" command and its autocomplete for item names has helped with the repetitious task of typing them for each command.

## References as Evidence

Ssam:
* Is this the *right* evidence?
* Is this *enough* evidence?
* What context do you need to *understand* this evidence?
* How often will this piece of evidence change and *require re-review*?

The first two should be reframed:
- Is this *relevant* evidence for the claim it is supporting?
- Is this *sufficient* evidence to convince the audience that the claim is true?

For 3, we could add a field which allows you to tie the evidence to the claim it is supporting (I assume this is what you mean).

For 4, you can never be sure before you have data, especially if what is being referenced is not in your control.

I have toyed around with the idea of an optional parameter on references; let's call it "interval". You would set this such that trudag only re-fetches and rehashes the reference content if (last_fetch_time + interval) < current_time.
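A minimal sketch of that check, assuming the parameter is stored per reference (the names here are hypothetical):

```python
from datetime import datetime, timedelta

def should_refetch(last_fetch_time: datetime, interval: timedelta) -> bool:
    # Re-fetch and re-hash the referenced content only once the
    # configured interval has elapsed since the last fetch.
    return last_fetch_time + interval < datetime.now()
```

A reference with interval = timedelta(0) would behave exactly as today, being re-fetched on every run.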

Ssam:
Some anti-patterns that we've seen:
* Evidence that relates to the topic of the statement, but doesn't demonstrate that it's true.
* Evidence that is a small section of a file, but someone referenced the entire file.
* Evidence that is serialized data in some format, but without any link to documentation of the data model or the program that reads it.

1. In the case of a reference as evidence, it's ultimately on the SME to evaluate to what degree a statement is true given the referenced content; referenced content can directly or indirectly contextualise this assessment.
2. They shouldn't reference the whole file. I think we can improve the built-in reference types in the tooling to more natively support granular referencing, akin to what you have done for custom reference types.
3. I think this could be a feature (an issue, maybe?).

Ssam: References might be in the same repo as the graph, but in many cases they are stored outside the Git repo. Each external reference adds a cost, as Trudag will fetch it on startup to see if it changed. If this comes in via network, this means that generating your assurance case now depends on some network infrastructure, and if you don't host the file yourself, you should think about mirroring it.

There are quite a few annoying issues with remote references:
- Yes, it requires a network connection; a pain, but necessary.
- We never cache the content or hash of the "last known" version of the reference. This means that when it comes to reviewing a hash change caused by a change in the reference, there is no diff to give the reviewer context.
- To achieve a diff, we need a lock file for references containing the last commit hash of each reference, so it can be re-fetched and compared.
- This is another situation where we are constrained by the .dot format; there is no support for lists or dicts in node/edge attribute sets.
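A sketch of what such a lock file could look like, kept next to the graph rather than inside it (the file name and format are hypothetical):

```python
import json
from pathlib import Path

LOCK_FILE = Path("references.lock.json")  # name and format are hypothetical

def load_lock() -> dict:
    return json.loads(LOCK_FILE.read_text()) if LOCK_FILE.exists() else {}

def pin_reference(lock: dict, ref_id: str, commit: str, content_hash: str) -> None:
    # Record the last-known state of a remote reference so a later
    # fetch can be compared against it and a diff shown to the reviewer.
    lock[ref_id] = {"commit": commit, "hash": content_hash}

def save_lock(lock: dict) -> None:
    LOCK_FILE.write_text(json.dumps(lock, indent=2, sort_keys=True))
```

Keeping this in a separate file also sidesteps the .dot attribute-set limitation mentioned above.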

Derek: My experience with cited evidence for statements in software engineering is that it's often unconnected with what is claimed, or is circumstantial, or the data is poorly analysed. Bedtime reading: http://knosof.co.uk/ESEUR/
'evidence' also needs to come with confidence bounds.

The book looks interesting; I will look into it.

Need to think about this specific topic more.

Ssam: Lesson: keep references to the minimum necessary, and review them carefully.

Rephrasing: only reference what is absolutely necessary, and review reference relevance periodically.

# Part 2/3

Ssam: For individual statements, it can be nice for an expert to record how close we are to achieving some goal, by putting e.g. 0.5 against the statement to show that we’re halfway there.

I think this can be reworded as "for an expert to record the current degree of sufficiency for a claim".

An estimate of a claim's completeness does exist in the scoring algorithm; there is just no way to express it on the nodes currently. I see no reason why this shouldn't be a feature, apart from a belief that humans won't be particularly good at estimating it, so I have opened an issue: https://gitlab.eclipse.org/eclipse/tsf/tsf/-/issues/515.

Ssam: “All documentation is in Git”, but I know that half of the documentation lives in Google Docs, how should I score that statement?
> Derek: The answer is 0

The answer should be 0; the claim is not true. This is a situation where the TSF "score" can be inconsistent.

## On Validators
Ssam:
`If Foo receives invalid input, it raises an error.`
* Does the testing cover all possible invalid inputs?
* When did those tests actually last run? How old is the report you're reading?
* How do failing tests affect the score? Should it go to zero if a test failed? (See the related question in the previous section.)
* Is the validator script doing what you expect? Maybe it has a bug.

You are able to compute sensitivities which describe, for each claim, how much every other claim influences its score. Therefore, you can work this out for every claim with respect to a specified piece of evidence.
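For intuition, assuming the simple mean-based scoring formula quoted at the end of this email, the sensitivity of a claim A to one of its direct children B works out to:

```latex
\frac{\partial\,\mathrm{score}(A)}{\partial\,\mathrm{score}(B)}
  = \frac{c_A}{\lvert\mathrm{children}(A)\rvert},
  \qquad B \in \mathrm{children}(A)
```

where c_A is A's completeness; for an indirect descendant, the chain rule multiplies these factors along the path between the two claims.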

Validators should be tested. It could be another claim in the argument: "all validator scripts behave as expected".

In general, Derek is correct that most of these should be answered by claims and evidence in the graph. Though I think scalability may become an issue if we have to add such statements for every instance of "tests".

Ssam: One TSF feature that would help a lot here is allowing validator plugins to generate content that appears in the report.

I agree. There is a slight problem, which we also have with references, in how to render arbitrary validator data. Note that I think this can be solved; it just needs a generic solution, not one tied intrinsically to mkdocs. A new validator schema is a solution here, where we support both versions but gradually move away from the current one.

Ssam: Another thing to be careful of, is the more advanced your testing the more complex it becomes to generate the report. Similar to external references, if your validator plugins pull data from external sources you need to worry about availability and access control.

There is a tradeoff here, for each validator, between:
1. Computing elsewhere, with the validator interpreting the output.
- This allows dependency chains and parallelism to be handled by another tool, e.g. CI, but offers less proof that the data being interpreted was created by the same system state (i.e. you could be using bad data).
2. Trudag running the computation and interpreting the output.
- This allows closer coupling between data creation and interpretation, but trudag offers no assurances on how and when validators run relative to one another.

Luckily, validators are relatively flexible (you can write them however you please), so 1 and 2 can be applied as required; a sketch of both shapes follows. It is worth recognising the tradeoff, though.
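To illustrate the two shapes (the function names, the report format, and the idea that validators reduce to a score in [0, 1] are all assumptions of the sketch):

```python
import json
import subprocess
from pathlib import Path

def interpret_ci_artifact(report: Path) -> float:
    # Option 1: CI already ran the tests; the validator only interprets
    # the artifact. Cheap and parallel, but the artifact may have been
    # produced from a different system state than the one under review.
    results = json.loads(report.read_text())
    return results["passed"] / results["total"]

def run_and_interpret() -> float:
    # Option 2: the validator runs the computation itself, coupling data
    # creation and interpretation at the cost of doing the work inline.
    proc = subprocess.run(["pytest", "--quiet"], capture_output=True)
    return 1.0 if proc.returncode == 0 else 0.0
```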

## On Scoring

Ssam: Lesson: Agree what the scores mean within your team!
TSF might need to specify what the score means in more detail.
> Derek: This means that scores are different across teams/projects, which makes them useless for anything other than an indicator of team progress (which is not the purpose of these scores).
> Without an agreed method of scoring it is not possible to combine scores to create a meaningful global value.

The score in current TSF has no "meaning" (I think the docs state it is confidence/probability in a statement, but in my view it does not function like this at all). That's not to say it's useless; it functions as more of an indicator: it goes up if evidence is, in general, good, and down otherwise. It can help indicate where to direct your attention during continuous evaluation.

I will expand on this in the following email. But given the definitions in the documentation:

Docs: A Link from Statement A to Statement B means that Statement A logically implies Statement B. It can be helpful to remember this is equivalent to "B is a necessary but not sufficient condition for A". By convention, we refer to Statement A as the parent and B as the child.

I'm not sufficiently clued up on logic, but this seems wrong. It defines the logical implication "A -> B_i", which does no argumentative work; we can't say anything about A given information about B.

TSF evaluates higher-level claims on the basis of evidence (at the bottom), so we really want to state the converse relationship: "what do we learn about A from all B?"

This is not a solved problem; there is a whole area of research on it.

Note that the proposal at https://gitlab.eclipse.org/eclipse/tsf/tsf/-/merge_requests/596 does address this issue, by enforcing that the sub-claims *entail* the claim; that is, if the sub-claims are true, the claim must be true, switching from "A -> B_i" to "B_1 AND B_2 AND ... AND B_N -> A". Short story: it simplifies this specific problem.

Ssam: Trudag generates a score based on some maths that I don’t fully understand

The maths is actually quite simple behind all the symbols: score(claim) = claim_completeness * mean([score(child) for child in claim.children])
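Spelled out as runnable pseudocode (the base case for leaf claims is my assumption; the field names are illustrative):

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Claim:
    completeness: float  # the expert's estimate in [0, 1]
    children: list["Claim"] = field(default_factory=list)

def score(claim: Claim) -> float:
    # A leaf claim is scored by its completeness alone; otherwise the
    # completeness scales the mean of the children's scores.
    if not claim.children:
        return claim.completeness
    return claim.completeness * mean(score(child) for child in claim.children)
```

So a claim with completeness 1.0 over two leaves scored 0.5 and 1.0 comes out at 0.75.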

# Part 4

Ssam: All of these mails are building to the key selling point of TSF and the reason we use it: our assurance case is stored in Git, alongside the product we are building, and the two evolve together.

I agree; the novelty of TSF comes from how the assurance case is continually managed and from the accompanying processes.

Ssam: This is needed because of external references: some statements may be marked as Suspect through no fault of the MR author.

Perhaps remote and local references should be treated differently by the tooling: a linting flag, `--local`, that considers only changes to local references and can be run in MR pipelines, so that contamination from unrelated external references is ignored. Then, during merge or release pipelines, the flag is unset, so that newly unreviewed items/links can be dealt with separately.
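Since `--local` is only a proposal, here is the filtering it implies, with `is_remote` as a stand-in for however trudag actually classifies reference types:

```python
from urllib.parse import urlparse

def is_remote(uri: str) -> bool:
    # Treat anything with a URL scheme as remote; repo-relative paths
    # count as local. (A stand-in for the real classification.)
    return urlparse(uri).scheme in ("http", "https", "git", "ssh")

def references_to_lint(references: list[str], local_only: bool) -> list[str]:
    # With --local (MR pipelines), ignore remote references so unrelated
    # upstream changes cannot mark the MR's items as Suspect; without it
    # (merge/release pipelines), check everything.
    if local_only:
        return [ref for ref in references if not is_remote(ref)]
    return references
```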

Ssam: The downside is it does slow down development. Sometimes an SME is unavailable, and you have to decide whether to wait for them to return, or have someone else take over their score. It can lead to four reviewers being called in over a one-line change that just removes whitespace. And large changes that touch many files can be much more expensive to land.

Ssam: This is a problem that all software projects have, and my only advice is to be flexible

I will keep in mind the churn for reviewers when implementing changes.

Best Regards,

Nathan

