
[tsf-dev] TSF process feedback - part 1

Hello,

Thank you for all the emails; they have been a good read. This email responds to excerpts of the thread. I will follow up tomorrow or the day after with a more complete, and more nicely written, set of thoughts and reflections that ties together much of what is written here.

# Part 1

## On Constructing Graphs

Ssam: Another gotcha: our graph of statements has become simply too big, and has too many different owners. On reflection, it seems like 100 statements is a sensible target for a TSF graph, and we should perhaps enforce an upper limit of 500 statements.

The size of the graph is a function of both the scope of the argument and the level of granularity applied at each step (subclaims -> claim). Given this, I believe setting an arbitrary "upper limit" is not a particularly good solution, as both variables will vary widely between projects. If the graph becomes unwieldy, consider both and do some tuning.

Derek: Limiting the size will limit its applicability. Some form of modularization is needed.

I agree that the right approach is modularisation. The current remote graph implementation offers only limited capabilities and is a bit of a faff to use. Notably, with the current remote graph, you cannot segment the argument for a given system into independent arguments for each subsystem. I have been working on a new implementation, tracked at https://gitlab.eclipse.org/eclipse/tsf/tsf/-/issues/514. Modules are exposed, decoupled from the "main" graph; to import a module you define its source, and it is fetched, versioned, and usable within the project relatively seamlessly (it's basically a minimal package manager).
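To make the intent concrete, here is a minimal sketch of the fetch-and-pin flow I have in mind. Everything here (`ModuleSource`, the pinning scheme, the file layout being hashed) is illustrative, not the implementation tracked in the issue:

```python
import hashlib
import subprocess
from dataclasses import dataclass
from pathlib import Path

@dataclass
class ModuleSource:
    url: str       # e.g. a Git repository holding the subsystem's argument
    revision: str  # a tag or branch, so imports stay reproducible

def fetch_module(source: ModuleSource, dest: str) -> str:
    """Fetch a module at a pinned revision and return a content hash.

    The hash lets the main graph detect when an imported module has
    changed and needs re-review, much like a package manager lock file.
    """
    subprocess.run(
        ["git", "clone", "--depth", "1", "--branch", source.revision,
         source.url, dest],
        check=True,
    )
    digest = hashlib.sha256()
    for path in sorted(Path(dest).rglob("*.md")):  # file layout is illustrative
        digest.update(path.read_bytes())
    return digest.hexdigest()
```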

Ssam: One gotcha: people have at times imported existing texts into TSF as Statements, without taking time to rewrite them. One example was a step by step process where each step became a statement.

Existing texts and standards are not written with regard to TSF's model; effort needs to be put into wording each statement carefully so that the argument also makes sense within the framework.

There is an automatic mode of thinking when people try to represent an existing text/standard, where each claim must be captured and satisfied verbatim; I don't believe this is the optimal approach. For example, I mapped an argument for an ISO standard where I originally had each clause as a claim, and found the graph growing unmanageably large. Instead, I refocussed the graph so that claims were not exact copies of the standard's claims and were not decomposed as aggressively, but still captured the essence of what the standard was looking for. This approach ended with a better graph.

Ssam: The names can get long. But it's less cognitive load than having to keep hundreds of numbered codes in your head.

I agree the framework encourages weird item nomenclature, though I think this is only partly the fault of the framework; it pertains to the idea that "naming things is one of the hardest problems in software". That being said, I'm not sure how to solve the problem, because we want names to be both unique and descriptive in a graph of possibly ~500 items. Hopefully the "trudag shell" command and its autocomplete for item names has helped with the repetitious task of typing them for each command.

## References as Evidence

Ssam:
* Is this the *right* evidence?
* Is this *enough* evidence?
* What context do you need to *understand* this evidence?
* How often will this piece of evidence change and *require re-review*?

The first two should be reframed:
- Is this *relevant* evidence for the claim it is supporting?
- Is this *sufficient* evidence to convince the audience that the claim is true?

For 3, we could add a field which allows you to tie the evidence to the claim it is supporting (I assume this is what you mean).

For 4, you can never be sure before you have data, especially if what is being referenced is not in your control.

I have toyed around with the idea of an optional parameter on references; let's call it "interval". You would set this such that trudag only re-fetches and rehashes the reference content if (last_fetch_time + interval) < current_time.
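A minimal sketch of that check, assuming the parameter is stored per reference (the names here are hypothetical):

```python
from datetime import datetime, timedelta

def should_refetch(last_fetch_time: datetime, interval: timedelta) -> bool:
    # Re-fetch and re-hash the referenced content only once the
    # configured interval has elapsed since the last fetch.
    return last_fetch_time + interval < datetime.now()
```

A reference with interval = timedelta(0) would behave exactly as today, being re-fetched on every run.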

Ssam:
Some anti-patterns that we've seen:
* Evidence that relates to the topic of the statement, but doesn't demonstrate that it's true.
* Evidence that is a small section of a file, but someone referenced the entire file.
* Evidence that is serialized data in some format, but without any link to documentation of the data model or the program that reads it.

1. In the case of a reference as evidence, it's ultimately on the SME to evaluate to what degree a statement is true given the referenced content; referenced content can directly or indirectly contextualise this assessment.
2. They shouldn't reference the whole file. I think we can improve the built-in reference types in the tooling to more natively support granular referencing, akin to what you have done for custom reference types.
3. I think this could be a feature (an issue, maybe?).

Ssam: References might be in the same repo as the graph, but in many cases they are stored outside the Git repo. Each external reference adds a cost, as Trudag will fetch it on startup to see if it changed. If this comes in via network, this means that generating your assurance case now depends on some network infrastructure, and if you don't host the file yourself, you should think about mirroring it.

There are quite a few annoying issues with remote references:
- Yes, it requires a network connection; a pain, but necessary.
- We never cache the content or hash of the "last known" version of the reference. This means that when it comes to reviewing a hash change caused by a change in the reference, there is no diff to give the reviewer context.
- To achieve a diff, we need a lock file for references containing the last commit hash of each reference, so it can be re-fetched and compared.
- This is another situation where we are constrained by the .dot format; there is no support for lists or dicts in node/edge attribute sets.
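A sketch of what such a lock file could look like, kept next to the graph rather than inside it (the file name and format are hypothetical):

```python
import json
from pathlib import Path

LOCK_FILE = Path("references.lock.json")  # name and format are hypothetical

def load_lock() -> dict:
    return json.loads(LOCK_FILE.read_text()) if LOCK_FILE.exists() else {}

def pin_reference(lock: dict, ref_id: str, commit: str, content_hash: str) -> None:
    # Record the last-known state of a remote reference so a later
    # fetch can be compared against it and a diff shown to the reviewer.
    lock[ref_id] = {"commit": commit, "hash": content_hash}

def save_lock(lock: dict) -> None:
    LOCK_FILE.write_text(json.dumps(lock, indent=2, sort_keys=True))
```

Keeping this in a separate file also sidesteps the .dot attribute-set limitation mentioned above.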

Derek: My experience with cited evidence for statements in software engineering is that it's often unconnected with what is claimed, or is circumstantial, or the data is poorly analysed. Bedtime reading: http://knosof.co.uk/ESEUR/
'evidence' also needs to come with confidence bounds.

The book looks interesting; I will look into it.

Need to think about this specific topic more.

Ssam: Lesson: keep references to the minimum necessary, and review them carefully.

Rephrasing: only reference what is absolutely necessary, and review reference relevance periodically.

# Part 2/3

Ssam: For individual statements, it can be nice for an expert to record how close we are to achieving some goal, by putting e.g. 0.5 against the statement to show that we’re halfway there.

I think this can be reworded as "for an expert to record the current degree of sufficiency for a claim".

An estimate of a claim's completeness does exist in the scoring algorithm; there is just no way to express it on the nodes currently. I see no reason why this shouldn't be a feature, apart from a belief that humans won't be particularly good at estimating it, so I have opened an issue: https://gitlab.eclipse.org/eclipse/tsf/tsf/-/issues/515.

Ssam: “All documentation is in Git”, but I know that half of the documentation lives in Google Docs, how should I score that statement?
> Derek: The answer is 0

The answer should be 0; the claim is not true. This is a situation where the TSF "score" can be inconsistent.

## On Validators
Ssam:
`If Foo receives invalid input, it raises an error.`
* Does the testing cover all possible invalid inputs?
* When did those tests actually last run? How old is the report you're reading?
* How do failing tests affect the score? Should it go to zero if a test failed? (See the related question in the previous section.)
* Is the validator script doing what you expect? Maybe it has a bug.

You are able to compute sensitivities which describe, for each claim, how much every other claim influences its score. Therefore, you can work this out for every claim with respect to a specified piece of evidence.
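For intuition, assuming the simple mean-based scoring formula quoted at the end of this email, the sensitivity of a claim A to one of its direct children B works out to:

```latex
\frac{\partial\,\mathrm{score}(A)}{\partial\,\mathrm{score}(B)}
  = \frac{c_A}{\lvert\mathrm{children}(A)\rvert},
  \qquad B \in \mathrm{children}(A)
```

where c_A is A's completeness; for an indirect descendant, the chain rule multiplies these factors along the path between the two claims.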

Validators should be tested. It could be another claim in the argument: "all validator scripts behave as expected".

In general, Derek is correct that most of these should be answered by claims and evidence in the graph. Though I think scalability may become an issue if we have to add such statements for every instance of "tests".

Ssam: One TSF feature that would help a lot here is allowing validator plugins to generate content that appears in the report.

I agree. There is a slight problem, which we also have with references, in how to render arbitrary validator data. Note that I think this can be solved; it just needs a generic solution, not one tied intrinsically to mkdocs. A new validator schema is a solution here, where we support both versions but gradually move away from the current one.

Ssam: Another thing to be careful of, is the more advanced your testing the more complex it becomes to generate the report. Similar to external references, if your validator plugins pull data from external sources you need to worry about availability and access control.

There is a tradeoff here, for each validator, between:
1. Computing elsewhere, with the validator interpreting the output.
- This allows dependency chains and parallelism to be handled by another tool, e.g. CI, but offers less proof that the data being interpreted was created by the same system state (i.e. you could be using bad data).
2. Trudag running the computation and interpreting the output.
- This allows closer coupling between data creation and interpretation, but trudag offers no assurances on how and when validators run relative to one another.

Luckily, validators are relatively flexible (you can write them however you please), so 1 and 2 can be applied as required; a sketch of both shapes follows. It is worth recognising the tradeoff, though.
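To illustrate the two shapes (the function names, the report format, and the idea that validators reduce to a score in [0, 1] are all assumptions of the sketch):

```python
import json
import subprocess
from pathlib import Path

def interpret_ci_artifact(report: Path) -> float:
    # Option 1: CI already ran the tests; the validator only interprets
    # the artifact. Cheap and parallel, but the artifact may have been
    # produced from a different system state than the one under review.
    results = json.loads(report.read_text())
    return results["passed"] / results["total"]

def run_and_interpret() -> float:
    # Option 2: the validator runs the computation itself, coupling data
    # creation and interpretation at the cost of doing the work inline.
    proc = subprocess.run(["pytest", "--quiet"], capture_output=True)
    return 1.0 if proc.returncode == 0 else 0.0
```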

## On Scoring

Ssam: Lesson: Agree what the scores mean within your team!
TSF might need to specify what the score means in more detail.
> Derek: This means that scores are different across teams/projects, which makes them useless for anything other than an indicator of team progress (which is not the purpose of these scores).
> Without an agreed method of scoring it is not possible to combine scores to create a meaningful global value.

The score in current TSF has no "meaning" (I think the docs state it is confidence/probability in a statement, but in my view it does not function like this at all). That's not to say it's useless; it functions as more of an indicator: it goes up if evidence is, in general, good, and down otherwise. It can help indicate where to direct your attention during continuous evaluation.

I will expand on this in the following email. But given the definitions in the documentation:

Docs: A Link from Statement A to Statement B means that Statement A logically implies Statement B. It can be helpful to remember this is equivalent to "B is a necessary but not sufficient condition for A". By convention, we refer to Statement A as the parent and B as the child.

I'm not sufficiently clued up on logic, but this seems wrong. It defines the logical implication "A -> B_i", which does no argumentative work; we can't say anything about A given information about B.

TSF evaluates higher-level claims on the basis of evidence (at the bottom), so we really want to state the converse relationship: "what do we learn about A from all B?"

This is not a solved problem; there is a whole area of research on it.

Note that the proposal at https://gitlab.eclipse.org/eclipse/tsf/tsf/-/merge_requests/596 does address this issue, by enforcing that the sub-claims *entail* the claim; that is, if the sub-claims are true, the claim must be true, switching from "A -> B_i" to "B_1 AND B_2 AND ... AND B_N -> A". Short story: it simplifies this specific problem.

Ssam: Trudag generates a score based on some maths that I don’t fully understand

The maths is actually quite simple behind all the symbols: score(claim) = claim_completeness * mean([score(child) for child in claim.children])
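Spelled out as runnable pseudocode (the base case for leaf claims is my assumption; the field names are illustrative):

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Claim:
    completeness: float  # the expert's estimate in [0, 1]
    children: list["Claim"] = field(default_factory=list)

def score(claim: Claim) -> float:
    # A leaf claim is scored by its completeness alone; otherwise the
    # completeness scales the mean of the children's scores.
    if not claim.children:
        return claim.completeness
    return claim.completeness * mean(score(child) for child in claim.children)
```

So a claim with completeness 1.0 over two leaves scored 0.5 and 1.0 comes out at 0.75.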

# Part 4

Ssam: All of these mails are building to the key selling point of TSF and the reason we use it: our assurance case is stored in Git, alongside the product we are building, and the two evolve together.

I agree; the novelty of TSF comes from how the assurance case is continually managed and from the accompanying processes.

Ssam: This is needed because of external references: some statements may be marked as Suspect through no fault of the MR author.

Perhaps remote and local references should be treated differently by the tooling: a linting flag, `--local`, that considers only changes to local references and can be run in MR pipelines, so that contamination from unrelated external references is ignored. Then, during merge or release pipelines, the flag is unset, so that newly unreviewed items/links can be dealt with separately.
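Since `--local` is only a proposal, here is the filtering it implies, with `is_remote` as a stand-in for however trudag actually classifies reference types:

```python
from urllib.parse import urlparse

def is_remote(uri: str) -> bool:
    # Treat anything with a URL scheme as remote; repo-relative paths
    # count as local. (A stand-in for the real classification.)
    return urlparse(uri).scheme in ("http", "https", "git", "ssh")

def references_to_lint(references: list[str], local_only: bool) -> list[str]:
    # With --local (MR pipelines), ignore remote references so unrelated
    # upstream changes cannot mark the MR's items as Suspect; without it
    # (merge/release pipelines), check everything.
    if local_only:
        return [ref for ref in references if not is_remote(ref)]
    return references
```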

Ssam: The downside is it does slow down development. Sometimes an SME is unavailable, and you have to decide whether to wait for them to return, or have someone else take over their score. It can lead to four reviewers being called in over a one-line change that just removes whitespace. And large changes that touch many files can be much more expensive to land.

Ssam: This is a problem that all software projects have, and my only advice is to be flexible

I will keep in mind the churn for reviewers when implementing changes.

Best Regards,

Nathan

