Strategies for shutting down complex component hierarchy [message #1836986]
Wed, 20 January 2021 15:19
Dear TITAN community,
at Osmocom, there is one problem that has hit us again and again over the years: how to safely shut down a complex hierarchy of components exchanging messages, without running into dynamic test case errors during teardown.
In one of our typical test cases, we set up dozens to sometimes hundreds of components, most of them with internal ports connected between components. Maybe only half of those components contain actual test case logic; the others are just emulating some underlying protocol stack, or in some way facilitating the connection between the IUT and the tests.
At some point the actual test case components terminate, and let's assume the test concluded successfully: all components have, up to now, either verdict "none" or "pass".
Then the runtime starts stopping the various components, in whatever order. At this point it may happen that one of the components is still processing some message and sends it to another component that has already been shut down -> boom. A dynamic test case error is reported, and the overall verdict becomes "error", even though all of the tests had passed before.
Originally I had hoped that it would be sufficient to simply make sure that all external ports like IPL4asp are closed first, so that no external messages can be received anymore. That helps, but...
This can also be triggered by timeouts, as many protocol layers have internal timers for sending keep-alives. If such a timer fires after the component that would receive the resulting message has already been terminated -> boom.
We already tried an explicit "all component.stop"; it doesn't help either.
The most obvious solution to this problem would be to either have
* some way to "lock" the current verdict, i.e. whatever happens beyond this point can no longer affect the overall verdict. The test author could simply put that "lock" instruction once he knows that everything relevant to his test has finished, and everything else happening thereafter is irrelevant for the overall verdict and must be ignored
or
* some way to reliably prevent all components from sending further messages through their ports
I'm somewhat surprised that there appears to be no obvious solution to the problem.
Sure, in "textbook TTCN3" you would probably have all of this complex non-testcase code as part of the "system simulator", which resides outside of your TTCN3 runtime and hence on the "other" side of the test ports.
However, in TITAN with its capability for "internal" ports between components, it is much more likely that there is a lot of code that just provides underlying logic/transport for the actual test cases. Even Ericsson has released such code, like the SCCP_Emulation - so it seems to be an accepted programming paradigm.
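To illustrate what I mean by "internal" ports, here is a toy sketch (all names invented; the "internal" extension is TITAN-specific and avoids having to provide a test port implementation):

module InternalPortSketch {

  type port PT message {
    inout charstring
  } with { extension "internal" }  // TITAN: no test port implementation needed

  type component Emu_CT  { port PT USER; }
  type component Test_CT { port PT LOWER; }

  function f_emu_main() runs on Emu_CT {
    alt {
      [] USER.receive(charstring:?) { setverdict(pass); }
    }
  }

  testcase TC_internal_ports() runs on Test_CT {
    var Emu_CT vc_emu := Emu_CT.create;
    // connected component-to-component, instead of map()-ed towards the system
    connect(self:LOWER, vc_emu:USER);
    vc_emu.start(f_emu_main());
    LOWER.send(charstring:"hello");
    vc_emu.done;
  }
}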
I also thought it would be possible to designate entire components (or component types) as "not relevant to the final verdict", but that does not help either: you still want to be able to catch violations of lower-layer protocols inside those "emulation" components and do a setverdict(fail) if you encounter such a problem.
How do other people solve this? How do you suggest we proceed?
Thanks,
Harald
[Updated on: Tue, 30 November 2021 10:24]
Re: Strategies for shutting down complex component hierarchy [message #1837003 is a reply to message #1836990]
Wed, 20 January 2021 20:12
Gábor Szalai wrote on Wed, 20 January 2021 16:43: In our internal simulator, we solved the issue using a controller component.
Yes, it is obviously possible to do this. But it is a *lot* of effort to implement such a port in every component, and repeat the related code over and over again. To me, that is a very expensive work-around for something that sounds relatively easy to do within the runtime.
We often have relatively complex/deep hierarchies, so it would be very difficult for one central component to even know all other components / component references. There is normally no need for that kind of knowledge. Yes, one could delegate the task of forwarding/flooding this "shutdown" request across the hierarchy. But really? Repeating the same code again and again?
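To make concrete what "repeating the same code" means, here is a rough sketch (all names hypothetical) of the per-component boilerplate such a controller-driven shutdown seems to require:

module ShutdownPortSketch {

  type enumerated ShutdownCmd { SHUTDOWN };
  type enumerated ShutdownAck { SHUTDOWN_ACK };

  // every component type needs an extra port towards the controller...
  type port ShutdownPT message {
    in  ShutdownCmd;
    out ShutdownAck;
  } with { extension "internal" }

  type component Controlled_CT {
    port ShutdownPT CTRL;
  }

  // ...and every component's main loop has to offer an alternative like this
  altstep as_shutdown() runs on Controlled_CT {
    [] CTRL.receive(ShutdownCmd:SHUTDOWN) {
      // close own external ports, cancel own timers, ...
      CTRL.send(ShutdownAck:SHUTDOWN_ACK);
      stop;
    }
  }
}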
Furthermore, not all components we use are developed by Osmocom. Take for example the SCCP_Emulation we use from the TITAN project. Adding such a "SHUTDOWN" interface would mean we'd have to fork every 3rd-party component we use, maintain our own patchset on top, ...
If there were some generic support in the TITAN runtime, one could get rid of a lot of explicit extra code in every TITAN component out there and avoid all of that extra effort.
I still have a hard time believing that this has not been solved before in a "one size fits all" approach, for everyone. I hoped I simply had missed this feature. Given that you also seem to create elaborate work-arounds, it seems like it really doesn't exist :(
Do you think it's feasible to implement either of the approaches I suggested? I'm not familiar with the TITAN internals, but I might be able to have a look at how verdicts are collected and try to implement the "verdict freeze".
Re: Strategies for shutting down complex component hierarchy [message #1837298 is a reply to message #1837288]
Wed, 27 January 2021 09:21
Hi Gabor,
yes, "stop" can be used (and is used by us). However, if you stop, let's say, 50 parallel test components (e.g. with "all component.stop"), that "stop" is neither atomic, nor does it leave the overall system state in a way that prevents further DTEs.
In particular, incoming messages arriving on those ports in the meantime from other components still result in dynamic test case errors. Or maybe it is on the transmit side, when you try to send to the component reference of a stopped PTC - in any case it still causes a DTE.
So either
a) the 'all component.stop' would have to be somehow atomic/synchronized, to prevent any other component from sending further messages before all of them are stopped
b) the internal ports between the PTCs somehow have to enter a new state that doesn't cause a DTE if a message is sent to a stopped PTC.
Please note the above reflects my "external" view. I have very limited knowledge of the TITAN internals / implementation, I just share my experience as somebody who is developing relatively complex test suites on it and observing the behavior.
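For the record, a minimal sketch of the kind of race I mean (all names invented): a keep-alive timer inside an "emulation" PTC that can fire while the shutdown is already in progress:

module KeepaliveRaceSketch {

  type port PT message { inout charstring } with { extension "internal" }
  type component Emu_CT { port PT PEER; }

  function f_keepalive(float p_interval := 2.0) runs on Emu_CT {
    timer T_ka := p_interval;
    T_ka.start;
    alt {
      [] T_ka.timeout {
        // if the peer PTC was already stopped by "all component.stop",
        // this send() is exactly what turns the verdict into "error"
        PEER.send(charstring:"KEEPALIVE");
        T_ka.start;
        repeat;
      }
      [] PEER.receive { repeat; }
    }
  }
}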
Re: Strategies for shutting down complex component hierarchy [message #1837302 is a reply to message #1837298]
Wed, 27 January 2021 09:31
Also, another problem in this context is that "all component.stop" seems to only work when executed from the MTC. I'm not sure whether that is a TTCN3 language limitation or one of the TITAN implementation.
That's rather inconvenient, as in complex component hierarchies the decision whether and when to stop (for example in "failure" cases) can be made anywhere. And when you terminate, you don't want that "fail" verdict to occasionally turn into an "error" due to a DTE caused by the race-prone shutdown of the component hierarchy.
So I'm still wondering what the "correct" approach here is to avoid all of the above, or where my misunderstanding lies.
Re: Strategies for shutting down complex component hierarchy [message #1848376 is a reply to message #1837302]
Tue, 30 November 2021 10:28
The problem exists to this day, and we have seen no real solution proposed so far.
At this point it is not even clear to us whether TITAN behaves as TTCN3 intends. Given that TITAN is the only TTCN3 implementation we have experience with, it's hard to know.
Extending every component type with a dedicated port to a "shutdown controller" sounds like a very invasive work-around - and it is also not implemented in those components that the TITAN project is releasing/providing.
So what is the solution?
From the user point of view, some kind of "freezing" of the verdict state might be the simplest approach: once you know your test has succeeded, you freeze/lock the verdicts, and whatever errors might happen later during the complex shutdown of tons of components no longer matters; the verdict will not be affected after the freeze.
[Updated on: Tue, 30 November 2021 10:36]
Re: Strategies for shutting down complex component hierarchy [message #1848388 is a reply to message #1848380]
Tue, 30 November 2021 14:05
Hi Olaf,
Olaf Bergengruen wrote on Tue, 30 November 2021 13:18: Hi all,
For this reason, at the beginning of each test case we ask the user to switch off the UE (and/or take the batteries out, clean the SIM card) and switch it on again, and the complete TTCN executable is started again and all HW is reset.
To clarify: this topic is not about resetting state in the IUT. It is about errors occurring in the ATS (the TTCN3 test suite) after the actual test cases have completed. Those errors occur with a certain probability due to a complex component hierarchy with e.g. timers triggering messages between ATS components. So while the MTC starts to stop, or even during "all component.stop", the usual recipient of some internal message suddenly no longer exists, as the recipient has been stopped before the sender -> boom.
As there is no way to "atomically" stop all components (and the shutdown order of the components is non-deterministic), there is always a certain probability that one of those components causes a DTE at some point during the shutdown process.
We have been seeing this ever since we started to use TITAN years ago, and it is the single most constant annoyance during all those years.
If the ATS has reached the point where the MTC completes with "pass", then nothing that happens during shutdown should be able to negatively affect the test result anymore. It is guaranteed to be a non-issue.
I think the same problem must appear in any reasonably complex test suite with a component hierarchy where parts of the ATS implement various layers of protocol stacks. There are timers, asynchronous messaging, etc. happening in this stack, and without a way to atomically shut all of them down, it is impossible to guarantee that no problem will happen during shutdown.
Re: Strategies for shutting down complex component hierarchy [message #1848572 is a reply to message #1848545]
Wed, 08 December 2021 11:27
Hi Olaf,
thanks a lot for your follow-up, it is much appreciated. I was not aware of the "all component.stop" and "any component.killed" constructs. We will investigate them.
From what I can understand, the only way this construct would improve the situation is if the notification of stopping/killing a component were processed at a higher priority than any other messages on internal test ports between components.
I would think the gravity of the problem depends highly on the depth and complexity of the component hierarchy. Particularly if you have a lot of "non-testcase logic" implemented in your ATS, i.e. entire protocol stacks as intermediate layers inside the ATS, the probability increases that some timer somewhere expires, causing a message to be sent on an internal test port between components, which in turn may fail because the recipient component might have been killed, in a race condition, just before the sender component was killed.
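If I read the documentation correctly, such an MTC-side teardown would look roughly like this (component type name invented; and as far as I can tell this only serializes things from the MTC's point of view, it does not by itself remove the race described above):

function f_teardown_from_mtc() runs on MTC_CT {
  all component.stop;   // stop the behaviour of every PTC (allowed on the MTC only)
  all component.done;   // block until every PTC has actually stopped
  all component.kill;   // then destroy them and release their ports
  // an alt branch on "any component.killed" could additionally be used to
  // react to individual PTCs terminating while the test body is still running
}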
What I'm wondering is: whatever mechanism is the "best common practice" out there, why is it not implemented in the official TITAN components, such as for example titan.ProtocolEmulations.SCCP or .M3UA? I would appreciate it if anyone from the TITAN project could comment on that. How is one supposed to prevent race conditions during component shutdown when using those as-is?
Best Regards,
Harald
Re: Strategies for shutting down complex component hierarchy [message #1848605 is a reply to message #1848604]
Thu, 09 December 2021 14:45
Olaf, thanks a lot for your efforts!!
but .... seriously?
one separate 'done' variable and related done clause for each PTC?
This is ugly, and it may work only for the most simplistic cases.
What about test suites that spawn complex hierarchies of many dozens or even hundreds of PTCs, and do so dynamically, based on configuration files? In such situations it is impossible to alter the code of the already compiled test suite.
Any solution or work-around for the problem must scale automatically to any complex component hierarchy, without having to handle each new PTC with explicit additional variables and code.
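To illustrate the kind of scaling I mean (names invented, and again not a fix for the DTE race, only for the bookkeeping): the waiting itself already needs no per-PTC variables at all, no matter how many components the configuration makes us create:

testcase TC_dynamic_hierarchy() runs on MTC_CT {
  var integer i;
  // f_num_emulations_from_config() stands in for whatever reads our config file
  for (i := 0; i < f_num_emulations_from_config(); i := i + 1) {
    var Emu_CT vc_emu := Emu_CT.create;
    vc_emu.start(f_emu_main());
  }
  // ... actual test logic ...
  all component.done;   // one statement, independent of the number of PTCs
}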
Re: Strategies for shutting down complex component hierarchy [message #1848611 is a reply to message #1848609]
Thu, 09 December 2021 17:11
Olaf Bergengruen wrote on Thu, 09 December 2021 17:16: In my view, the shut down process must be designed in the test suite from the beginning and not afterwards.
Olaf
I think that is a valid (but probably contestable) position to take.
However, particularly with that position in mind, this brings me back to the question further up in the thread: how is one supposed to do that when using official TITAN modules such as the components of the M3UA and SCCP Emulation?
I would appreciate any feedback from the TITAN team on this.
Re: Strategies for shutting down complex component hierarchy [message #1848620 is a reply to message #1848616]
Thu, 09 December 2021 21:33
Hi Gabor,
thanks a lot for your feedback.
Gábor Szalai wrote on Thu, 09 December 2021 19:36: The shutdown of a complex system should be designed from the beginning.
This seems to be the message I'm getting from various parties here. I still have a bit of a hard time wrapping my head around the _why_. Why should one spend a lot of time on something as mundane as component shutdown signaling/ordering? After all, at some point the MTC has concluded that the test is "over" as it completes. Why can all the other PTCs not simply be terminated automatically/implicitly, in any random order, without further impact on the verdict?
I'm trying to understand the benefit of requiring everyone to write complex explicit code for nothing else but making sure no stray message on some random internal test port causes a DTE _after the actual test case (MTC) has concluded_.
From my humble user's point of view, I would expect it to be the task of the language and runtime to let the test developer be as productive as possible, and not to spend unneeded time writing complex code for what happens after the actual test has succeeded (or failed).
Gábor Szalai wrote on Thu, 09 December 2021 19:36
Please note that the M3UA and SCCP Emulation are designed as standalone components connected with a simple test case. Also, they were written about 15 years ago. They can be extended and modified to support more complex systems. The easiest modification is to use a try-catch block to avoid the DTE.
Some thoughts:
* The age of a component/module should not matter, unless the language/runtime has introduced the danger of DTE during shutdown only recently. In fact, an older component could very well be more evolved/mature.
* If every library/module has to deal with the shutdown order, and if libraries/modules/components are supposed to be re-usable across projects and entities, then there must be some kind of standardization for orderly shutdown. Otherwise no single library/module could ever be re-used in another project, as everyone would come up with their own incompatible strategy for "orderly shutdown".
* In general, irrespective of the programming language, I have a strong resistance against modifying upstream libraries/modules. It introduces additional maintenance for keeping out-of-mainline patches; they need to be forward-ported and re-tested whenever upstream changes -> maintenance nightmare.
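Coming back to the try-catch suggestion: if I understand correctly, this refers to the TITAN-specific @try/@catch extension (non-standard TTCN3). A rough sketch of how a teardown-time send could be wrapped (port and message names invented):

@try {
  PEER.send(charstring:"KEEPALIVE");
}
@catch (v_dte_msg) {
  // v_dte_msg is a charstring holding the DTE text; swallowing it here keeps a
  // late send from turning an otherwise passed run into "error"
  log("ignoring DTE during shutdown: ", v_dte_msg);
}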
Gábor Szalai wrote on Thu, 09 December 2021 19:36
Also, the DTE during shutdown can be avoided by using alive-type components. The test ports of alive components are not disconnected/unmapped when the component finishes, only when it is killed or the MTC terminates.
Interesting idea. Maybe that could be a workaround, will investigate.
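For reference (reusing the toy names from my earlier sketches), the difference would only be in how the PTC is created; if I understand Gábor correctly, the ports of an alive PTC stay connected after its behaviour returns, until it is explicitly killed or the MTC terminates:

function f_setup_alive_emulation() runs on Test_CT {
  var Emu_CT vc_emu := Emu_CT.create("emu_0") alive;
  connect(self:LOWER, vc_emu:USER);
  vc_emu.start(f_emu_main());
  // when f_emu_main() returns, the connection above survives, so a late
  // message no longer hits a disconnected port; clean up only at the very end:
  // vc_emu.kill;
}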
I still think that some kind of language or runtime support [like the "freezing of verdicts"] would avoid a lot of extra complexity that every developer has to write; as stated above, that complexity even impairs the effective re-use of existing modules due to the lack of standardized ways of handling orderly shutdown.