Eclipse Community Forums: Papyrus for Real Time » UML-RT/Papyrus-RT and the "Let it Crash" approach to handling faults

UML-RT/Papyrus-RT and the "Let it Crash" approach to handling faults [message #1777778]

Mon, 04 December 2017 10:38

Eclipse User

There's been a lot of talk about the "Let it Crash" approach to handling faults in large-scale distributed systems built with languages and toolkits such as Erlang and Akka. E.g.,
- https://www.developer.com/mgmt/akka-in-action-let-it-crash.html
- https://cacm.acm.org/magazines/2017/11/222167-hootsuite/fulltext

It seems to me that UML-RT would also support this approach quite nicely (assuming that, e.g., crashes of a capsule part can be detected easily by the owning parent).

What is your take on that?

Thanks

Juergen

Re: UML-RT/Papyrus-RT and the "Let it Crash" approach to handling faults [message #1777823 is a reply to message #1777778]

Mon, 04 December 2017 18:15

Eclipse User

That's a very interesting question. My take on it is that while UML-RT and Papyrus-RT do have some machinery that could be helpful in supporting fault-tolerance, neither was designed explicitly with this in mind.

As the Akka article describes, if you want to manage a capsule and handle potential crashes, you need several things. First, let's start with defining "crash". There could be several possible behaviours that could be considered a crash. For simplicity, I'll consider two kinds: "bad state" and "catastrophe". The first is more benign and corresponds to safety or liveness issues like a (set of) capsule(s) ending up in deadlock or ending in a bad state (hence the name) where it shouldn't be. The second kind corresponds to the capsule really crashing (e.g. action code resulting in a segfault).

Both kinds face similar issues. From the point of view of a supervisor, in both cases you face the problem of detecting lack of responsiveness. UML-RT and Papyrus-RT do not have explicit support for this. We could get pretty philosophical about this: how do you know that something isn't there? How do you detect the absence of something? If I recall correctly, Erlang had a mechanism in which, when a process fails, all its neighbours are informed. We do not have anything equivalent built-in, and therefore the burden is on the modeller. Of course, you can always use timeouts, but that still puts the burden on the modeller, at least on the supervisor side. We could also envision changes to the core language and the RTS to support such things, but there are no plans for this at the moment.
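To make the timeout idea a bit more concrete, here is a minimal sketch of the bookkeeping a supervisor capsule could do by hand. Everything in it is an assumption for illustration: `HeartbeatMonitor`, `watch`, `recordReply` and `overdue` are hypothetical names, not part of the Papyrus-RT runtime, and time is modelled as plain integer ticks that the caller is assumed to supply in non-decreasing order.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper, NOT a Papyrus-RT class. It only records when each
// supervised capsule last answered a probe and reports the ones that have
// been silent for longer than the timeout.
class HeartbeatMonitor {
public:
    explicit HeartbeatMonitor(std::uint64_t timeoutTicks) : timeout_(timeoutTicks) {
        assert(timeoutTicks > 0);
    }

    // Start watching a capsule; it counts as alive at registration time.
    void watch(const std::string& capsule, std::uint64_t now) {
        lastSeen_[capsule] = now;
    }

    // Call when a reply to a probe arrives; unknown capsules are ignored.
    void recordReply(const std::string& capsule, std::uint64_t now) {
        auto it = lastSeen_.find(capsule);
        if (it != lastSeen_.end()) {
            it->second = now;
        }
    }

    // Call from the supervisor's timer handling: returns the capsules that
    // have not replied for more than timeout_ ticks (sorted by name).
    // Assumes `now` is never smaller than a previously recorded time.
    std::vector<std::string> overdue(std::uint64_t now) const {
        std::vector<std::string> result;
        for (const auto& entry : lastSeen_) {
            if (now - entry.second > timeout_) {
                result.push_back(entry.first);
            }
        }
        return result;
    }

private:
    std::uint64_t timeout_;
    std::map<std::string, std::uint64_t> lastSeen_;
};
```

In a real model the supervisor would presumably call `recordReply` from the transition triggered by the reply signal and `overdue` from the transition triggered by its timer. Note that silence only lets you presume failure: a slow capsule and a dead one look the same from the outside.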

There are differences between these kinds as well. In the bad state case, if a capsule ends up in such a state, the effects are limited to the capsule (although it may have behavioural consequences for other capsules too, of course). But in the catastrophic case, if a capsule crashes, say with a segfault, the thread crashes and the controller dies, bringing down with it all capsules managed by that controller. This raises a lot of issues about how to handle such a scenario.

Aside from detecting failure, the articles also mention several strategies that a supervisor could use to handle it, such as suspending, terminating, escalating, etc. Some of these strategies require certain constructs in the language or functionality in the RTS. For example, terminating a capsule must be done with the destroy operation, which is applicable only to optional parts. If the capsule in question lives in a fixed part, there is no operation that supports destroying that instance alone. So again, the burden of deciding which capsules are allowed to fail falls on the modeller.
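As a rough illustration of how the optional/fixed distinction constrains the supervisor's choices, here is a small self-contained sketch. It does not use the Papyrus-RT runtime at all: `Part`, `PartKind`, `Capsule`, `Action` and `handleFailure` are hypothetical stand-ins, and `replaceInstance` merely mimics a destroy followed by a fresh incarnation.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins; none of these are Papyrus-RT runtime types.
enum class PartKind { Fixed, Optional };
enum class Action { Restarted, Escalated };

struct Capsule {
    int incarnation = 0;  // how many times the instance has been replaced
};

class Part {
public:
    Part(std::string name, PartKind kind)
        : name_(std::move(name)), kind_(kind), instance_(std::make_unique<Capsule>()) {
        assert(!name_.empty());
    }

    const std::string& name() const { return name_; }
    int incarnation() const { return instance_->incarnation; }

    // Mimics "destroy, then incarnate again". Only optional parts allow it;
    // for a fixed part nothing happens and false is returned.
    bool replaceInstance() {
        if (kind_ != PartKind::Optional) {
            return false;
        }
        const int next = instance_->incarnation + 1;
        instance_ = std::make_unique<Capsule>();  // old instance is destroyed here
        instance_->incarnation = next;
        return true;
    }

private:
    std::string name_;
    PartKind kind_;
    std::unique_ptr<Capsule> instance_;
};

// Supervisor policy: restart what can be restarted, escalate everything else.
Action handleFailure(Part& part) {
    return part.replaceInstance() ? Action::Restarted : Action::Escalated;
}
```

With this policy, a failing capsule in an optional part is replaced in place, while a failure in a fixed part can only be passed up to whoever owns the supervisor. That is the design decision the modeller has to make up front: anything that should be allowed to crash and restart needs to live in an optional part.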

In summary, I think that the core language may provide the ground to develop fault-tolerant systems, but it does not provide full support for this, and a lot of the responsibility falls on the modeller. To support fault-tolerance, some things would need to be done to the RTS and maybe the language itself.

Re: UML-RT/Papyrus-RT and the "Let it Crash" approach to handling faults [message #1777891 is a reply to message #1777823]

Tue, 05 December 2017 11:18

Eclipse User

Hm, your response sounds a little too negative for my taste.

Overall, it does appear that when a certain 'pattern' is followed (i.e., certain UML-RT constructs are used or not used and the model is designed in a certain way), meaningful support of the 'Let it Crash' strategy can be provided. This holds in particular when a 'crash' is understood more as an error or fault situation than as, e.g., a segfault during capsule execution. But even these 'catastrophic' cases could be handled if the code for a failing capsule runs in its own process (so deploying the system as a distributed application, as proposed by Karim's work, would help).

Juergen

Re: UML-RT/Papyrus-RT and the "Let it Crash" approach to handling faults [message #1777895 is a reply to message #1777891]

Tue, 05 December 2017 12:11

Eclipse User

I'm not trying to sound negative ;) I just want to make the point that the language and its implementation do not have built-in support for this. Of course, if you follow certain patterns you can create fault-tolerant systems, very much in the same way that you could create a fault-tolerant system in, say, Java or C++ even though they were not designed for that. It means that the burden is on the modeller to use the appropriate patterns, while the language and runtime provide only limited support. In the UML-RT case, I would imagine that such patterns would involve putting the capsules under observation in optional parts that could be destroyed, so UML-RT does provide a concept and constructs to do that: creating and destroying capsule instances. But what it doesn't provide is a mechanism to detect failure or crashing.

I think this is potentially a very interesting area for improvement (and possibly some research projects). If we limit ourselves to the "bad state" case I described in my previous message, I see at least one possible way to deal with that, at least for individual capsules: you may have noticed that if you send a message to a capsule and it is not in a state ready to accept it, you will get an error message in the console. If you inspect the generated code, you will see that basically the default case of the switch statement for the state will invoke "this->unexpectedMessage();". Nothing prevents you from overriding the "unexpectedMessage" method in the capsule, so you could override it with actions that, for example, send a message either to the original sender or to all the capsule's neighbours, or to a designated supervisor. This could model the "I failed" case: a supervisor can have a timer and every now and then probe the capsule to see if it gets back the "I failed" message. Alternatively, you could mark some states explicitly as "failure" states, and either the modeller, or a code generator extension could send an "I failed" message when reaching one of those states. This way, a supervisor could detect "bad state" (non-catastrophic) failure, similar to an exception, and then it could decide and apply some strategy like destroy and replace.
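Here is a minimal sketch of the override idea, written as a standalone simulation rather than against the real runtime. `CapsuleBase`, `Worker`, `inject` and `FailureSink` are hypothetical names, and the dispatch code is only a simplified imitation of what the generator produces; the one thing taken from the generated code is that unhandled messages end up in a call to `unexpectedMessage()`.

```cpp
#include <cassert>
#include <functional>
#include <iostream>
#include <string>
#include <utility>

// Simplified stand-in for a capsule base class; the real runtime class differs.
class CapsuleBase {
public:
    virtual ~CapsuleBase() = default;

protected:
    // Default reaction: only report the problem on the console.
    virtual void unexpectedMessage() {
        std::cerr << "unexpected message\n";
    }
};

class Worker : public CapsuleBase {
public:
    // Hypothetical channel to the supervisor; in a model this would be a port.
    using FailureSink = std::function<void(const std::string&)>;

    explicit Worker(FailureSink sink) : notifySupervisor_(std::move(sink)) {
        assert(static_cast<bool>(notifySupervisor_));
    }

    // Imitates the generated dispatch: a switch on the current state, with
    // everything that is not handled falling through to unexpectedMessage().
    void inject(const std::string& signal) {
        switch (state_) {
        case State::Idle:
            if (signal == "start") { state_ = State::Busy; return; }
            break;
        case State::Busy:
            if (signal == "done") { state_ = State::Idle; return; }
            break;
        }
        unexpectedMessage();
    }

protected:
    // The override: instead of just logging, tell the supervisor "I failed".
    void unexpectedMessage() override {
        notifySupervisor_("I failed");
    }

private:
    enum class State { Idle, Busy };
    State state_ = State::Idle;
    FailureSink notifySupervisor_;
};
```

A supervisor could then be wired in with something like `Worker w([&](const std::string& msg) { /* decide: destroy and replace, escalate, ... */ });`. In an actual model the notification would go out through a port to the supervisor capsule instead of a callback.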

So in summary, the language does provide some means that can help with this, but some of the machinery is not built-in and would require either effort on the part of the modeller, or improvements to the language, the code-generator and/or the runtime.

[Updated on: Tue, 05 December 2017 12:15] by Moderator
