[ecf-dev] How to deal with recovering an ecf remote service connection
I am working on the stability of the remote services network for our live escape game (a bomb defusing simulation) and I have some general questions on how to deal with recovering an ECF remote service connection.
Let me explain what I have done so far and where I am at the moment. I’ll start with the setup:
- Equinox runtime
- ECF 3.12.0
- Distribution Provider: generic
- Discovery Provider: zeroconf
I use remote services in a pretty basic way:
- No dependencies on other services
- Only sync calls with small payloads
- Tracking of services with a ServiceTracker (no Declarative Services)
The system consists of gadgets that run on Raspberry Pis (Model B+ or 2) and a desktop application for the game operator that shows all the sensor states in the network. The operator application uses whiteboard services that get informed when the state of a sensor changes. There is also a game timer (the bomb timer) that notifies the operator (and all other interested whiteboard services) when the countdown changes. Additionally, the gadgets host remote services that let the operator control them remotely if necessary (e.g. trigger an actuator, add some time to the countdown, etc.).
I have tested that the tracking of services works smoothly. You can stop any application and the services of that host will be unregistered correctly. If you restart the application, the services will be registered again and the whiteboard services will be picked up as expected. You can also just kill a device (cut the power of a Pi), which has the same result.
We have been using this system on site for a couple of weeks now, and I got some reports of gadgets dropping out of the network for no apparent reason after the system has been running for a longer time, or not showing up on startup. This is pretty bad for the operator, as he usually has to reboot the whole system and try again to recover from this situation. I suspect that this has to do with the network situation on site, as they use pretty cheap equipment and there is also a lot of network traffic going on (they also have LAN cameras and the like). So I assume that the network connection is not very stable.
I did some more tests on this over the weekend. I tried to simulate the situation by temporarily removing the network connection (pulling the LAN cable). This was pretty interesting, as it led to some problems at first. I did two things to remedy this:
1) I am now using a different Thread (Executor) on the gadgets to do the actual remote calls, so that the application thread is not directly affected.
2) I set ecf.remotecall.timeout to a pretty short value (100 ms). I'm not sure why, but using a longer value (3000 ms) led to a situation where the application did not recover very well when a lot of remote calls were waiting for the timeout at the same time (tested on a Raspberry Pi 2).
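To illustrate what I mean by 1), here is a simplified sketch in plain Java. It is not the actual gadget code: the class name RemoteCallDispatcher, its methods and the queue size are made up for this mail, and the Runnable stands in for a call on the imported remote service proxy. I am also assuming here that a failed or timed-out call surfaces as a RuntimeException. The idea is that the application thread only queues the call and returns immediately, and that the queue is bounded so that calls cannot pile up without limit while the network is down.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: runs remote calls on a single worker thread. */
class RemoteCallDispatcher {

    private final ThreadPoolExecutor executor;
    private final AtomicInteger failures = new AtomicInteger();

    RemoteCallDispatcher(int queueCapacity) {
        // One worker thread and a bounded queue. When the queue is full,
        // the oldest waiting call is dropped in favour of the newest one.
        executor = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueCapacity),
                new ThreadPoolExecutor.DiscardOldestPolicy());
    }

    /** Queues the call and returns immediately; the caller never waits for the network. */
    void dispatch(Runnable remoteCall) {
        executor.execute(() -> {
            try {
                remoteCall.run();
            } catch (RuntimeException e) {
                // Assumption: a failed or timed-out call ends up here.
                // Count it and keep the worker thread alive for the next call.
                failures.incrementAndGet();
            }
        });
    }

    int failureCount() {
        return failures.get();
    }

    /** Stops accepting new calls and waits for the queued ones to finish. */
    boolean shutdownAndWait(long timeoutMillis) throws InterruptedException {
        executor.shutdown();
        return executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

Dropping the oldest queued call is a choice that fits state notifications, where only the latest value matters; for commands that must not be lost, a different rejection policy would be needed. How long a single call blocks the worker is still governed by ecf.remotecall.timeout.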
With these two changes the connection stays stable, even when I "pull the plug" for a couple of seconds. As soon as I reconnect, the connection resumes nicely. If the cable stays unplugged for more than 30 seconds, the connection breaks down, which makes sense as this is the default keepalive value (if I understand this property correctly). In this case the services are unregistered, which is picked up by the ServiceTracker. But after this, there is no recovery when I reconnect the device to the network, and I have to restart the application.
And here come the questions I have at the moment:
Is this the behavior you would expect?
If yes, is there a way to remotely trigger a “re-discovery” (I know the IP address of the device I am looking for)?
Or would you expect the discovery mechanism to find the reconnected host and import its services again?
Apart from this, if you have any tips on what to look for or how to simulate a “bad network”, they would be highly appreciated.
Thanks for the support!