An update on this.
I have rebased my POC on the current 1.4 branch, as of today's commits. I'm only running locally on my Mac at this point.
Retained storage is handled solely by Cassandra in the POC, although in the real world Cassandra should probably handle only subtrees specified by configuration: "global" retained storage.
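To illustrate the "configured subtrees" idea, here is a minimal sketch of how a topic could be tested against a list of global subtrees. The names GLOBAL_SUBTREES and is_global_topic, and the example subtree values, are made up for illustration and are not part of the POC.

```python
# Hypothetical configuration: topic subtrees whose retained messages
# would be handed to Cassandra instead of staying local to the broker.
GLOBAL_SUBTREES = ["global", "fleet/status"]


def is_global_topic(topic, subtrees=GLOBAL_SUBTREES):
    """Return True if topic is the root of, or lies below, a configured subtree."""
    levels = topic.split("/")
    for subtree in subtrees:
        prefix = subtree.split("/")
        # Compare whole topic levels so "globalx/news" does not match "global".
        if levels[:len(prefix)] == prefix:
            return True
    return False
```

Comparing level by level rather than with a plain string prefix avoids treating "globalx/news" as part of the "global" subtree.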
It passes all broker tests to the same degree as "vanilla" 1.4 on my Mac (no TLS) except for those involving "bridge": 06-bridge-*.py and 08-ssl-bridge.py. Clearly I don't understand the bridge code well enough yet and need to study the code/docs more to fix this.
The mosquitto<->retain_server interaction is fully asynchronous using ZMQ. I augmented the IO loop to handle "retain inserts" from the retain_server in response to asynchronous subscriptions from mosquitto. Storing a retained message in mosquitto pushes it to the retain_server and clears the retain bit in mosquitto. The real world implementation will need a bit more stuff flying back and forth, e.g. heartbeats.
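As a rough sketch of the retain_server side of that exchange: a store receives pushed retained messages and, when a subscription arrives, returns the matching messages that would be sent back as "retain inserts". This is an in-memory stand-in only; the ZMQ transport and Cassandra are left out, and RetainStore, store, on_subscribe and topic_matches are hypothetical names, not the POC's actual code.

```python
def topic_matches(filter_, topic):
    """Match a topic against an MQTT topic filter with '+' and '#' wildcards."""
    f_levels = filter_.split("/")
    t_levels = topic.split("/")
    # A filter starting with a wildcard must not match '$'-prefixed topics.
    if topic.startswith("$") and f_levels[0] in ("+", "#"):
        return False
    for i, f in enumerate(f_levels):
        if f == "#":
            return True  # '#' matches the parent level and everything below it
        if i >= len(t_levels):
            return False
        if f != "+" and f != t_levels[i]:
            return False
    return len(f_levels) == len(t_levels)


class RetainStore:
    """In-memory stand-in for the retained message store."""

    def __init__(self):
        self._messages = {}  # topic -> payload (bytes)

    def store(self, topic, payload):
        """Handle a retained message pushed from the broker."""
        if payload == b"":
            # A zero-length retained payload clears the retained message.
            self._messages.pop(topic, None)
        else:
            self._messages[topic] = payload

    def on_subscribe(self, filter_):
        """Return the (topic, payload) pairs to send back for a subscription."""
        return sorted(
            (t, p) for t, p in self._messages.items() if topic_matches(filter_, t)
        )
```

In the asynchronous setup described above, the result of on_subscribe would be sent back to the broker as separate messages and picked up by the IO loop, rather than returned directly to a caller.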
It's not much code: maybe 50 lines of mods to the existing source plus a few hundred lines for the retain_server glue module. I did have to modify mosq_test.py and some tests to start/stop/clear the retain_server as needed. The Python retain_server is more complicated than the mosquitto mods due to query planning. There is more grunt work to be done there to generalize, modularize and parallelize, but the effort is straightforward and the algorithms are now proven.
I can see that this is possible now, so we are re-imagining our current nytfabrik, which is based on custom WebSocket gateways and message protocols. The current design is actually very simple: we autoscale gateways in multiple AWS regions, anywhere between 10 and 100 of them. They are all connected to a global data store and a global message bus. I hope to do an MQTT version to run in parallel and compare.
I'm unsure about timing from this point, but the next step will probably be to involve others on the team and move to a dev environment in AWS.
ml