Home » Modeling » Compare » On suboptimal decisions over the match solution spectrum
On suboptimal decisions over the match solution spectrum [message #1071171] 
Fri, 19 July 2013 13:45 
Victor Roldan Betancort Messages: 524 Registered: July 2009 
Senior Member 


Hi folks,
been lately working on improving the accuracy of the EMF Compare proximity engine. I've found the current implementation falls into suboptimal solutions at the matching stage. My point is to discuss why I think this is happening and if the EMF Compare team agrees it could happen. I'd be happy to be proven wrong, but I've debugged several scenarios in our testcases that proves this is actually happening.
When choosing the best match for a given EObject, the proximity engine matches a pair with the shortest distance, where distance is a numeric value that determines how similar are two objects. This can be found at ProximityIndex.findTheClosest(Comparison, EObject, Side, Side, boolean).
This approach leads to a suboptimal set of matches, because the distance is optimized locally (per EObject) instead of globally (the smallest overall distance for a set of match decisions).
I'll try to describe an example, we have 3 elements of the same type A, B and C, and their counterparts in the "other side" A', B' and C'. I describe the combination of matches and their distances. The less distance points, the better.
A > A' = 8 points
A > B' = 5 points
A > C' = 10 points
B > A' = 10 points
B > B' = 1 points
B > C' = 7 points
C > A' = 7 points
C > B' = 5 points
C > C' = 3 points
Currently the engine would explore the match solution space by iterating all combinations in an arbitrary order (I believe in order of the EObjects in their containment feature).
If we choose in order, we would be lead to the following match solution:
A > B'
B > C'
C > A'
which gives a global distance of 5 + 7 + 7 = 19. Since A matches B', B' gets removed from the match space, so B can only match A' or C'. Finally C matches A'.
This is a simplification of what actually happens. EMF Compare introduces the "double check" concept, which basically does is "if B' is the best match for A, check if A is the best match for B'". Matches are not commutative: if B' is the best match for A, it does not imply A is the best match for B'. I wont go through the result of the double check logic in the ProximityIndex class for the above example scenario, but I can say it would actually lead to an even worst results: the engine will decide A has no proper match in the solution space because no double check succeeds.
If instead we choose the following match combination:
A > A'
B > B'
C > C'
we would obtain a global distance of 8 + 1 + 3 = 12. This is actually the correct match combination and also the lowest global distance, which means it is the optimal match solution over all possible combinations.
Hope I've been clear. I think the implementation of a global distance optimization algorithm would improve the engine accuracy, but is not trivial. A combinatory exploration algorithm, searching for the best option, is not a feasible approach since it would be computationally unacceptable (here we have 3 x 3 match combinations, think of a match space with 1.000.000 elements). Therefore, some optimization heuristic would be more desirable, also having the problem that it may lead to suboptimal decisions, but would execute faster.
Do you guys think the above conclusion is logical / coherent, or did I make some wrong assumption? Granted, were you aware of this situation? Any plans on this regard in case you were aware?
Best Regards,
Víctor Roldán [Open Canarias]
[Updated on: Fri, 19 July 2013 14:50] Report message to a moderator


  
Goto Forum:
Current Time: Wed Dec 07 16:34:15 GMT 2016
Powered by FUDForum. Page generated in 0.03200 seconds
