Eclipse Community Forums
Improving threshold usage on EditionDistance.distance(Comparison inProgress, EObject a, EObject b) [message #1066439] Tue, 02 July 2013 16:30
Victor Roldan Betancort
Hi again,

here I go again with yet another shareable customization.

While developing our EMF Compare 2.1.0-based application, we focused mainly on improving the results returned by the engine. When no identifier-based match is possible, the ProximityMatcher comes into play (provided delegation is enabled).

The ProximityMatcher ends up using the EditionDistance class, which is what actually measures the difference between two EObjects. Measuring difference is quite a complex task, and the default implementation does not always give good results (not blaming anyone; as I said, it's quite a complex task).

During our analysis, we found that a simple restriction in the EditionDistance.distance(Comparison, EObject, EObject) method was leading to many unmatched EObjects:

public double distance(Comparison inProgress, EObject a, EObject b) {
  this.uriDistance.setComparison(inProgress);
  double maxDist = Math.max(getThresholdAmount(a), getThresholdAmount(b));
  double measuredDist = new CountingDiffEngine(maxDist, this.fakeComparison).measureDifferences(inProgress, a, b);
  if (measuredDist > maxDist) {
    return Double.MAX_VALUE;
  }
  return measuredDist;
}


The last restriction:
  if (measuredDist > maxDist) {
    return Double.MAX_VALUE;
  }


was the cause of our problems. What does that restriction mean? As far as I understand, in simple terms it says "if the match is not good enough, discard it". It returns a huge distance value whenever the computed distance exceeds a pre-calculated threshold.

Once we deactivated that restriction, all comparisons became noticeably more accurate. The possible reason? Our models under comparison may have characteristics that lead to a low threshold (maxDist), which in turn restricts the set of possible matches. By removing the restriction we allow the engine to be more lenient with the matches.

I found this through experimentation, and I can't really say whether it generalizes. What I do know is that the engine always selects the lowest-distance match. So why discard matches that go beyond a threshold? Maybe there is nothing better available.
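
To make the argument concrete, here is a minimal, self-contained sketch of the effect described above. It does not use EMF Compare; the class, the method names, the candidate names and the distance values are all hypothetical. It only assumes what the post states: distances above the threshold are turned into Double.MAX_VALUE, and the engine keeps the lowest distance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ThresholdGateSketch {

    /** Mirrors the restriction under discussion: a distance above the threshold becomes "no match". */
    static double gated(double measuredDist, double maxDist) {
        return measuredDist > maxDist ? Double.MAX_VALUE : measuredDist;
    }

    /** Returns the candidate with the lowest distance, or null when every candidate is rejected. */
    static String closest(Map<String, Double> measured, double maxDist, boolean applyGate) {
        String best = null;
        double bestDist = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : measured.entrySet()) {
            double d = applyGate ? gated(e.getValue(), maxDist) : e.getValue();
            if (d < bestDist) {
                bestDist = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical measured distances from one object to two candidates.
        Map<String, Double> measured = new LinkedHashMap<>();
        measured.put("candidateA", 12.0);
        measured.put("candidateB", 9.0);
        double maxDist = 8.0; // a low threshold, made up for this example

        // With the restriction both candidates are rejected, so nothing is matched.
        System.out.println(closest(measured, maxDist, true));
        // Without it, the lowest distance (candidateB) wins.
        System.out.println(closest(measured, maxDist, false));
    }
}
```

The sketch shows only one side of the trade-off: without the restriction, candidateB would also be picked if its distance were 900 instead of 9.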

I guess the same results could also be achieved by increasing the threshold. But it is the shape of the EClass that determines the threshold, and (unfortunately?) the current code looks pretty empirical (see the ReflectiveWeightProvider class).

It's hard to suggest a change in code that works "statistically well enough". But given the behavior described above (the engine always takes the lowest distance), the change looks harmless to me, and potentially convenient for clients whose models have shapes that lead to low thresholds (essentially, models with a low EStructuralFeature count, where most features are EReferences and only a few are EAttributes).

Again, I hope I was clear enough. I'd like to get rid of third-party code customizations whenever possible.

Best Regards,
Víctor.
Re: Improving threshold usage on EditionDistance.distance(Comparison inProgress, EObject a, EObject [message #1066566 is a reply to message #1066439] Wed, 03 July 2013 08:07
Cedric Brun
Hi Victor, and thanks for your feedback, it's really valuable to us.

You're right when you say these "tunings" are set empirically. They are based on the test data and model instances we have in our tests: these coefficients and thresholds give the same results as EMF Compare 1 *on our dataset*.

I'll explain a bit more how it works:

The WeightProvider returns a weight for each feature. This allows us to express things like:
a feature named "name" or "id" is likely to be important when matching.

org.eclipse.emf.compare.match.eobject.internal.ReflectiveWeightProvider.getWeight(EStructuralFeature)


The threshold is there to express: "beyond this point, the objects are so different that I don't want them to be considered a match." Why is this needed? Without such a threshold, when you add and remove elements EMF Compare will always match elements even if they have very little in common, and will then of course detect many differences to transform one element into the other. This is very hard for the end user to interpret.

How is this threshold calculated in the current codebase?
We go through every feature which is set, get its weight, and compute what we could call the "maximum distance possible for those two EObjects".

Then we apply a ratio to this, depending on the number of features which are set (see the thresholds array). We've seen that if only 2 features are set, we want the threshold to be quite high (0.6 * maxdistance), and it gets progressively lower, down to 0.465.

This could be roughly rephrased as: "if my object has only 2 features set and has less than 40% in common with the other one, I don't want them to match."
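
As a rough illustration of the calculation just described, here is a standalone sketch. It is not the EditionDistance code: the class and method names are made up, the feature weights are hypothetical, and only the 0.6 ratio (two set features) and the 0.465 lower bound come from the explanation above. The other entries of the ratio table are placeholders.

```java
public class ThresholdAmountSketch {

    // Ratio applied to the maximum distance, indexed by the number of set features.
    // Only 0.6 (two set features) and 0.465 (lowest value) are taken from the thread;
    // the remaining entries are illustrative placeholders.
    static final double[] RATIOS = {0.0, 0.6, 0.6, 0.55, 0.465};

    /** Looks up the ratio; counts beyond the table reuse the last (lowest) entry. */
    static double thresholdRatio(int setFeatureCount) {
        int index = Math.min(setFeatureCount, RATIOS.length - 1);
        return RATIOS[index];
    }

    /** Takes one (hypothetical) weight per feature that is set on the object. */
    static double thresholdAmount(double[] weightsOfSetFeatures) {
        double maxDistance = 0;
        for (double w : weightsOfSetFeatures) {
            maxDistance += w; // the "maximum distance possible" is the sum of the weights
        }
        return maxDistance * thresholdRatio(weightsOfSetFeatures.length);
    }

    public static void main(String[] args) {
        // Two set features with made-up weights 20 and 10: maximum distance 30, ratio 0.6.
        double threshold = thresholdAmount(new double[] {20, 10});
        System.out.println(threshold); // about 18

        // A measured distance of 20 exceeds that threshold, so the pair would be rejected.
        System.out.println(20 > threshold);
    }
}
```

With these numbers, the two objects must differ by at most 18 out of a possible 30, which is the "at least 40% in common" reading given above.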

Using Compare 2.x on other use cases, I noticed that it tends to "match less" than Compare 1.x. We need more tuning, and for that we need more data. If you can give us the thresholds that would work best for your case, that would help a lot. I would start by checking through trace/debug how many features are generally set when you enter the "if (measuredDist > maxDist)" condition, then tweak the org.eclipse.emf.compare.match.eobject.EditionDistance.thresholds table, adding specific thresholds for these cases.


Now for your actual question:
If you want to bypass this threshold with a minimum of code change, you can subclass EditionDistance and override org.eclipse.emf.compare.match.eobject.EditionDistance.getThresholdAmount(EObject) to return Double.MAX_VALUE.
The threshold will then have no effect, and EMF Compare will always find a match if there is another object of the same type.
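
The shape of that customization can be sketched without EMF on the classpath. In the block below, StandInDistance is a hypothetical stand-in, not the real EditionDistance: it works on strings, its measure() and default threshold are invented, and only the structure of distance() follows the method quoted at the top of the thread. A real subclass would extend org.eclipse.emf.compare.match.eobject.EditionDistance and override getThresholdAmount(EObject) as suggested above.

```java
public class NoThresholdSketch {

    /** Hypothetical stand-in for the distance function; NOT the real EditionDistance. */
    static class StandInDistance {

        /** Invented default: half of the string length. */
        protected double getThresholdAmount(String o) {
            return o.length() * 0.5;
        }

        /** Invented measure: differing characters plus the length difference. */
        protected double measure(String a, String b) {
            int common = Math.min(a.length(), b.length());
            int diffs = Math.abs(a.length() - b.length());
            for (int i = 0; i < common; i++) {
                if (a.charAt(i) != b.charAt(i)) {
                    diffs++;
                }
            }
            return diffs;
        }

        /** Same structure as the distance() method quoted in the first post. */
        public double distance(String a, String b) {
            double maxDist = Math.max(getThresholdAmount(a), getThresholdAmount(b));
            double measuredDist = measure(a, b);
            return measuredDist > maxDist ? Double.MAX_VALUE : measuredDist;
        }
    }

    /** The suggested customization: a threshold that can never be exceeded. */
    static class NoThresholdDistance extends StandInDistance {
        @Override
        protected double getThresholdAmount(String o) {
            return Double.MAX_VALUE;
        }
    }

    public static void main(String[] args) {
        // All four characters differ; the default threshold for length 4 is 2.
        System.out.println(new StandInDistance().distance("abcd", "wxyz") == Double.MAX_VALUE); // true
        System.out.println(new NoThresholdDistance().distance("abcd", "wxyz")); // 4.0
    }
}
```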

Re: Improving threshold usage on EditionDistance.distance(Comparison inProgress, EObject a, EObject [message #1066619 is a reply to message #1066566] Wed, 03 July 2013 13:22
Victor Roldan Betancort
Hi Cedric,

thank you very much for such a clear and detailed explanation! It gives me better insight into the distance/weight calculation mechanisms.

I understand that the restriction cannot simply be removed, as that may lead to matching everything. I also see that it's NOT trivial to determine when two EObjects are so different that they cannot be considered the same EObject.

I'll invest some time in trying to tune the threshold array better for our use case. That said, it looks to me like this fine-tuning could be a recurrent customization for many clients, so it may be desirable to design the class so that the threshold array can be customized properly. Maybe users need to specialize getThresholdRatio() (instead of customizing getThresholdAmount()), so it could be promoted to a protected method. What do you think?

Secondly, I have already implemented your last suggestion and my tests seem to pass. I fear the comparison may take a bit longer, as maxDist is passed to CountingDiffEngine:

new CountingDiffEngine(maxDist, this.fakeComparison).measureDifferences(inProgress, a, b);

which (correct me if I'm wrong) is passed basically to avoid further computation once the measured distance exceeds the maxDist value.

Anyway, my actual point is the following: to pass an EditionDistance subclass, I need to initialize my code as follows:

final EditionDistance editionDistance = new CustomEditionDistance();
final CachingDistance cachedDistance = new CachingDistance(editionDistance);
IEObjectMatcher fallBackMatcher = new ProximityEObjectMatcher(cachedDistance);
IdentifierEObjectMatcher customIDMatcher = new IdentifierEObjectMatcher(fallBackMatcher, createIdentificationFunction());

The above code is specific to our usage scenario, but the issue here is that CachingDistance is internal. It seems generic enough to be promoted to public, especially since EditionDistance is public.

If you agree with the proposed changes, I can create two bugzillas and commit them to gerrit (now that Laurent has very kindly explained to me how to do it).

Again, thanks for your feedback!
Víctor Roldán Betancort
Re: Improving threshold usage on EditionDistance.distance(Comparison inProgress, EObject a, EObject [message #1066797 is a reply to message #1066619] Thu, 04 July 2013 12:39
Cedric Brun

> Maybe users need to specialize getThresholdRatio() (instead of customizing getThresholdAmount()), so it could be promoted to a protected method. What do you think?

Sure, it might be useful to only override this, and CachingDistance should be made public, you're right.

Please go ahead, open the bugzilla and submit the changes to gerrit.

>is basically passed to avoid computations in case the measures are bigger than maxDist value.

That's right, it is passed so that we stop comparing the objects once we go above the matching limit, since we know we won't consider them the same anyway.
But in the end, that's what you are trying to achieve, isn't it? Trying to match as hard as possible? In your customization, if you just remove the "if (measuredDist > maxDist) {" check without changing maxDist in the first place, the counting diff engine will stop anyway. If you do not raise maxDist as well, several objects could end up with the same (or very close) computed distance simply because we stopped computing, and the wrong one could be matched.
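
The tie problem can be illustrated with a toy counter. This is only a sketch under a stated assumption: the thread does not say what the real CountingDiffEngine returns when it stops early, so the version below simply returns the partial sum reached at the moment it stops. The class name, the method and the per-feature difference values are all hypothetical.

```java
public class EarlyStopSketch {

    /**
     * Adds up per-feature differences and stops as soon as the running total exceeds maxDist.
     * ASSUMPTION: the partial total is returned on early stop; the real engine may differ.
     */
    static double measure(double[] featureDiffs, double maxDist) {
        double total = 0;
        for (double diff : featureDiffs) {
            total += diff;
            if (total > maxDist) {
                break; // no point in counting further once the limit is exceeded
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[] nearCandidate = {5, 4, 1};     // full distance: 10
        double[] farCandidate = {5, 4, 7, 14};  // full distance: 30

        // With a low limit, both measurements stop after the second feature and tie at 9.
        System.out.println(measure(nearCandidate, 8) + " vs " + measure(farCandidate, 8));

        // With the limit raised, the counter runs to the end and the candidates are told apart.
        System.out.println(measure(nearCandidate, Double.MAX_VALUE)
                + " vs " + measure(farCandidate, Double.MAX_VALUE));
    }
}
```

Under this assumption, removing only the final check leaves the two candidates indistinguishable, whereas returning Double.MAX_VALUE from getThresholdAmount also lifts the limit passed to the counter.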

Re: Improving threshold usage on EditionDistance.distance(Comparison inProgress, EObject a, EObject [message #1066819 is a reply to message #1066797] Thu, 04 July 2013 14:05
Victor Roldan Betancort
Quote:

Sure, it might be useful to only override this, and CachingDistance should be made public, you're right.
Please go ahead, open the bugzilla and submit the changes to gerrit.


Created:
Bug 412315 - Increase visibility of EditionDistance.getThresholdRatio(int) to protected https://bugs.eclipse.org/bugs/show_bug.cgi?id=412315
Bug 412316 - Make CachingDistance public https://bugs.eclipse.org/bugs/show_bug.cgi?id=412316

Quote:

That's right, it is passed so that we stop comparing the objects once we go above the matching limit, since we know we won't consider them the same anyway.
But in the end, that's what you are trying to achieve, isn't it? Trying to match as hard as possible? In your customization, if you just remove the "if (measuredDist > maxDist) {" check without changing maxDist in the first place, the counting diff engine will stop anyway. If you do not raise maxDist as well, several objects could end up with the same (or very close) computed distance simply because we stopped computing, and the wrong one could be matched.


I see. So what is the measuredDist returned by CountingDiffEngine? If it stops, is the value always maxDist? Or does it return the actual distance between the objects?
I wonder how my use cases worked properly then... pure luck?


Thanks Cedric!
Víctor.