Eclipse Community Forums: Compare » Improving threshold usage on EditionDistance.distance(Comparison inProgress, EObject a, EObject b)


Improving threshold usage on EditionDistance.distance(Comparison inProgress, EObject a, EObject b) [message #1066439]

Tue, 02 July 2013 16:30

Victor Roldan Betancort


Hi again,

here I go again with yet another shareable customization :P

During development of our EMF Compare 2.1.0-based application, our effort mainly focused on improving the results returned by the engine. When no identifier-based match is possible, the ProximityMatcher comes into play (if delegation is enabled).

The ProximityMatcher ends up using the EditionDistance class, which actually measures the difference between two EObjects. Measuring difference is quite a complex task, and the default implementation does not always give good results (not blaming anyone; as I said, it's quite a complex task :P).

During our analysis, we found that a simple restriction in the EditionDistance.distance(Comparison, EObject, EObject) method was leading to many unmatched EObjects:

public double distance(Comparison inProgress, EObject a, EObject b) {
  this.uriDistance.setComparison(inProgress);
  double maxDist = Math.max(getThresholdAmount(a), getThresholdAmount(b));
  double measuredDist = new CountingDiffEngine(maxDist, this.fakeComparison).measureDifferences(inProgress, a, b);
  if (measuredDist > maxDist) {
    return Double.MAX_VALUE;
  }
  return measuredDist;
}

The last restriction:

  if (measuredDist > maxDist) {
    return Double.MAX_VALUE;
  }

was the cause of our problems. What does that restriction mean? From what I understand, in simple terms it says "if the match is not good enough, discard it": it returns a huge distance value when the computed distance exceeds a pre-calculated threshold.

Once we deactivated that restriction, all comparisons magically became way more accurate. The possible reason? Well, our models under comparison may have some characteristics that lead to a low threshold (maxDist), which in turn leaves only a reduced subset of matches possible. By removing that restriction we allow the engine to be more "relaxed" or "lax" with the matches.

I found this through experimentation, and I can't really say whether it generalizes. But what I do know is that the engine always selects the lowest-distance match. So why discard matches that go beyond a threshold? Maybe you have nothing better than that.
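To make the effect of the cut-off concrete, here is a minimal, self-contained sketch. It is a toy model, not EMF Compare code: the class `CutOffSketch`, its methods, the candidate names, and the distances are all invented for illustration. It only mirrors the shape of the `measuredDist > maxDist` check and a "pick the lowest distance" selection.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy illustration only; none of these names exist in EMF Compare. */
final class CutOffSketch {

    /** Same shape as the check discussed above: anything beyond maxDist becomes "infinitely far". */
    static double applyCutOff(double measuredDist, double maxDist) {
        return measuredDist > maxDist ? Double.MAX_VALUE : measuredDist;
    }

    /** Returns the candidate with the lowest distance, or null if every candidate was cut off. */
    static String closest(Map<String, Double> measured, double maxDist) {
        String best = null;
        double bestDist = Double.MAX_VALUE;
        for (Map.Entry<String, Double> entry : measured.entrySet()) {
            double d = applyCutOff(entry.getValue(), maxDist);
            if (d < bestDist) {
                bestDist = d;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> measured = new LinkedHashMap<>();
        measured.put("candidateA", 12.0); // made-up distances
        measured.put("candidateB", 9.0);
        System.out.println(closest(measured, 5.0));              // low threshold: no match survives
        System.out.println(closest(measured, Double.MAX_VALUE)); // no effective threshold: closest wins
    }
}
```

With a low threshold every candidate is discarded and the object stays unmatched; without an effective threshold the closest candidate is chosen, however far away it is.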

I guess the same results could also be achieved by increasing the threshold. But it is the shape of the EClass that determines the threshold, and (unfortunately?) the current code does look pretty empirical (ReflectiveWeightProvider class).

It's hard to suggest a change in code that works "statistically well enough". But, given the behaviour described above (the engine always takes the lowest distance), the change looks harmless to me, and potentially convenient for clients whose models have shapes that lead to low thresholds (essentially, models with a low EStructuralFeature count, where most features are EReferences and just a few are EAttributes).

Again, I hope I was clear enough. I'd like to get rid of third-party code customizations whenever possible :P

Best Regards,
Víctor.


Re: Improving threshold usage on EditionDistance.distance(Comparison inProgress, EObject a, EObject [message #1066566 is a reply to message #1066439]

Wed, 03 July 2013 08:07

Cedric Brun


Hi Victor, and thanks for your feedback, it's really valuable to us.

You're right when you say these "tunings" are set empirically. They are based on the test data and model instances we have in our tests: these coefficients and thresholds give the same results as EMF Compare 1 *on our dataset*.

I'll explain a bit more how it works:

The WeightProvider returns a weight for each feature. This allows us to express things like:
a feature named "name" or "id" is likely to be important when matching.

org.eclipse.emf.compare.match.eobject.internal.ReflectiveWeightProvider.getWeight(EStructuralFeature)

The threshold is here to express: "beyond this point, the objects are so different that I don't want them to be considered matches." Why is this needed? Without such a threshold, when you add and remove elements EMF Compare will always match elements even if they have very little in common, and will then of course detect many differences to transform one element into the other. This is very hard for the end user to interpret.

How is this threshold calculated in the current codebase?
We go through every feature that is set, get its weight, and compute what we could call the "maximum distance possible for those two EObjects".

Then we apply a ratio to this, depending on the number of features that are set (see the thresholds array). We've seen that if only 2 features are set, we want the threshold to be quite high (0.6*maxdistance), and it gets progressively lower, down to 0.465.

This could be roughly rephrased as: "if my object has only 2 features set and has less than 40% in common with the other one, I don't want them to match."
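The calculation described above can be sketched as follows. This is a toy model under stated assumptions, not the actual EditionDistance code: the class `ThresholdSketch`, its methods, and the weights are hypothetical, and only the two ratios 0.6 and 0.465 are taken from this thread. The intermediate ratios, the table length, and the way the feature count indexes the table are invented.

```java
/** Toy model of the threshold computation described above; not EMF Compare code. */
final class ThresholdSketch {

    // Ratio per number of set features. Only 0.6 (few features) and 0.465 (many features)
    // are quoted in the thread; the values in between and the table size are invented.
    static final double[] RATIOS = {0.6, 0.6, 0.6, 0.55, 0.5, 0.465};

    /** Looks up the ratio for a feature count, clamping the count to the table bounds. */
    static double ratioFor(int setFeatureCount) {
        int index = Math.max(0, Math.min(setFeatureCount, RATIOS.length - 1));
        return RATIOS[index];
    }

    /** Sums the weights of the set features (the "maximum distance") and scales it by the ratio. */
    static double thresholdAmount(double[] weightsOfSetFeatures) {
        double maxDistance = 0;
        for (double weight : weightsOfSetFeatures) {
            maxDistance += weight;
        }
        return ratioFor(weightsOfSetFeatures.length) * maxDistance;
    }
}
```

For example, with two set features weighted 10 and 5 (made-up weights), the maximum distance is 15 and the threshold is 0.6 * 15 = 9, so under this model a measured distance above 9 would be rejected as a match.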

Using Compare 2.x on other use cases, I noticed it tends to "match less" than Compare 1.x. We need more tuning, and for that we need more data. If you can give us the threshold that would work best for your case, that would help a lot. I would start by checking through trace/debug how many features are set in general when you enter the "if (measuredDist > maxDist)" condition, then tweak the org.eclipse.emf.compare.match.eobject.EditionDistance.thresholds table, adding specific thresholds for these cases.

Now for your actual question:
If you want to bypass this threshold with a minimum of code change, you can subclass EditionDistance and override org.eclipse.emf.compare.match.eobject.EditionDistance.getThresholdAmount(EObject) to return Double.MAX_VALUE.
Then the threshold will have no effect and EMF Compare will always find a match if there is another object of the same type.

http://cedric.brun.io news and articles on eclipse and eclipse modeling.


Re: Improving threshold usage on EditionDistance.distance(Comparison inProgress, EObject a, EObject [message #1066619 is a reply to message #1066566]

Wed, 03 July 2013 13:22

Victor Roldan Betancort


Hi Cedric,

thank you very much for such a clear and detailed explanation! This gives me better insight into the distance/weight calculation mechanisms.

I understand that the restriction cannot simply be removed, as that may lead to matching everything. I also see that it's NOT trivial to determine when two EObjects are so different that they should not be considered the same EObject.

I'll invest some time trying to see if I can better tune the threshold array for our use case. Given that, it looks to me like this fine-tuning could be a recurrent customization for many clients, and therefore it may be desirable to have the class properly designed for customizing the threshold array. Maybe users need to specialize getThresholdRatio() (instead of customizing getThresholdAmount()), so it could be promoted to a protected method. What do you think?

Second, I already implemented your last suggestion and my tests seem to pass. I fear comparison may take a bit longer, as maxDist is passed to CountingDiffEngine:

new CountingDiffEngine(maxDist, this.fakeComparison).measureDifferences(inProgress, a, b);

which (correct me if I'm wrong) is basically passed to avoid computations in case the measures are bigger than maxDist value.

Anyway, my actual point is the following: to pass an EditionDistance subclass, I need to initialize my code as follows:

final EditionDistance editionDistance = new CustomEditionDistance();
final CachingDistance cachedDistance = new CachingDistance(editionDistance);
IEObjectMatcher fallBackMatcher = new ProximityEObjectMatcher(cachedDistance);
IdentifierEObjectMatcher customIDMatcher = new IdentifierEObjectMatcher(fallBackMatcher, createIdentificationFunction());

The above code is specific to our usage scenario, but the issue here is that CachingDistance is internal. It seems generic enough to be promoted to public, especially since EditionDistance is public.

If you agree with the proposed changes, I can create two bugzillas and commit them to gerrit (now that Laurent very kindly explained to me how to do it :P).

Again, thanks for your feedback!
Víctor Roldán Betancort


Re: Improving threshold usage on EditionDistance.distance(Comparison inProgress, EObject a, EObject [message #1066797 is a reply to message #1066619]

Thu, 04 July 2013 12:39

Cedric Brun


> . Maybe users need to specialize getThresholdRatio() (instead of customizing getThresholdAmount()), so it could be promoted to a protected method. What do you think?

Sure, it might be useful to only override this, and CachingDistance should be made public, you're right.

Please go ahead, open the bugzilla and submit the changes to gerrit.

>is basically passed to avoid computations in case the measures are bigger than maxDist value.

That's right, it is passed so that we stop comparing the objects once we go above the matching limit, since we know we won't consider them the same anyway.
But in the end, that's what you are trying to achieve, isn't it? Trying to match as hard as possible? In your customization, if you just remove the "if (measuredDist > maxDist) {" without changing maxDist in the first place, then the counting diff engine would stop anyway. If maxDist is left unchanged, you could end up with several objects having the same (or a very close) computed distance just because we stopped computing it, and then match the wrong one.
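The tie problem described above can be illustrated with a minimal sketch. It is an assumption-laden stand-in, not the real CountingDiffEngine: the class `EarlyStopSketch`, its `measure` method, and the per-difference costs are hypothetical, and what value the real engine returns after stopping is not stated in this thread.

```java
/** Hypothetical stand-in for a difference counter that gives up early; not EMF Compare code. */
final class EarlyStopSketch {

    /** Adds up per-difference costs, but stops as soon as the running total exceeds maxDist. */
    static double measure(double[] diffCosts, double maxDist) {
        double total = 0;
        for (double cost : diffCosts) {
            total += cost;
            if (total > maxDist) {
                break; // already past the limit: no point in counting the remaining differences
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[] near = {3, 3, 1};      // made-up costs, true total 7
        double[] far = {3, 3, 10, 10};  // made-up costs, true total 26

        // With a low limit both measurements stop at the same partial total.
        System.out.println(measure(near, 5) + " vs " + measure(far, 5));

        // With an effectively unlimited maxDist the full totals are computed.
        System.out.println(measure(near, Double.MAX_VALUE) + " vs " + measure(far, Double.MAX_VALUE));
    }
}
```

In this model, dropping only the final check while keeping a low maxDist leaves both candidates at the same truncated distance, so the choice between them is arbitrary. Raising maxDist itself lets the full totals be computed and ranked, at the cost of more computation.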

http://cedric.brun.io news and articles on eclipse and eclipse modeling.


Re: Improving threshold usage on EditionDistance.distance(Comparison inProgress, EObject a, EObject [message #1066819 is a reply to message #1066797]

Thu, 04 July 2013 14:05

Victor Roldan Betancort


Quote:

Sure, it might be useful to only override this, and CachingDistance should be made public, you're right.
Please go ahead, open the bugzilla and submit the changes to gerrit.

Created:
Bug 412315 - Increase visibility of EditionDistance.getThresholdRatio(int) to protected https://bugs.eclipse.org/bugs/show_bug.cgi?id=412315
Bug 412316 - Make CachingDistance public https://bugs.eclipse.org/bugs/show_bug.cgi?id=412316

Quote:

That's right, it is passed so that we stop comparing the objects once we go above the matching limit, since we know we won't consider them the same anyway.
But in the end, that's what you are trying to achieve, isn't it? Trying to match as hard as possible? In your customization, if you just remove the "if (measuredDist > maxDist) {" without changing maxDist in the first place, then the counting diff engine would stop anyway. If maxDist is left unchanged, you could end up with several objects having the same (or a very close) computed distance just because we stopped computing it, and then match the wrong one.

I see. So what is the measuredDist returned by CountingDiffEngine? If it stops, is the value always maxDist? Or does it return the actual distance between the objects?
I wonder how my use cases worked properly then... random luck?

Thanks Cedric!
Víctor.
