Home » Modeling » Compare » On EMF Compare 2.X scalability(some thoughts and questions on 2.x scalability)
On EMF Compare 2.X scalability [message #1051942] |
Mon, 29 April 2013 12:04 |
Victor Roldan Betancort Messages: 524 Registered: July 2009 |
Senior Member |
|
|
Hi there,
aside from getting used to the new apis, I've faced new challenges on the 2.X stream.
My main concern is current matching algorithm's scalability. I got surprised to see how test-cases that used to take 15 seconds, now need 20 minutes to finalize.
I've debugged the code and compared it to 1.X, and found some differences that look to me as possible responsibles to this poor performance. Aparently, the scope of comparison has heavily increased on the new match engine. Inspecting ProximityIndex.findTheClosest(Comparison, EObject, Side, Side, boolean) shows how now the engine tries to match each element with every other possible element on the model with the same type (ByTypeIndex takes care of that classifying by type). You can check how the "storageToSearchFor" list contains all elements with the same type in the model as the object to match.
Oppossed to that, 1.X used content matching strategies to narrow the scope of the comparison. This means: we will only compare elements that are around the same point of the models structure, by doing a structural/content match. There was even a search window parameter to narrow/broad even more the match scope in case necessary. Although this could lead to some false positives, it was way more efficient.
For big models, the 2.x behaviour end up in almost 2 * n * n comparisons in the worst case, where n is the number of elements, and a potential 2 due to double check mechanism. Obiously, the new approach is safer in the sense that nothing remains out of the match calculations, but this appears not to scale on big models.
I got surprised to realise this, when scalability has always been a concern to the project.
So, the questions here are:
- Is it true what I've found or I'm missing something?
- Why is the model flattened and compared, and why was the content match of 1.x discarded?
- Any recommended way to avoid this exponential number of comparisons?
- Are there any possible optimizations to know?
Thanks in advance,
Víctor.
|
|
|
Re: On EMF Compare 2.X scalability [message #1052096 is a reply to message #1051942] |
Mon, 29 April 2013 16:08 |
Cedric Brun Messages: 431 Registered: July 2009 |
Senior Member |
|
|
Hi Victor,
"Is it true what I've found or I'm missing something?"
Yes it's pretty much true except the current implementation try to avoid "comparing too much" by looking for perfect identical objects with a cheap comparison before going further. As you said this approach allow us to have "better" results but can ends up being more "explosive" from a computational point of view. So far our efforts have been focused on the merge consistency and better scalability through scope reduction, content match got rewritten by reusing the Diff logic itself and assume most of the model elements are not changing.
Longer term, we intend to have a proper distance from this and then have the ability to sort the storageToSearchFor, that would lead to having the best of both worlds and quickly find the closest element.
This approach had not be implemented yet because of lack of time, but I would gladly review the work of anybody wanting to speed things up.
"Why is the model flattened and compared, and why was the content match of 1.x discarded?"
One of the key requirement of 2.x is the ability to compute differences operating on arbitrary scopes and not rely on the containment hiearchy. Furthermore we need to be sure to find the "best match" whenever thats possible.
- Any recommended way to avoid this exponential number of comparisons?
The figure you gives makes me think our "assumption" are not valid at all in your case and you ends up in the worst case..
Is your model "completely different" ? Are you using Featuremaps ?
ProximityIndex has a ProximityMatchStats object keeping track of the computation made during the process, what is it giving to you ?
- Are there any possible optimizations to know?
Many are coming to my mind, we could keep the objects in their insertion order and introduce a SEARCH_WINDOW again, reducing the "explosion", we can make sure the match process is not comparing references of the EObject avoiding the LCS computation during the match, re-implement a match engine mimicking the 1.x behavior... In any case it would be interesting for us to understand what is going on exactly with your models leading to this.
http://cedric.brun.io news and articles on eclipse and eclipse modeling.
|
|
|
Re: On EMF Compare 2.X scalability [message #1052597 is a reply to message #1052096] |
Tue, 30 April 2013 09:30 |
Victor Roldan Betancort Messages: 524 Registered: July 2009 |
Senior Member |
|
|
Cedric,
some coments below
Quote:Yes it's pretty much true except the current implementation try to avoid "comparing too much" by looking for perfect identical objects with a cheap comparison before going further. As you said this approach allow us to have "better" results but can ends up being more "explosive" from a computational point of view. So far our efforts have been focused on the merge consistency and better scalability through scope reduction, content match got rewritten by reusing the Diff logic itself and assume most of the model elements are not changing.
Longer term, we intend to have a proper distance from this and then have the ability to sort the storageToSearchFor, that would lead to having the best of both worlds and quickly find the closest element.
This approach had not be implemented yet because of lack of time, but I would gladly review the work of anybody wanting to speed things up.
If I find something useful to share with you guys, I'll gladly do it. So far, the only thing I could come with is some sort of heuristic to reduce the size of storageToSearchFor by taking only elements with the same parent as the object subject to be matched. Not working perfectly anyway, but reduced the computation time.
Quote:
- Any recommended way to avoid this exponential number of comparisons?
The figure you gives makes me think our "assumption" are not valid at all in your case and you ends up in the worst case..
Is your model "completely different" ? Are you using Featuremaps ?
ProximityIndex has a ProximityMatchStats object keeping track of the computation made during the process, what is it giving to you ?
We are using the OMG's KDM metamodel. The test-case model I'm using is not that big (17mb XMI file), and there are not many differences. Neiher we are using FeatureMaps.
The objects are stored in a CDO Server instance before comparing. I used to have problems with CDO in the 1.X stream (regarding both scalability and reliability), but in 2.X both XMI and CDO take the same amount of time to compute the differences. However, I'm yet to determine whether the fact is stored in CDO may introduce false-positives (which happened quite a lot in the 1.X stream).
I took a look at the stats variable. As you know, its segmented by type (due to ByTypeIndex), but these are some of the most "conflictive" elements:
Addresses -> ProximityMatchStats [nbIndenticComparison=313935, nbMaxDistComparison=137085, nbnoMatch=104, nbSuccessIdenticComparison=3854, nbSuccessMaxComparison=32, nbBacktrack=240144, nbDoubleCheck=104, nbFailedDoubleCheck=72]
Reads -> ProximityMatchStats [nbIndenticComparison=642745, nbMaxDistComparison=474863, nbnoMatch=279, nbSuccessIdenticComparison=5366, nbSuccessMaxComparison=130, nbBacktrack=483938, nbDoubleCheck=238, nbFailedDoubleCheck=108]
Hope that helps, I'll keep trying to find whats going on.
|
|
| |
Re: On EMF Compare 2.X scalability [message #1053694 is a reply to message #1052621] |
Tue, 07 May 2013 09:12 |
Cedric Brun Messages: 431 Registered: July 2009 |
Senior Member |
|
|
Using the same feature filter for both the Match and Diff would be a better default IMO as long as one can have provide different ones when needed.
Regarding scalability I'm currently experimenting in containing the computation by using a SEARCH_WINDOW just like EMF compare 1.x. By doing so I have quite a few changes in the way big models are matched, I'm not sure yet if these changes are "normal" or if they are problematic and I'm going through them right now. In the meantime, if you want to give it a try, here is the modified ProximityMatcher
http://pastebin.com/cPZPE1gN
edit : SEARCH_WINDOW = 1500 works well for me from a matching quality perspective.
http://cedric.brun.io news and articles on eclipse and eclipse modeling.
[Updated on: Tue, 07 May 2013 09:50] Report message to a moderator
|
|
| |
Re: On EMF Compare 2.X scalability [message #1054852 is a reply to message #1053807] |
Wed, 08 May 2013 09:46 |
Victor Roldan Betancort Messages: 524 Registered: July 2009 |
Senior Member |
|
|
Hi Cedric,
thanks for taking some time to look into this.
Quote:
I continued to work on the "search window" capability (not ready for general consumption yet), in the meantime I also introduced a cache for the distance computation.
http://git.eclipse.org/c/emfcompare/org.eclipse.emf.compare.git/commit/?id=6350fbb63dc60973510cd27068da73767b9ee014
It cuts my comparison times on decent to big models, could you have a try and tell me how it goes for you ?
I just checked out git HEAD, and the new org.eclipse.emf.compare.match.eobject.internal.CachingDistance seems to be missing in the repository :S I made sure the commit in synchronized with is the same you shared (which happens to be HEAD right now). Maybe you forgot to commit it too?
Quote:
Using the same feature filter for both the Match and Diff would be a better default IMO as long as one can have provide different ones when needed.
Regarding scalability I'm currently experimenting in containing the computation by using a SEARCH_WINDOW just like EMF compare 1.x. By doing so I have quite a few changes in the way big models are matched, I'm not sure yet if these changes are "normal" or if they are problematic and I'm going through them right now. In the meantime, if you want to give it a try, here is the modified ProximityMatcher
http://pastebin.com/cPZPE1gN
edit : SEARCH_WINDOW = 1500 works well for me from a matching quality perspective.
Ill check it and let you know, first I have to get rid of compilation errors
Another thing, not really related, but may help you guys.
There seems to exist a JDK 1.6 usage in org.eclipse.emf.compare.ide.ui:
/org.eclipse.emf.compare.ide.ui/src/org/eclipse/emf/compare/ide/ui/internal/contentmergeviewer/text/EMFCompareTextMergeViewer.java
Line 196:
String leftValueFromWidget = new String(getContents(true), Charsets.UTF_8);
String rightValueFromWidget = new String(getContents(false), Charsets.UTF_8);
I assume you guys want to keep 1.5 compatibility. The bundles execution environment declares J2SE15... so I assume that constructor shouldn't be used.
Best Regards,
Víctor Roldán [Open Canarias]
|
|
|
Re: On EMF Compare 2.X scalability [message #1054854 is a reply to message #1051942] |
Wed, 08 May 2013 09:49 |
Victor Roldan Betancort Messages: 524 Registered: July 2009 |
Senior Member |
|
|
Hi Cedric,
thanks for taking some time to look into this.
Quote:
I continued to work on the "search window" capability (not ready for general consumption yet), in the meantime I also introduced a cache for the distance computation.
http://git.eclipse.org/c/emfcompare/org.eclipse.emf.compare.git/commit/?id=6350fbb63dc60973510cd27068da73767b9ee014
It cuts my comparison times on decent to big models, could you have a try and tell me how it goes for you ?
I just checked out git HEAD, and the new org.eclipse.emf.compare.match.eobject.internal.CachingDistance seems to be missing in the repository :S I made sure the commit in synchronized with is the same you shared (which happens to be HEAD right now). Maybe you forgot to commit it too?
Quote:
Using the same feature filter for both the Match and Diff would be a better default IMO as long as one can have provide different ones when needed.
Regarding scalability I'm currently experimenting in containing the computation by using a SEARCH_WINDOW just like EMF compare 1.x. By doing so I have quite a few changes in the way big models are matched, I'm not sure yet if these changes are "normal" or if they are problematic and I'm going through them right now. In the meantime, if you want to give it a try, here is the modified ProximityMatcher
http://pastebin.com/cPZPE1gN
edit : SEARCH_WINDOW = 1500 works well for me from a matching quality perspective.
Ill check it and let you know, first I have to get rid of compilation errors
Another thing, not really related, but may help you guys.
There seems to exist a JDK 1.6 usage in org.eclipse.emf.compare.ide.ui:
/org.eclipse.emf.compare.ide.ui/src/org/eclipse/emf/compare/ide/ui/internal/contentmergeviewer/text/EMFCompareTextMergeViewer.java
Line 196:
String leftValueFromWidget = new String(getContents(true), Charsets.UTF_8);
String rightValueFromWidget = new String(getContents(false), Charsets.UTF_8);
I assume you guys want to keep 1.5 compatibility. The bundles execution environment declares J2SE15... so I assume that constructor shouldn't be used.
|
|
| | | |
Re: On EMF Compare 2.X scalability [message #1057870 is a reply to message #1057864] |
Thu, 09 May 2013 12:57 |
Victor Roldan Betancort Messages: 524 Registered: July 2009 |
Senior Member |
|
|
Quote:
You're right, I forgot to add this file. Pushed a fixing commit right now and triggered a build https://hudson.eclipse.org/hudson/job/emf-compare-master/547/
Great! I'll check if theres any improvement here.
Quote:
by the way, I'm gonna have to bring back chocolate at the office because of this "last minute commit before leaving".
Good one! Is that strategy under EPL? I'd like to implement here at the office too
Quote:
I was going back to our initial discussion : "[..] heuristic to reduce the size of storageToSearchFor by taking only elements with the same parent as the object subject to be matched. Not working perfectly anyway, but reduced the computation time."
Would you mind sharing it with us through a patch or something ?
I lost everything once I checked out from git repository Anyway, the idea behind that naive change I made is some sort of Search Window strategy.
Anyway, given latest results with our big KDM models, and given our time constraints, we will keep using our previous code with EMF Compare 1.3, and the newer code will use 2.X stream, which compares over small models (and seems to work flawlessly, no false positives so far). I'll try to spend sometime in figuring out whats wrong with our KDM models to make 2.X not to scale.
Cheers,
Víctor.
|
|
| | | |
Re: On EMF Compare 2.X scalability [message #1058298 is a reply to message #1054854] |
Mon, 13 May 2013 13:44 |
|
Victor Roldan Betancort wrote on Wed, 08 May 2013 05:49
Another thing, not really related, but may help you guys.
There seems to exist a JDK 1.6 usage in org.eclipse.emf.compare.ide.ui:
/org.eclipse.emf.compare.ide.ui/src/org/eclipse/emf/compare/ide/ui/internal/contentmergeviewer/text/EMFCompareTextMergeViewer.java
Line 196:
String leftValueFromWidget = new String(getContents(true), Charsets.UTF_8);
String rightValueFromWidget = new String(getContents(false), Charsets.UTF_8);
I assume you guys want to keep 1.5 compatibility. The bundles execution environment declares J2SE15... so I assume that constructor shouldn't be used.
What version are you using? I couldn't find any reference to that string constructor in our code base, and there does not seem to be any hit in the history of that class either.
Other than that, any chance you could provide us with the models you are seeing these performance issues with? As Cedric mentionned, we eventually plan on having something more flexible for the content match, but for now profiling would be a lot more help to track down what goes wrong in your case with our current implementation.
Laurent Goubet
Obeo
|
|
| | |
Re: On EMF Compare 2.X scalability [message #1058818 is a reply to message #1058444] |
Wed, 15 May 2013 08:58 |
Victor Roldan Betancort Messages: 524 Registered: July 2009 |
Senior Member |
|
|
Quote:
Yup, that was a mistake from my side (I forgot to "git pull" before checking ... came back from holidays yesterday ).
Java 6 usage is a mistake from another member of the Team, which has been fixed yesterday. This had been introduced post-M7, so only master was affected. As Cédric said in a previous comment, someone will have to bring chocolate to the office, that's the rule here since 2007.
Glad to see that fixed
Quote:
Other than that, any chance you could provide us with the models you are seeing these performance issues with? As Cedric mentionned, we eventually plan on having something more flexible for the content match, but for now profiling would be a lot more help to track down what goes wrong in your case with our current implementation.
I wish I could, these models represent clients code, I dont think I have permission to spread clients code
|
|
|
Re: On EMF Compare 2.X scalability [message #1058869 is a reply to message #1058818] |
Wed, 15 May 2013 13:25 |
|
Quote:
I wish I could, these models represent clients code, I dont think I have permission to spread clients code
I wish we had a model obfuscator somewhere . Shuffling randomly all primitive attributes would probably suffice to have an incomprehensible model that is still representative of a "real world" use case showing these issues?
Or if you can profile and send us a snapshot of the comparison time's profiling (yourkit does a really good job http://www.yourkit.com/purchase/index.jsp)?
Laurent Goubet
Obeo
|
|
| | |
Re: On EMF Compare 2.X scalability [message #1059968 is a reply to message #1059946] |
Wed, 22 May 2013 11:51 |
|
Laurent Goubet wrote on Wed, 22 May 2013 06:00I believe that one of the bottlenecks we've fixed is still present in this snapshot's traces
Sorry about that. It is not the same bottleneck, though it is due to the same mis-use of the guava APIs (or so I think). We'll try and reproduce on our side, but we have a good lead now .
Thanks for the snapshot, we'll keep you updated as soon as we confirm/fix the issue.
Laurent Goubet
Obeo
|
|
| | | | | | | | |
Goto Forum:
Current Time: Wed Sep 25 22:13:04 GMT 2024
Powered by FUDForum. Page generated in 0.06334 seconds
|