Eclipse Community Forums
Forum Search:

Search      Help    Register    Login    Home
Home » Modeling » Compare » Diff process on huge file is really long (v2.1.0M6)
Diff process on huge file is really long (v2.1.0M6) [message #1029174] Fri, 29 March 2013 04:57 Go to next message
Guillaume P. is currently offline Guillaume P.
Messages: 69
Registered: June 2010
Location: Toulouse, France
Member
Hi,

I'm currently trying to migrate from 1.2.2 (On which I had made several local hard customizations) to the last version 2.1.0 (M6). But I have really poor performances when comparing a really big file (120Mo, with more than 1 000 000 objects).

It seems that the bottleneck is the diff process which is using a MatchCrossReferencer to check values (most of time, when I pause the threads in debug mode, I see that it is into ComparisonSpec.getInverse(...)). It's taking lot of time for each object in this function, like if the match were redone for each object checked by the diff-engine (what would be strange as the match is already done when making the diff).

I had previously (for version 1.2.2) done a local customization to index in a Map all matched objects by their XMI Id (when there is, which is my case), and it was really improving performances. Did you make the same thing in version 2.1 ? If not, would it resolve the problem, and can I make a clean customization to do so ?

[Updated on: Fri, 29 March 2013 04:58]

Report message to a moderator

Re: Diff process on huge file is really long (v2.1.0M6) [message #1032592 is a reply to message #1029174] Wed, 03 April 2013 03:26 Go to previous messageGo to next message
Laurent Goubet is currently offline Laurent Goubet
Messages: 1625
Registered: July 2009
Senior Member
Hi,

Without knowing exactly what you did to customize 1.2.2, I will remain generic.

If your models use XMI IDs, then indeed the Matching phase will be extremely fast. In such a scenario, the differencing will always be the bottleneck. Further than that, what usually takes up a lot of time is determining differences in large lists of reference/attribute values: we re-use the result of the match (through the cross referencer you've noticed), but we still have a lot of information to compute, notably for ordering differences.

120Mo for 1 000 000 elements seems like an odd model though... such large files are usually divided into more usable fragments. Could you provide us with your sample so that we can profile and see whether there are ways to improve our performances on such datasets?

Laurent Goubet
Obeo
Re: Diff process on huge file is really long (v2.1.0M6) [message #1032875 is a reply to message #1032592] Wed, 03 April 2013 10:54 Go to previous messageGo to next message
Guillaume P. is currently offline Guillaume P.
Messages: 69
Registered: June 2010
Location: Toulouse, France
Member
Firstly, thanks for your answer.

About the model, I cannot provide it to you as it is, but it is quite simple:

A container Container has 8 or 9 non-ordered containment references to different type of classes (A, B, C, ...). Each of these classes can also contains 2 or 3 elements (so the sub-structure is really not heavy).
So the model is quite flat. I'll try to share an example...

We strongly encourage users to split their data into several files (Containers can reference other containers), but for our validation, we do the worst case where everything is in one file. I also made a mistake with the size of our big model: it is only about 200 000 objects.

By continuing my investigations about these performances problems, I saw two points:
¤ In DefaultDiffEngine.computeContainmentDifferencesThreeWay(...), we have
Iterables.filter(leftValues, not(containedIn(comparison, lcsOriginLeft)))
is used for computing missing elements, and this is iterating, for each object of the resource, over all contained objects (so 200 000 * 200 000 times). This doesn't seems to be efficient as I think that we can already have this non-matched elements list from the previous match process ?
¤ In FeatureFilter.checkForOrderingChanges(...), the order is checked for any containment. I think that this is needed only if we don't have XMIIDs (because in this case references are done thanks to the position of the elements in their container), so it could be interesting to check if XMIIDs are available, in which case only the isOrdered attribute should be taken into account.
Re: Diff process on huge file is really long (v2.1.0M6) [message #1032945 is a reply to message #1032875] Wed, 03 April 2013 13:02 Go to previous messageGo to next message
Guillaume P. is currently offline Guillaume P.
Messages: 69
Registered: June 2010
Location: Toulouse, France
Member
On the first point (Iterables.filter(leftValues, not(containedIn(...)))), I think that the biggest point is that all this process is really heavy overall to be able to detect move changes. But the thing is that is it always done, even if the checkOrdering parameter is false. In such case, the process should be different and much more simple in order to use the previous match information and to reduce drastically time taken by this cross re-checking.
I'll try to apply this solution and i will keep you aware.

[Updated on: Wed, 03 April 2013 13:03]

Report message to a moderator

Re: Diff process on huge file is really long (v2.1.0M6) [message #1033407 is a reply to message #1032875] Thu, 04 April 2013 03:52 Go to previous messageGo to next message
Laurent Goubet is currently offline Laurent Goubet
Messages: 1625
Registered: July 2009
Senior Member
Guillaume,

I'll use the term "LCS" to talk about the "Longest Common Subsequence" here, you might want to read about it on wikipedia if you wish to understand more on that.

Quote:

¤ In DefaultDiffEngine.computeContainmentDifferencesThreeWay(...), we have
Iterables.filter(leftValues, not(containedIn(comparison, lcsOriginLeft)))
is used for computing missing elements, and this is iterating, for each object of the resource, over all contained objects (so 200 000 * 200 000 times). This doesn't seems to be efficient as I think that we can already have this non-matched elements list from the previous match process ?


This is why determining differences on very large reference lists is long. The fact that your model is very flat is what makes this process inefficient. Do note that we already reuse the match (and XMIID) information here: "containedIn" will use the comparison to check for matches.

Also, please note that this is not aimed at detected "non-matched" elements. This aims at detecting "objects that are not in the LCS between the two lists": anything that is not part of it is a difference, be it add, delete or move. In other words, we are detecting all differences with these two lines.

We might be able to do a better job here if we "remembered" both the LCS and the rest of the input sequence instead of re-checking afterwards for LCS.contains(). One question though: did you use a profiler to get to these two potential bottlenecks or were you simply "pausing" the process at random?

Quote:

¤ In FeatureFilter.checkForOrderingChanges(...), the order is checked for any containment. I think that this is needed only if we don't have XMIIDs (because in this case references are done thanks to the position of the elements in their container), so it could be interesting to check if XMIIDs are available, in which case only the isOrdered attribute should be taken into account.


I do not see why you think having XMIIDs changes anything here. Whether you have IDs or not, your objects might still have moved. And XMIID or not, containment or not, what we call "ordering" is the same: if a reference contains A, B and C on one side, but B and A and C on the other side, then "B" has been moved. We do not check for IDs here.

Laurent Goubet
Obeo
Re: Diff process on huge file is really long (v2.1.0M6) [message #1033497 is a reply to message #1033407] Thu, 04 April 2013 05:32 Go to previous messageGo to next message
Guillaume P. is currently offline Guillaume P.
Messages: 69
Registered: June 2010
Location: Toulouse, France
Member
Hi Laurent,

I didn't understood that the move change was also including move between two containers (I was only thinking about reordering in the same object containement/reference).
The XMIID thing is that if you don't have ids (XMIID or custom ID), then the references are serialized thanks to the position of the element in its container (something like ../myfolder/fileX.ext#//orders.14), and in this case if an element A is moved into an element B from the 1rst position to the 2nd position, then the references from C to A (for instance) will be broken if the change is not merged. If you have ids, the element A can be moved at any place in the reference (without changing of container and reference) without any impact on object referencing it. In such case, if the containment reference is not ordered, then it's not needed to search for "internal moves".

I think that using the LCS process (and then the containment test with an iteration through the reference content) is something really heavy when the content of the reference is huge, and needed only for the case of reordering in the same reference of the same object. In other cases, checking if there is a match or if the eContainer of the element on both side would be sufficient. I began making tests with such process and it's really, really faster (but I need to improve it and to make more tests to check that I lose nothing).

Also the change you're speaking about ("remembering both the LCS and the rest of the input sequence instead of re-checking afterwards for LCS.contains()") would be better. But in my case, the problem is really this re-iteration to check containment for all elements, even when the reordering (not the move from a container to another) check is not needed.

Last point: yes, I just paused randomly and several times the process... what is not really a good process to find such bottleneck, but I don't know existing good tool for that (I was using TPTP before, but it seems that it has not been updated for Indigo).
Re: Diff process on huge file is really long (v2.1.0M6) [message #1033621 is a reply to message #1029174] Thu, 04 April 2013 08:46 Go to previous messageGo to next message
Guillaume P. is currently offline Guillaume P.
Messages: 69
Registered: June 2010
Location: Toulouse, France
Member
Here is my improvement for the computeContainmentDifferencesThreeWay method (I'll try to improve also other methods):

@Override
protected void computeContainmentDifferencesThreeWay(Match match, EReference reference, boolean checkOrdering) {
	final Comparison comparison = match.getComparison();
	
	// We won't use iterables here since we need random access collections
	// for fast LCS.
	final List<Object> leftValues = getAsList(match.getLeft(), reference);
	final List<Object> rightValues = getAsList(match.getRight(), reference);
	final List<Object> originValues = getAsList(match.getOrigin(), reference);
	
	// TODO Can we shortcut in any way?
	
	final Iterable<Object> changedLeft;
	final Iterable<Object> changedRight;
	if (checkOrdering) {
		// search common parts
		final List<Object> lcsOriginLeft = DiffUtil.longestCommonSubsequence(comparison, originValues, leftValues);
		final List<Object> lcsOriginRight = DiffUtil.longestCommonSubsequence(comparison, originValues, rightValues);
		
		// Which values have "changed" in any way?
		changedLeft = Iterables.filter(leftValues, not(containedIn(comparison, lcsOriginLeft)));
		changedRight = Iterables.filter(rightValues, not(containedIn(comparison, lcsOriginRight)));
	} else {
		// only search elements moved from a container to another, or added
		changedLeft = new ArrayList<Object>();
		for (Object obj : leftValues) {
			if (obj instanceof EObject) {
				Match m = comparison.getMatch((EObject) obj);
				if (m != null && m.getOrigin() != null && sameContainment(comparison, m.getOrigin(), (EObject) obj)) {
					continue;
				}
			}
			((List<Object>) changedLeft).add(obj);
		}
		changedRight = new ArrayList<Object>();
		for (Object obj : rightValues) {
			if (obj instanceof EObject) {
				Match m = comparison.getMatch((EObject) obj);
				if (m != null && m.getOrigin() != null && sameContainment(comparison, m.getOrigin(), (EObject) obj)) {
					continue;
				}
			}
			((List<Object>) changedRight).add(obj);
		}
	}
	
	// Added or moved in left
	for (Object diffCandidate : changedLeft) {
		final Match candidateMatch = comparison.getMatch((EObject) diffCandidate);
		if (candidateMatch == null || candidateMatch.getOrigin() == null) {
			// No matching element here --> It's a ADD
			featureChange(match, reference, diffCandidate, DifferenceKind.ADD, DifferenceSource.LEFT);
		} else if (!sameContainment(comparison, candidateMatch.getOrigin(), (EObject) diffCandidate)) {
			// This is a MOVE (moved from another containment)
			featureChange(match, reference, diffCandidate, DifferenceKind.MOVE, DifferenceSource.LEFT);
		} else if (checkOrdering) {
			/*
			 * We are dealing with an element which has changed, and we know
			 * that it's not added nor moved from a containment to another.
			 * So it can only having been reordered in the same containment
			 * list:
			 */
			featureChange(match, reference, diffCandidate, DifferenceKind.MOVE, DifferenceSource.LEFT);
		}
	}
	
	// Added or moved in right
	for (Object diffCandidate : changedRight) {
		final Match candidateMatch = comparison.getMatch((EObject) diffCandidate);
		if (candidateMatch == null || candidateMatch.getOrigin() == null) {
			// No matching element here --> It's a ADD
			featureChange(match, reference, diffCandidate, DifferenceKind.ADD, DifferenceSource.RIGHT);
		} else if (!sameContainment(comparison, candidateMatch.getOrigin(), (EObject) diffCandidate)) {
			// This is a MOVE (moved from another containment)
			featureChange(match, reference, diffCandidate, DifferenceKind.MOVE, DifferenceSource.RIGHT);
		} else if (checkOrdering) {
			/*
			 * We are dealing with an element which has changed, and we know
			 * that it's not added nor moved from a containment to another.
			 * So it can only having been reordered in the same containment
			 * list:
			 */
			featureChange(match, reference, diffCandidate, DifferenceKind.MOVE, DifferenceSource.RIGHT);
		}
	}
	
	// deleted from either side
	for (Object diffCandidate : originValues) {
		/*
		 * A value that is in the origin but not in the left/right either
		 * has been deleted or is a moved element which previously was in
		 * this reference. We'll detect the move on its new reference.
		 */
		final Match candidateMatch = comparison.getMatch((EObject) diffCandidate);
		if (candidateMatch == null) {
			// out of scope
		} else {
			if (candidateMatch.getLeft() == null) {
				featureChange(match, reference, diffCandidate, DifferenceKind.DELETE, DifferenceSource.LEFT);
			}
			if (candidateMatch.getRight() == null) {
				featureChange(match, reference, diffCandidate, DifferenceKind.DELETE, DifferenceSource.RIGHT);
			}
		}
	}
}

/**
 * Check if the given two objects are in a container which the same on each
 * side, and in the same containment feature.
 * @param comparison comparison object
 * @param obj1 first object
 * @param obj2 second object
 * @return true if they are in the same container
 */
protected boolean sameContainment(Comparison comparison, EObject obj1, EObject obj2) {
	EObject obj1Cont = obj1.eContainer();
	EObject obj2Cont = obj2.eContainer();
	Match m1 = comparison.getMatch(obj1Cont);
	Match m2 = comparison.getMatch(obj2Cont);
	return m1 != null && m2 != null && m1 == m2 && obj1.eContainmentFeature().equals(obj2.eContainmentFeature());
}

[Updated on: Thu, 04 April 2013 08:47]

Report message to a moderator

Re: Diff process on huge file is really long (v2.1.0M6) [message #1033687 is a reply to message #1033497] Thu, 04 April 2013 10:19 Go to previous messageGo to next message
Laurent Goubet is currently offline Laurent Goubet
Messages: 1625
Registered: July 2009
Senior Member
Quote:
Hi Laurent,

I didn't understood that the move change was also including move between two containers (I was only thinking about reordering in the same object containement/reference).
The XMIID thing is that if you don't have ids (XMIID or custom ID), then the references are serialized thanks to the position of the element in its container (something like ../myfolder/fileX.ext#//orders.14), and in this case if an element A is moved into an element B from the 1rst position to the 2nd position, then the references from C to A (for instance) will be broken if the change is not merged. If you have ids, the element A can be moved at any place in the reference (without changing of container and reference) without any impact on object referencing it. In such case, if the containment reference is not ordered, then it's not needed to search for "internal moves".


The rationale, was that users often want their models to be "the same" after a merge all... and containment moves, even if unordered, are pretty obvious.

You make a pretty good point though. That will have to be further looked at.

Quote:

But in my case, the problem is really this re-iteration to check containment for all elements, even when the reordering (not the move from a container to another) check is not needed.


The double-checking you've outlined here (through iterables.filter) is not done to check for move or reordering. It is done to check for both addition and moves.

Quote:

Last point: yes, I just paused randomly and several times the process... what is not really a good process to find such bottleneck, but I don't know existing good tool for that (I was using TPTP before, but it seems that it has not been updated for Indigo).


Though not free, Yourkit does an extremely good job.

I am currently profiling this myself, and you were right on mark. DefaultDiffEngine.contains() is one of two large bottlenecks of the differencing process for such datasets. There are a few possible ways to improve the code without hurting the result (and the performance on more usual datasets).

[Edit: I'll look into the changes you propose here. Please use gerrit to provide us patchs, it is far easier for us to look at. See the wiki.]

[Updated on: Thu, 04 April 2013 10:21]

Report message to a moderator

Re: Diff process on huge file is really long (v2.1.0M6) [message #1034463 is a reply to message #1033687] Fri, 05 April 2013 09:10 Go to previous messageGo to next message
Laurent Goubet is currently offline Laurent Goubet
Messages: 1625
Registered: July 2009
Senior Member
Guillaume,

A little update: we'll use something very similar to what you proposed here. Considering that any "containment" diff is ordered whatever the state of the "ordered" boolean seems bogus (Okay, I am the one who made that choice initially, but still Very Happy). The LCS computation can be improved, as is the "double-check" that actually slows down everything; but we can at least respect the ordered boolean and avoid incurring extra costs.

Raised bug 405000 to track discussions in case we eventually stumble upon a case where this approach cannot hold.

Laurent Goubet
Obeo
Re: Diff process on huge file is really long (v2.1.0M6) [message #1034490 is a reply to message #1034463] Fri, 05 April 2013 09:44 Go to previous messageGo to next message
Guillaume P. is currently offline Guillaume P.
Messages: 69
Registered: June 2010
Location: Toulouse, France
Member
Hi Laurent !

Great for this change Very Happy I hope there will be no problem with such solution.

About using gerrit, I don't use git and cannot access to external git or SVN repositories from my workstation, so I think it's more simple to share the code here (or by mail if you prefer).
And about YourKit, I didn't know it, I think I will have a look at it. Thanks for the advice Wink

One another side, I'm currently working on the computeMultiValuedFeatureDifferencesThreeWay which is also too slow for my bigTestModel, because I have some non-ordered references containing something like 1000 elements. I'm currently working on a solution which does no more use LCS at all, and which does not makes repeated containment checks (I reused a customization I already had done for version 1.2.2)... So it's really faster again, but I have some bugs I want to investigate before sharing the code. The idea is to not search the Longest Common Subsequence, but just to read the two lists simultaneously and to detect on the fly what has been added/removed/moved compared to the reference list (the Origin).
Re: Diff process on huge file is really long (v2.1.0M6) [message #1034522 is a reply to message #1034490] Fri, 05 April 2013 10:41 Go to previous messageGo to next message
Laurent Goubet is currently offline Laurent Goubet
Messages: 1625
Registered: July 2009
Senior Member
Guillaume,

Before you go on any further: I have made a few improvements on both the LCS and its use for both containment and non-containment references presenting the "issue" you have here (i.e. containing a lot of values Razz). (Within a reference containing 9500 elements, what previously took 3 minutes now takes a little more than 40 seconds.) Among them is the removal of the two "filter" calls that are the source of the slow-downs and, as you are trying to do, read the lcs _during_ the iteration on the values instead of filtering beforehand.

Removing the LCS at all will indeed make the code faster since it is a costly operation... but can you be sure you consider all cases? Most diff algorithms use the LCS in one way or another, it is one of the best ways to determine the shortest edit path between two sequences: we wish to have the minimal set of differences between two models.

The unit tests of EMF Compare cover a lot of the potential use cases; make sure they run with your modifications Wink.

Laurent Goubet
Obeo
Re: Diff process on huge file is really long (v2.1.0M6) [message #1036469 is a reply to message #1034522] Mon, 08 April 2013 08:33 Go to previous messageGo to next message
Guillaume P. is currently offline Guillaume P.
Messages: 69
Registered: June 2010
Location: Toulouse, France
Member
Great ! How can I test it ? As I don't have repository access, I can only download released versions...

But I terminated a fix on my side also (with the my custom algorithm).
The idea is the following:

refList : A --> B --> C --> D --> G --> E --> F

newList : A --> G --> C --> D --> G --> E --> F --> B

I read the two lists at same time, but stop advancing on the refList or on the newList when there is a difference. In such case, I continue advancing only on one side until I find the beginning of new matching part. (I join the code in the message)

It is working perfectly fine, but you're right on the point that it is not always giving the best solution (the less changes possible).

For instance, in the example above, it will find
- G added
- C, D , E, F moved

if we invert the lists, it's finding
- G added
- B moved

On another side, by writing this solution, I found another problem in the algorithm you use: it doesn't seems to be working when you have several times the same object in the list (if it not unique). You can try it easily by creating an object with a many Integer attribute (for instance), and comparing 0 --> 1 --> 2 with 0 --> 1 --> 2 --> 0.

Additionnaly, in such case, when looking at differences result, if there is a removal you cannot know which element at which index has been removed. So I think that the algorithm should return changed index (not only changed elements), and it should be stored in the diff.
Re: Diff process on huge file is really long (v2.1.0M6) [message #1036471 is a reply to message #1034522] Mon, 08 April 2013 08:35 Go to previous messageGo to next message
Laurent Goubet is currently offline Laurent Goubet
Messages: 1625
Registered: July 2009
Senior Member
Hi,

Updates and patch linked on the associated bug. The main info is that we finally concluded that determining ordering differences on containment references is too important to remove. There is no good way to check if the order is important: EMF uses it for its references in some cases and, even when there are IDs, XMI:IDs or eKeys, EMF still provides ways to access uris that use the ordering. As such, I have left the condition of FeatureFilter that tells EMF Compare to check for ordering on "ordered" or containment.

The performance is greatly enhanced and is no longer as exponential as it was (i.e. the "stupid" filtering I had there and that you located very early is now gone). Though still quadratic in growth, it should be much less noticeable. Memory should also be much less strained with lists containing under 2^15 (32768) elements.

Further enhancements are possible to scale better on very large reference lists such as what you seem to have... but are not planned yet.

Laurent Goubet
Obeo
Re: Diff process on huge file is really long (v2.1.0M6) [message #1036531 is a reply to message #1036471] Mon, 08 April 2013 09:57 Go to previous messageGo to next message
Guillaume P. is currently offline Guillaume P.
Messages: 69
Registered: June 2010
Location: Toulouse, France
Member
Ok, waiting to test your changes, I remade the algorithm to make it using the LCS system (as it seems to be the best solution). It's now able to compare two lists, by selecting the solution with as less changes as possible, supporting several identical elements, and providing index for each change (even if it's not usefull currently).
The result is provided through 3 lists (added, removed, moved), and so there is no more need for repeated containment checks (as it is done currently to check if the changed value was in the remaining values of the origin list).

Here is the final code using this utility:

Override
protected void computeMultiValuedFeatureDifferencesThreeWay(Match match, EStructuralFeature feature, boolean checkOrdering) {
	final Comparison comparison = match.getComparison();
	
	final List<Object> leftValues = getAsList(match.getLeft(), feature);
	final List<Object> rightValues = getAsList(match.getRight(), feature);
	final List<Object> originValues = getAsList(match.getOrigin(), feature);
	
	
	// compute left changes
	List<IndexedObject> addedLeft = new ArrayList<IndexedObject>();
	List<IndexedObject> removedLeft = new ArrayList<IndexedObject>();
	List<IndexedObject> movedLeft = checkOrdering ? new ArrayList<IndexedObject>() : null;
	CompareListsUtil.computeListDiff(comparison, originValues, leftValues, addedLeft, removedLeft, movedLeft);
	
	// compute right changes
	List<IndexedObject> addedRight = new ArrayList<IndexedObject>();
	List<IndexedObject> removedRight = new ArrayList<IndexedObject>();
	List<IndexedObject> movedRight = checkOrdering ? new ArrayList<IndexedObject>() : null;
	CompareListsUtil.computeListDiff(comparison, originValues, rightValues, addedRight, removedRight, movedRight);
	
	
	// create final changes left
	for (IndexedObject addedLeftObj : addedLeft) {
		featureChange(match, feature, addedLeftObj.getObj(), DifferenceKind.ADD, DifferenceSource.LEFT);
	}
	for (IndexedObject removedLeftObj : removedLeft) {
		featureChange(match, feature, removedLeftObj.getObj(), DifferenceKind.DELETE, DifferenceSource.LEFT);
	}
	if (checkOrdering) {
		for (IndexedObject movedLeftObj : movedLeft) {
			featureChange(match, feature, movedLeftObj.getObj(), DifferenceKind.MOVE, DifferenceSource.LEFT);
		}
	}
	
	// create final changes right
	for (IndexedObject addedRightObj : addedRight) {
		featureChange(match, feature, addedRightObj.getObj(), DifferenceKind.ADD, DifferenceSource.RIGHT);
	}
	for (IndexedObject removedRightObj : removedRight) {
		featureChange(match, feature, removedRightObj.getObj(), DifferenceKind.DELETE, DifferenceSource.RIGHT);
	}
	if (checkOrdering) {
		for (IndexedObject movedRightObj : movedRight) {
			featureChange(match, feature, movedRightObj.getObj(), DifferenceKind.MOVE, DifferenceSource.RIGHT);
		}
	}
}


and

@Override
protected void computeContainmentDifferencesThreeWay(Match match, EReference reference, boolean checkOrdering) {
	final Comparison comparison = match.getComparison();
	
	final List<Object> leftValues = getAsList(match.getLeft(), reference);
	final List<Object> rightValues = getAsList(match.getRight(), reference);
	final List<Object> originValues = getAsList(match.getOrigin(), reference);
	
	
	// compute left changes
	List<IndexedObject> addedLeft = new ArrayList<IndexedObject>();
	List<IndexedObject> removedLeft = new ArrayList<IndexedObject>();
	List<IndexedObject> movedLeft = checkOrdering ? new ArrayList<IndexedObject>() : null;
	CompareListsUtil.computeListDiff(comparison, originValues, leftValues, addedLeft, removedLeft, movedLeft);
	
	// compute right changes
	List<IndexedObject> addedRight = new ArrayList<IndexedObject>();
	List<IndexedObject> removedRight = new ArrayList<IndexedObject>();
	List<IndexedObject> movedRight = checkOrdering ? new ArrayList<IndexedObject>() : null;
	CompareListsUtil.computeListDiff(comparison, originValues, rightValues, addedRight, removedRight, movedRight);
	
	
	// create final changes left
	for (IndexedObject addedLeftObj : addedLeft) {
		final Match candidateMatch = comparison.getMatch((EObject) addedLeftObj.getObj());
		if (sameContainment(comparison, candidateMatch.getOrigin(), (EObject) addedLeftObj)) {
			featureChange(match, reference, addedLeftObj.getObj(), DifferenceKind.ADD, DifferenceSource.LEFT);
		} else {
			featureChange(match, reference, addedLeftObj.getObj(), DifferenceKind.MOVE, DifferenceSource.LEFT);
		}
	}
	for (IndexedObject removedLeftObj : removedLeft) {
		final Match candidateMatch = comparison.getMatch((EObject) removedLeftObj.getObj());
		if (candidateMatch != null && candidateMatch.getLeft() == null) {
			featureChange(match, reference, removedLeftObj.getObj(), DifferenceKind.DELETE, DifferenceSource.LEFT);
		}
	}
	if (checkOrdering) {
		for (IndexedObject movedLeftObj : movedLeft) {
			featureChange(match, reference, movedLeftObj.getObj(), DifferenceKind.MOVE, DifferenceSource.LEFT);
		}
	}
	
	// create final changes right
	for (IndexedObject addedRightObj : addedRight) {
		final Match candidateMatch = comparison.getMatch((EObject) addedRightObj.getObj());
		if (sameContainment(comparison, candidateMatch.getOrigin(), (EObject) addedRightObj)) {
			featureChange(match, reference, addedRightObj.getObj(), DifferenceKind.ADD, DifferenceSource.RIGHT);
		} else {
			featureChange(match, reference, addedRightObj.getObj(), DifferenceKind.MOVE, DifferenceSource.RIGHT);
		}
	}
	for (IndexedObject removedRightObj : removedRight) {
		final Match candidateMatch = comparison.getMatch((EObject) removedRightObj.getObj());
		if (candidateMatch != null && candidateMatch.getRight() == null) {
			featureChange(match, reference, removedRightObj.getObj(), DifferenceKind.DELETE, DifferenceSource.RIGHT);
		}
	}
	if (checkOrdering) {
		for (IndexedObject movedRightObj : movedRight) {
			featureChange(match, reference, movedRightObj.getObj(), DifferenceKind.MOVE, DifferenceSource.RIGHT);
		}
	}
}

/**
 * Check if the given two objects are in a container which the same on each
 * side, and in the same containment feature.
 * @param comparison comparison object
 * @param obj1 first object
 * @param obj2 second object
 * @return true if they are in the same container
 */
protected boolean sameContainment(Comparison comparison, EObject obj1, EObject obj2) {
	EObject obj1Cont = obj1.eContainer();
	EObject obj2Cont = obj2.eContainer();
	Match m1 = comparison.getMatch(obj1Cont);
	Match m2 = comparison.getMatch(obj2Cont);
	return m1 != null && m2 != null && m1 == m2 && obj1.eContainmentFeature().equals(obj2.eContainmentFeature());
}


And I join the new version of the compare utility in a file.

I hope it will be usefull.

About the ordering check for containment, I understand your choice and I still have the possibility to override the FeatureFilter on my side in order to change it. So no problem on this point.

It's great to see how you are attentive to questions on your tool Very Happy

[EDIT: I also added the code for computeContainmentDifferencesThreeWay()]

[Updated on: Mon, 08 April 2013 10:28]

Report message to a moderator

Re: Diff process on huge file is really long (v2.1.0M6) [message #1037132 is a reply to message #1036469] Tue, 09 April 2013 03:50 Go to previous messageGo to next message
Laurent Goubet is currently offline Laurent Goubet
Messages: 1625
Registered: July 2009
Senior Member
Hi Guillaume,

I did not see the message you posted just before my own yesterday, so answers for both:

Quote:

Great ! How can I test it ? As I don't have repository access, I can only download released versions...


We should have an integration ready later today, I'll post the link here when it's out of the oven.

Quote:

It is working perfectly fine, but you're right on the point that it is not always giving the best solution (the less changes possible).


There is a lot of documentation on the matter, you could read about the various LCS implementations and the Levenshtein distance if you are interested by the matter.

We chose to use the LCS to solve this as it seemed like the best way, which is why I was a little reluctant on removing it now (I am pretty sure that we would forget about some special cases).

Quote:

On another side, by writing this solution, I found another problem in the algorithm you use: it doesn't seems to be working when you have several times the same object in the list (if it not unique). You can try it easily by creating an object with a many Integer attribute (for instance), and comparing 0 --> 1 --> 2 with 0 --> 1 --> 2 --> 0.


Yes. I saw that too when rewriting the algorithm to remove my stupid filterings. That cas is handled now.

Quote:

Additionnaly, in such case, when looking at differences result, if there is a removal you cannot know which element at which index has been removed. So I think that the algorithm should return changed index (not only changed elements), and it should be stored in the diff.


Actually, the LCS elements appear in the same order they do in the actual input lists. As such, we can easily determine the actual index of the changed element if we properly iterate on the LCS "one at a time" instead of pre-filtering as we previously did.

Quote:

The result is provided through 3 lists (added, removed, moved), and so there is no more need for repeated containment checks (as it is done currently to check if the changed value was in the remaining values of the origin list).


The while loop of computeListDiff really scares me Smile. The exit condition is just too complex to properly read or test.

Furthermore, I believe that there are more comparisons (equal tests) and slightly more iterations in your code than in the change I proposed... I'll run your implementation against our test suite and profile it to check.

Laurent Goubet
Obeo
Re: Diff process on huge file is really long (v2.1.0M6) [message #1037171 is a reply to message #1037132] Tue, 09 April 2013 04:40 Go to previous messageGo to next message
Guillaume P. is currently offline Guillaume P.
Messages: 69
Registered: June 2010
Location: Toulouse, France
Member
Quote:
Actually, the LCS elements appear in the same order they do in the actual input lists. As such, we can easily determine the actual index of the changed element if we properly iterate on the LCS "one at a time" instead of pre-filtering as we previously did.


Hum... The problem I could see is when merging. In this case the index must be re-computed. Are you sure that it's also working when you have several times the same object (is it re-detecting the index of the object at the same index, even if some other changes in the same list on the same object at another index have already been merged ?) ?
I fear that when making such merge it remove/move/add the element at the wrong place if we previoulsy merged a change on an equal object which was also at another place (I don't know if I am really understandable).

Quote:
The while loop of computeListDiff really scares me. The exit condition is just too complex to properly read or test.


In computeListDiff, on each step (while iteration) we decide to go to next element for the left and/or the right list, and if we go to next element for both, then we go also for the LCS list. The exit condition is "stop if we will try to get next element on a list where there is no more element".
It's equivalent to:
boolean someEndReached = false;
if (goToNext) {
	someEndReached = iterator.hasNext();
}
if (goToNextRef) {
	someEndReached = someEndReached || iteratorRef.hasNext();
}
if (goToNextRef || goToNext) {
	someEndReached = someEndReached || iteratorCommon.hasNext();
}

\\ and while would be (!someEndReached)

So the maximum complexity is the one for computing LCS + list.size + listRef.size. And the advantage is that with the given result, there is no more need for comparing each leftChange, or rightChange to the originValues with the contains() method (which is really heavy when there is lot of changes, for instance when 4000 elements are added to a list).

I'm sorry if I really insist on such small improvements, but in my case it's really important because the comparison was not taking just 1 hour but really near of one day (or more, I killed the process after that)... and it's an operation which can be done in a normal case, even if we encourage users to have spited models. So any small improvement has really big impacts on performances for such case.

[Updated on: Tue, 09 April 2013 04:50]

Report message to a moderator

Re: Diff process on huge file is really long (v2.1.0M6) [message #1037215 is a reply to message #1037171] Tue, 09 April 2013 05:50 Go to previous messageGo to next message
Laurent Goubet is currently offline Laurent Goubet
Messages: 1625
Registered: July 2009
Senior Member
Quote:

Hum... The problem I could see is when merging. In this case the index must be re-computed. Are you sure that it's also working when you have several times the same object (is it re-detecting the index of the object at the same index, even if some other changes in the same list on the same object at another index have already been merged ?) ?
I fear that when making such merge it remove/move/add the element at the wrong place if we previoulsy merged a change on an equal object which was also at another place (I don't know if I am really understandable).


The merge process is not tied to the differencing process. We re-compute the proper insertion index for each merge operation. The user can merge whatever difference first ... so we need to take into account all things that will/won't change when determining the insertion point.

Recomputing the index allows us to be sure that it is good... but it presents its own issues. When merging a single diff, that's fine. However, when "batch" merging differences (using "merge all non conflicting differences" for example), this forces us to re-compute the LCS for every single difference in the list to merge... And that leads to unacceptable performance.

We introduced a batch merging API (IBatchMerger) in order to take this into account... but we do not have an intelligent merger behind that yet. The idea would be to compute the LCS once for a reference where multiple diffs are to be merged, then update it after every merge operation instead of re-computing it from scratch. That's not coded yet though... and not planned for Kepler.

Quote:

I'm sorry if I really insist on such small improvements, but in my case it's really important because the comparison was not taking just 1 hour but really near of one day (or more, I killed the process after that)... and it's an operation which can be done in a normal case, even if we encourage users to have spited models. So any small improvement has really big impacts on performances for such case.


No worries, performance issues as big as the two filters you located are critical enough to fix. However, patches provided through forum are hard to verify and test. As I mentionned, I was looking at yours to run the tests with it and profile its performance.

Your proposed patch provokes 57 test errors (ClassCastException) and 16 failures (we detect more moves with your patch than expected... and some conflicts are not properly detected, but that may be related). Fixing the ClassCast still leaves 27 errors and 27 failures. Performance-wise, your patch performs as well (slightly slower, but that's in the range of experimental errors) as what we have in the pipeline.

Laurent Goubet
Obeo
Re: Diff process on huge file is really long (v2.1.0M6) [message #1037219 is a reply to message #1037171] Tue, 09 April 2013 05:53 Go to previous messageGo to next message
Guillaume P. is currently offline Guillaume P.
Messages: 69
Registered: June 2010
Location: Toulouse, France
Member
I just remarked that my compareListDiff was containing an error: we can have differences on both sides at same time without being in sync with the LCS list. In such case we should not go to the next element of the LCS.
So there is now a goToNextCommon boolean, and the while condition is understandable like this:

boolean noEndReached = true;
if (goToNext) {
	noEndReached = iterator.hasNext();
}
if (goToNextRef) {
	noEndReached = someEndReached && iteratorRef.hasNext();
}
if (goToNextCommon) {
	noEndReached = someEndReached && iteratorCommon.hasNext();
}


I join the fixed version.
Re: Diff process on huge file is really long (v2.1.0M6) [message #1037230 is a reply to message #1037219] Tue, 09 April 2013 06:05 Go to previous messageGo to next message
Guillaume P. is currently offline Guillaume P.
Messages: 69
Registered: June 2010
Location: Toulouse, France
Member
Ha, I didn't see your message before my post.

I didn't had ClassCastException on my side, but your test is probably more complete (can you just give me the line ?). About the number of raised differences, I think that it's resolved by the version in the post above. But in any case, if it's as fast (or even less) as what you did, and if all issues have been fixed, then I will just use your new version Very Happy
Re: Diff process on huge file is really long (v2.1.0M6) [message #1037289 is a reply to message #1037230] Tue, 09 April 2013 07:33 Go to previous messageGo to next message
Laurent Goubet is currently offline Laurent Goubet
Messages: 1625
Registered: July 2009
Senior Member
Guillaume,

The class cast exception(s) were on the code you copy/pasted in a previous message namely

for (IndexedObject addedLeftObj : addedLeft) {
  final Match candidateMatch = comparison.getMatch((EObject) addedLeftObj.getObj());
  if (sameContainment(comparison, candidateMatch.getOrigin(), (EObject) addedLeftObj)) {
...

where casting "addedLeftObj" as an EObject can only fail.

I cannot test your above version right now, but yes, the performance is roughly the same. The real issue will most likely arise during batch merging operations as I mentionned above if you are in such a case.

Laurent Goubet
Obeo
Re: Diff process on huge file is really long (v2.1.0M6) [message #1037350 is a reply to message #1037289] Tue, 09 April 2013 08:54 Go to previous messageGo to next message
Laurent Goubet is currently offline Laurent Goubet
Messages: 1625
Registered: July 2009
Senior Member
Guillaume,

I see that now, but the forum tells me you are in Toulouse. If that is true, some of my colleagues will be in Toulouse next week for a roadshow related to Eclipse modeling technologies. I won't be there myself, but I thought you might be interested Wink.

Laurent Goubet
Obeo
Re: Diff process on huge file is really long (v2.1.0M6) [message #1039623 is a reply to message #1037350] Fri, 12 April 2013 07:23 Go to previous messageGo to next message
Guillaume P. is currently offline Guillaume P.
Messages: 69
Registered: June 2010
Location: Toulouse, France
Member
Thank you for the information, but I think that I already had a presentation of the Obeo Designer (we already met each other previously).
Re: Diff process on huge file is really long (v2.1.0M6) [message #1042513 is a reply to message #1039623] Tue, 16 April 2013 10:20 Go to previous message
Laurent Goubet is currently offline Laurent Goubet
Messages: 1625
Registered: July 2009
Senior Member
Hi Guillaume,

FYI, the changes made for this bug have been pushed and are available in the latest nightly build (p2 update site).

Laurent Goubet
Obeo
Previous Topic:Element addition should be refined by References set
Next Topic:Logical model compare (Compare 1.3.1, with SVN)
Goto Forum:
  


Current Time: Sat Aug 30 00:24:02 EDT 2014

Powered by FUDForum. Page generated in 0.02423 seconds