Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #676031 is a reply to message #673616]
Thu, 02 June 2011 11:01
Eclipse User |
Hello,
Well, I just finished implementing a "benchmark plugin" to compare the efficiency of proposal ranking algorithms. The results are very interesting.
Here is what I did: besides the default subwords proposals, I added two other engines that order proposals by relevance using the Jaro-Winkler and Levenshtein algorithms respectively (which I coded from scratch, as required by the EPL, with the help of descriptions found on the web).
I made some screenshots to illustrate all of this:
- Default Subwords Engine (screenshot)
- Jaro-Winkler algorithm (screenshot)
- Levenshtein algorithm (screenshot)
As you can see, the Levenshtein algorithm is very good at sorting items by accuracy. In some cases the Jaro algorithm can be confusing (it ends with mixed results that are harder to understand). So, even though both algorithms compare strings letter by letter, Levenshtein seems to prioritize matching "subwords" as we intend them (runs of consecutive matching letters), while Jaro simply counts matching letters across the whole strings.
One major problem raised by this feature is that it mixes method and field proposals, which is quite frustrating when using completion. I think we will have to implement an extra proposal sorter extension to keep the original presentation (methods first, then fields...).
More tests have to be done to check for possible limitations across multiple use cases, but I can already say that I did not notice any performance loss in displaying proposals. As the images show, I tested on a TextArea object, which holds a lot of members.
I can provide the sources if you want to check for yourselves.
I am still running some tests to see how these algorithms behave with upper/lower-case characters and whether they fully take "subwords" into account as we need.
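The sources aren't attached to the post, but the Levenshtein distance described here is the textbook dynamic-programming algorithm; a minimal Java sketch (class and method names are illustrative, not the plugin's actual API):

```java
// Sketch of the classic Levenshtein (edit) distance, as one might code it
// from scratch for EPL compatibility. Two rolling rows keep memory at O(|b|).
public class Levenshtein {

    /** Returns the minimum number of insertions, deletions and
     *  substitutions needed to turn {@code a} into {@code b}. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Proposals would then simply be sorted in ascending order of their distance to the typed token.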
Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #676560 is a reply to message #673616]
Sun, 05 June 2011 05:01
Eclipse User |
Hello!
Some news about the algorithms. After some tests, I found some limitations in our current usage.
I worked mainly on Levenshtein since it seems to be the cleaner choice for what we want (at least for now).
The first limitation is that it does not weight results by the length of the shared prefix: tell me if I'm wrong, but we want to prioritize proposals that share a common prefix, right? Yet here is what I get on a simple example (screenshot):
As you can see, "equalsIgnoreCase" is pushed to the bottom of the list because of its length... So I tried to find an extension of this algorithm that weights results by common prefix length. I could not find exactly what I wanted (if you know of one, please let me know), so I added a home-made calculation to get more precise results. Here is what I get (screenshot):

Now the Levenshtein result is adjusted to account for the common prefix. I don't really know whether my calculation is entirely correct, but it is quite simple and for now the results seem quite acceptable.
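The actual "home-made" formula isn't shown in the post; one plausible way to fold a common-prefix bonus into the score looks like this (purely illustrative, all names invented):

```java
// Illustrative only: normalizes the Levenshtein distance by length and then
// subtracts a bonus proportional to the shared prefix, so that a long
// proposal like "equalsIgnoreCase" is no longer punished for its length
// when the user typed "equals".
public class PrefixWeightedScore {

    /** Lower is better: length-normalized edit distance minus a prefix bonus. */
    public static double score(String token, String proposal) {
        double normalized = (double) levenshtein(token, proposal)
                / Math.max(token.length(), proposal.length());
        int prefix = commonPrefixLength(token.toLowerCase(), proposal.toLowerCase());
        double bonus = (double) prefix / Math.max(1, token.length());
        return normalized - bonus; // shared prefixes pull the score down
    }

    static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length()), i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] t = prev; prev = curr; curr = t;
        }
        return prev[b.length()];
    }
}
```

With this kind of adjustment, a full-prefix match outweighs the raw distance penalty of a long identifier.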
The second limitation is that Levenshtein does not really take subwords into account as we intend them (see the screenshot).
So I will try to find improvements, but let me know what you think about that...
Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #676640 is a reply to message #676629]
Sun, 05 June 2011 16:43
Eclipse User |
Hi Pef,
I would propose going for n-grams + a prefix bonus. That's quite a nice set of algorithms you implemented in just one week!
Regarding your concern about the computation (here we drift a little bit more into research ideas - you may skip this):
The scoring and evaluation can also be done "post mortem", i.e., if we track which tokens the users enter and what they finally select from the proposal list, we get into the comfortable position of knowing what would have been the best solution. Think of it as feedback from the programmer about what was actually relevant.
For illustration:
A programmer enters "gl" and triggers code completion.
The system now proposes, say, 20 elements that match "gl".
She then selects "getLayoutData()" from the list of proposals.
Now assume you have this kind of information for a few thousand completion events. With that data you can replay the completion events, try out several ranking strategies, and identify the one that performs best over all queries, i.e., the one that presents the actually selected proposal at the highest position most of the time.
With such an approach it becomes quite easy to evaluate which algorithm works best, whether we should use a weighted scoring function, etc. We just have to share the completion events... But as I said, with this we would get more and more into research. On the other hand, I'm sure that such a system would probably be the most advanced (context-free) completion system Eclipse has ever seen. Anyway, let's see how the n-grams work out first.
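The "post mortem" replay idea could be sketched roughly like this; the event and strategy types below are invented for illustration, not an existing Eclipse API:

```java
import java.util.List;

// Illustrative sketch of replay-based evaluation: record completion events,
// then re-rank each event's candidates with a strategy under test and check
// at which position the proposal the user actually selected would appear.
public class ReplayEvaluator {

    /** One recorded completion event: the token the user typed, the
     *  candidates that matched it, and the proposal finally selected. */
    public record CompletionEvent(String typedToken, List<String> candidates, String selected) {}

    /** A ranking strategy orders the candidates for a given typed token. */
    public interface RankingStrategy {
        List<String> rank(String typedToken, List<String> candidates);
    }

    /** Mean 1-based rank of the selected proposal over all events;
     *  lower is better, 1.0 means the selection was always on top. */
    public static double meanSelectedRank(RankingStrategy strategy,
                                          List<CompletionEvent> events) {
        long total = 0;
        for (CompletionEvent e : events) {
            List<String> ranked = strategy.rank(e.typedToken(), e.candidates());
            total += ranked.indexOf(e.selected()) + 1; // 1-based position
        }
        return (double) total / events.size();
    }
}
```

Running `meanSelectedRank` over the same event log with the Levenshtein, Jaro-Winkler, and n-gram strategies would directly show which one puts the selected proposal highest on average.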
Best,
Marcel
Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #686383 is a reply to message #685578]
Wed, 22 June 2011 02:42
Eclipse User |
I'm afraid I'm just going to offer some opinions; I don't know if they will be of value. I haven't looked at the code yet, so my points may be uninformed.
I'm sure you've already discovered the "reward" formula in Winkler -- this approach to rewarding shows up a lot: w = j + (c * 0.10 * (1 - j)), where w is the resulting Winkler score, j is the Jaro score, c is the number of common prefix characters in [0, 4], and 0.10 is the "scaling" factor. In the edit-distance applications I work on (record linkage, fuzzy search engines, etc.) we often tweak the range of c and the 0.10 scaling factor. Sounds like you're on the right track though, given your results above.
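The quoted reward formula translates directly into code; a tiny sketch (the class and method names are mine):

```java
// The Winkler prefix "reward" quoted above: w = j + (c * p * (1 - j)).
// j is the Jaro score in [0,1], c the number of common prefix characters
// (conventionally capped at 4), and p the scaling factor (typically 0.10).
public class Winkler {

    public static double winkler(double jaro, int commonPrefixLength, double scaling) {
        int c = Math.min(commonPrefixLength, 4); // cap the prefix reward
        return jaro + c * scaling * (1 - jaro);  // boosts toward 1.0, never past it
    }
}
```

Tuning the cap on c and the scaling p, as described above, changes how aggressively shared prefixes are rewarded.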
I like the bigram/trigram approach in general, as I feel it will be better suited than Levenshtein and Jaro -- and faster. I agree with your observation that Levenshtein and Jaro don't really give you what you want -- you want to be much more conservative in most cases; finding matches where the characters are interspersed is bad. I would think bigrams would be great at this -- and still provide resilience against misspellings. Plus, there are only a few thousand unique bigrams, so you can do lots of packed-representation optimizations. And if you sort them beforehand, counting the common ones is O(n).
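The sorted-bigrams trick above can be sketched as follows, using the Dice coefficient as one common way to turn the shared-bigram count into a similarity (names are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the bigram idea: extract character bigrams, sort them once, then
// count the common ones with a single merge-style pass over the two sorted
// lists, which is the O(n) counting mentioned above.
public class Bigrams {

    static List<String> bigrams(String s) {
        List<String> result = new ArrayList<>();
        String lower = s.toLowerCase();
        for (int i = 0; i + 1 < lower.length(); i++) {
            result.add(lower.substring(i, i + 2));
        }
        Collections.sort(result); // sort once so matching is a linear merge
        return result;
    }

    /** Dice coefficient: 2 * |common bigrams| / (|A| + |B|), in [0, 1]. */
    public static double dice(String a, String b) {
        List<String> ba = bigrams(a), bb = bigrams(b);
        if (ba.isEmpty() || bb.isEmpty()) return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
        int i = 0, j = 0, common = 0;
        while (i < ba.size() && j < bb.size()) { // merge over the sorted lists
            int cmp = ba.get(i).compareTo(bb.get(j));
            if (cmp == 0) { common++; i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return 2.0 * common / (ba.size() + bb.size());
    }
}
```

Because interspersed single-character matches rarely form shared bigrams, this scores conservative, contiguous matches much higher than Jaro does.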
I have implementations of Jaro and Winkler that I could provide if you haven't already coded your own. I haven't looked at the open-source GPL one you mentioned, but it's probably similar. I also have an "optimized" Jaro that runs in O(n log n) instead of pseudo-O(n^2) (technically Jaro is O(n) but doesn't behave like it in practice), but I can't donate it, as it's covered by a patent.
In general I'm super excited about this! Let me know how I can help!