Eclipse Community Forums: Code Recommenders » Help Wanted: Ranking Strategy for Subwords Completion Engine

Search

Help

Login

Home » Archived » Code Recommenders » Help Wanted: Ranking Strategy for Subwords Completion Engine

Show: Today's Messages :: Show Polls :: Message Navigator

Switch to threaded view of this topic

Create a new topic

Submit Reply

Help Wanted: Ranking Strategy for Subwords Completion Engine [message #673616]

Tue, 24 May 2011 09:07

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Hi,

after Paul-Emmanuel volunteered to contribute to the subwords completion engine I just added a "helpwanted" bug to Bugzilla about implementing a ranking strategy for subwords completion engine.

If some of you like to jump in and spent a few hours on implementing this feature as a little "finger exercise", your contribution is greatly appreciated Smile

Smile

Please find the details there:

https://bugs.eclipse.org/bugs/show_bug.cgi?id=346947

Best,
Marcel

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #674108 is a reply to message #673616]

Wed, 25 May 2011 20:44

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Hello(again) Marcel,

Reading Jaro-Winkler article gave me some ideas on how we could push it into our current code... I don't want to monopolize features development, but I am quiet interested in managing this, if no one else commits.

Keep me aware!

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #674142 is a reply to message #674108]

Wed, 25 May 2011 22:59

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Hi Paul-Emmanuel,

feel invited to go for this too Smile

Smile

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #674412 is a reply to message #674142]

Thu, 26 May 2011 20:31

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Great!

But first of all, how can I implement Jaro-Winkler algorithm? I mean, I read in Eclipse Legal Process that it is not recommended/easy to use non-Eclipse content... So can I re-use open source libraries like http://staffwww.dcs.shef.ac.uk/people/sam.chapman@k-now.co.uk/stringmetrics.html#jarowinkler ?

Thanks

[Updated on: Thu, 26 May 2011 20:32]

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #674422 is a reply to message #674412]

Thu, 26 May 2011 21:10

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Unfortunately, SimMetrics is under GPL license. Thus, we are not allowed reuse nor copy it Sad

Sad

This requires us to re-implement it from scratch - without copying any lines from it. Licenses are one of the odd things in open-source...

For details about EPL & GPL you may read http://www.eclipse.org/legal/eplfaq.php#GPLCOMPATIBLE

Best,
Marcel

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #674485 is a reply to message #674422]

Fri, 27 May 2011 06:27

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

After thinking about it for a while:

Using JaroWinkler is just a guess what algorithm might work well for this. This means it may turn out that it is a perfect match - or not. How do you feel about using SimMetrics for a proof of concept just to see how it works? Then we could easily use different metrics to evaluate how they perform w/o efforts implementing them from scratch.

Since this code cannot go to Eclipse then, Johannes could create a public git repository on our repository hosting server so that we can share the intermediate results outside Eclipse (w/o licensing issues) and integrate it into our CI server.

How about that?

[Updated on: Sun, 05 June 2011 09:56]

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #674496 is a reply to message #674485]

Fri, 27 May 2011 06:51

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

I agree, creating this server could let us make and share some attempts, to make choices. So I'm in, give me the credentials when you have it.

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #674537 is a reply to message #674496]

Fri, 27 May 2011 09:35

Johannes Lerch is currently offline

Friend

Messages: 18
Registered: February 2011

Junior Member

I set up an repository for that. Project name is subwords.
I also created a wiki page to document how the repository can be accessed: http://wiki.eclipse.org/Recommenders/Labs#Access_to_lab_repositories
Tell me if something is not clear or you get stuck.

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #675399 is a reply to message #674537]

Tue, 31 May 2011 11:58

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Hello,

I found that we may use Levenshtein distance algorithm (http://www.merriampark.com/ld.htm). It seems like it likely matches to what we are looking for to compare our subwords... Tell me if I'm wrong.

This algorithm is quite simple to implement, as I have already written a simple 50 lines utility that seems to work efficiently.

I only found one limitation : it seems that some implementations rise OutOfMemory errors when comparing long strings... But I think we will not face this problem in our use case.

I can provide my snippet if needed.

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #675412 is a reply to message #675399]

Tue, 31 May 2011 12:47

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Levenshtein is great for very similar words and slightly misspelled ones (typos). One drawback with Levenshtein is that it does not prefer common prefixes over general matches - a feature which seems valuable as pointed out by Deepak in the blog post. But I also came across Levenshtein and wasn't sure which one to take. That's why I asked for a recommendation at SimMetrics forum a two weeks ago:

http://sourceforge.net/projects/simmetrics/forums/forum/621875/topic/4535973

How about that:
Implement both and create a "benchmark" for both approaches, i.e., a few samples that show how code completion would work with Levenshtein and JaroWinkler? Later we may make the completion algorithms exchangeable to see how both work and to ease switching to newer ones?

What do you think?

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #675546 is a reply to message #675412]

Tue, 31 May 2011 20:06

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Ok I understand... I will try to build some meaningful examples benchmarked with the two algorithms. I will present you the results as soon as I have ones.

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #676031 is a reply to message #673616]

Thu, 02 June 2011 15:01

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Hello,

Well, I just finished implementing a "benchmark plugin" in order to compare proposals ranking algorithms efficiency. Results are very interesting.

Here is what I did : I added, besides default subwords proposals, two other engines that add relevance ordering on proposals using respectively Jaro-Winckler and Levenshtein algorithms (that I coded from scratch as needed by EPL, helped by descriptions found on the web).

I made some screenshots to illustrates all of this :

Default Subwords Engine
Jaro-Winckler algo.
Levenshtein algo.

As you can see, Levenshtein algo is very interesting to sort items by accuracy. In some cases, Jaro algo may be confusing (ends with melted results, less understandable). Thus, even if both algos compare strings letter by letter, Levenshtein's one seems to prior matching "subwords" as we intend it (successions of similar letters) while Jaro's one simply counts matching letters on the whole strings.

One major problem risen by this feature is that it mixes methods and fields proposals... It is quite frustrating when using completion. I think we will have to implement extra proposal sorter extension to keep original presentation (first methods, then fields...)

More tests have to be done to check possible limitations on multiple use cases but I can already say that I did not notice any performance loss in diplaying proposals. As images show, I tested on TextArea object that holds lot of items.

I can provide sources if you decide to check by yourself.

I am still running some tests to see how these algo behave with upper/lower-case characters and if they entirely take into account "subwords" as we need it.

Attachment: Subwords_default_completion.jpg
(Size: 388.39KB, Downloaded 2051 times)
Attachment: Subwords_jaro_completion.jpg
(Size: 403.62KB, Downloaded 2139 times)
Attachment: Subwords_levenshtein_completion.jpg
(Size: 381.01KB, Downloaded 2081 times)

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #676102 is a reply to message #676031]

Thu, 02 June 2011 20:31

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Hi Pef,

nice piece of work!

Thanks for the screenshots. I think it would be great if we all could try out the different algorithms for a while to see how they work. Could you attach the sources to https://bugs.eclipse.org/bugs/show_bug.cgi?id=346947 so that we can try them locally?

Best,
Marcel

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #676560 is a reply to message #673616]

Sun, 05 June 2011 09:01

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Hello!

Some news about the algorithms. After some tests, I found some limitations on our current usage.
I worked mainly on Levenshtein's one since it seems to be cleaner for what we want (at least for now).

First limitation is that it lacks of weighting results on common prefix share : tell me if I'm wrong, but we want to prior proposals that share common prefixes, right? But here is what I have on a simple example :

index.php/fa/2938/0/

index.php/fa/2938/0/

As you can see, "equalsIgnoreCase" is thrown to the bottom of the list due to its length... So, I tried to find an extension of this algo that could leverage results with common prefix length. I could not find precisely what I wanted (if you know some, please let me know) so I tried to add a home-maid calculation to get more precise results. Here is what I get :

index.php/fa/2939/0/

index.php/fa/2939/0/

Now, Levenshtein result is leveraged with a common prefix care. I don't really know if my calculation is entirely correct, but it is quite simple and for now it seems quite acceptable.

The second limitation is that Levenshtein does not really take into account subwords as we intend it. See :

index.php/fa/2940/0/

index.php/fa/2940/0/

So I will try to find improvements, but keep me aware of what you about that...

Attachment: default_levenshtein.jpg
(Size: 140.59KB, Downloaded 2447 times)
Attachment: pounded_levenshtein.jpg
(Size: 138.87KB, Downloaded 2151 times)
Attachment: not_subwords_levenshtein.jpg
(Size: 246.22KB, Downloaded 2002 times)

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #676567 is a reply to message #676560]

Sun, 05 June 2011 09:53

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Hi Pef,

I like your approach to get prefix matches on top. This is definitely a feature I would like to keep.
Regarding the subwords matching. Did you consider to use q-grams for this? I can imagine this may be more what's needed to score subwords (given your experience on the other metrics)?

http://www.ryan-h.com/kdb/q-grams-for-fuzzy-string-matching-in-kdb/
http://staffwww.dcs.shef.ac.uk/people/sam.chapman@k-now.co.uk/stringmetrics.html#qgram

[Updated on: Sun, 05 June 2011 09:55]

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #676611 is a reply to message #676567]

Sun, 05 June 2011 17:36

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Hi,

Well I read some papers about n-grams counting, it is quite interesting! But I have a question about all of that : how should we leverage Levenshtein result with this algo and "my" prefix matching? I mean : for now, Levenshtein give us the differences count between strings. Then, I substract my home-maid calculation to reduce the gap depending on prefixes. Now n-grams could give us some more matching count, but what weight should we give it regarding the others?
Another related question : what n-gram should we use? I think that 3-gram is the most interesting in our use case. But we may try to also look for 4-grams ou 5-grams to increase subwords' length matches and proposal relevance... But this kind of search seems a little bit greedy in calculation time...
What do you think?

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #676617 is a reply to message #676611]

Sun, 05 June 2011 17:55

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

I think 2- or 3-grams should be fine, right? In addition: No matter what works best at then end, in a smart implementation this just a configuration param Smile

Smile

Regarding combination of n-grams with Leventshtein: Do you already see an advantage of combining both metrics? I can imagine to just use your prefix scoring + n-grams. However, "scoring" is something that we can "learn" later on. Literally Smile

Smile

You may just come up with a weighted sum formula that considers all three metrics. The weights then can be tweaked later on (after we have some user data) to find the best weights for them. This project is getting more and more interesting Smile

Smile

See http://en.wikipedia.org/wiki/Weight_function for what I mean with a weighted sum formula.

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #676629 is a reply to message #676617]

Sun, 05 June 2011 19:02

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Right... My fear about a sum formula is that it might be too much time consuming for all proposals. And I am not sure that users might be able to choose between some strategies to sort proposals (if we let them chose by configuration).
All those ideas are interesting but for our project I think we have to make a choice or we will scatter in too complex computations, won't we?
The more I read docs about n-grams the more I think I might be more pertinent to our problem : when users type a few letters and invoque content assist, they may have a little idea of what they are looking for. I think they may have typed succession of subwords rather than random letters. So Levenshtein loses weight compared to n-grams + prefix scoring, what do you think? In this point of view, we may prefer implement the second strategy rather than Lvenshtein's one.
Don't hesitate to contradict me, I'm just thinking aloud!

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #676640 is a reply to message #676629]

Sun, 05 June 2011 20:43

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Hi Pef,

I would propose to go for n-grams + prefix bonus. That's a quite nice set of algorithms you implemented in just one week Smile

Smile

Regarding your computation "fear" (here we drift little bit more into research ideas - you may skip this Smile

Smile

)

The scoring and evaluation can also be done "post mortem", i.e., if we track what tokens the users enter and what they finally selected from the proposal list, we get into the comfortable position to know what would have been the best solution. Think of it as some kind of feedback a programmer gives you what was actually relevant.

For illustration:
Given a programmer enters "gl" and triggers code completion.
The system now proposes, say, 20 elements that match the regex "gl".
She then selects "getLayoutData()" from the list of proposals.
Now assume you have such kind of information for a few thousand completion events. With this data you can replay the completion events and try out several ranking strategies and identify the one that performs best over all queries, i.e., presents the actually selected proposal on the highest position most of the time.

With such an approach it becomes quite easy to evaluate which algorithm works best or whether we should use a weighted scoring function etc. We just have to share the completion events... But as I said, with this we would get more and more into research. On the other side I'm sure that such a system would probably be the most advance (context-free) completion system Eclipse has ever seen Smile

Smile

Anyways, let's see how q-grams works out first Smile

Smile

Best,
Marcel

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #683485 is a reply to message #673616]

Mon, 13 June 2011 20:55

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Hello,

Following your advices, I created a little app to help us compare results generated by all algorithms, I mean simplier than the Eclipse plugins.
I joint an executable Jar file that launches a simple tool made to append proposals and differents results.

I'm going to make deeper tests in the next few days, but I provide you this tool that may help.

File is normally accessible here : http://fyels.com/ttK00p

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #683490 is a reply to message #683485]

Mon, 13 June 2011 21:11

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Hi Pef,

can you make the Eclipse source project available? I'm having trouble to execute your jar since my paths and swt versions differ:

Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:58)
Caused by: java.lang.UnsatisfiedLinkError: Could not load SWT library. Reasons:
no swt-win32-3659 in java.library.path
no swt-win32 in java.library.path

A zip of the source project should work.
Thanks,
Marcel

[Updated on: Mon, 13 June 2011 21:11]

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #683974 is a reply to message #673616]

Tue, 14 June 2011 19:24

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Here is the zip file containing the Eclipse project. I hope you will be able to run it.

Attachment: Tests.zip
(Size: 25.17KB, Downloaded 487 times)

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #683986 is a reply to message #683974]

Tue, 14 June 2011 19:53

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Your Zip contains SubwordsBenchmark and SWTResourceManager only. Are there any other files?

How about using the git repository Johannes created?

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #684016 is a reply to message #673616]

Tue, 14 June 2011 21:24

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Sorry about that... This zip may be complete.
How could we use this GIT?

Attachment: workspace.zip
(Size: 75.89KB, Downloaded 498 times)

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #684267 is a reply to message #674537]

Wed, 15 June 2011 10:05

Johannes Lerch is currently offline

Friend

Messages: 18
Registered: February 2011

Junior Member

I posted a hint about that some posts before:

Johannes Lerch wrote on Fri, 27 May 2011 05:35

I set up an repository for that. Project name is subwords.
I also created a wiki page to document how the repository can be accessed: http://wiki.eclipse.org/Recommenders/Labs#Access_to_lab_repositories
Tell me if something is not clear or you get stuck.

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #685578 is a reply to message #684267]

Fri, 17 June 2011 19:43

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

The project is commited, I think. I hope I did not make any mistake while pushing to repo...
Keep me aware of your opinion about it.

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #686383 is a reply to message #685578]

Wed, 22 June 2011 06:42

Steve Ash is currently offline

Friend

Messages: 4
Registered: June 2011

Junior Member

I'm afraid I'm just going to be offering some opinions. Don't know if it will be of value. I haven't looked at the code yet, and so my points may be uninformed.

I'm sure you've already discovered the "reward" formula in winkler-- this approach to rewarding shows up a lot: w = j + (c * 0.10 * (1 - j)) where w is the resulting winkler, j is the jaro score, c is the number of common prefix characters in [0,4] and 0.10 is the "scaling" factor. So in the edit distance apps I work on (record linkage, fuzzy searching engines, etc.) we often mess with the range of c and the scaling factor of 0.10. Sounds like youre on the right track though, given your results above.

I like the bi or tri-grams approach in general, as i feel it will be better suited than leven and jaro-- and faster. I agree with your observation though that leven and jaro both dont really give you what you want -- you want to be much more conservative in most cases. finding matches where characters are interspersed is bad. I would think the bigrams would be great at this -- and still provide resilience against misspellings. Plus there are only a few thousand unique bigrams so you can do lots of packed representation optimizations. And if you order them beforehand, then counting the common ones is O(n).

I have implementations of jaro and winkler that I could provide if you guys haven't already coded another. I haven't looked at the open source gpl one you looked at, but its probably similar. I have an "optimized" jaro which operates at O(nlogn) instead of pseudo (n^2) (technically jaro is n but doesnt act like it in practice) but I can't donate it as its covered by a patent.

In general I'm super excited about this! Let me know how I can help!

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #686531 is a reply to message #685578]

Wed, 22 June 2011 12:53

Johannes Lerch is currently offline

Friend

Messages: 18
Registered: February 2011

Junior Member

Paul-Emmanuel Faidherbe wrote on Fri, 17 June 2011 15:43

The project is commited, I think. I hope I did not make any mistake while pushing to repo...
Keep me aware of your opinion about it.

Sry, but i have to tell you that something went wrong Wink

Wink

You can find comments directly in gerrit attached to your commit: http://vandyk.st.informatik.tu-darmstadt.de:8081/#change,214

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #686611 is a reply to message #686383]

Wed, 22 June 2011 15:32

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Hey Steve,

you help is greatly appreciated. So far, Pef did most of the work on the ranking strategies. Thus, I've to catch up with his latest changes to see where we are at the moment. He's currently pushing his code to code recommenders' labs so that everyone can checkout try out the different strategies.

One thing that comes in my mind which would be great to figure out (although it's more a UI task than an algorithmic one): When you trigger code completion there is no highlighting which characters match in the proposal. IDEA has this neat highlighting (see http://goo.gl/BY0El for an example). It would be cool if you or someone else can spent a few moments to figure out whether this is possible with current implementations or how to achieve that. This is just an intermediate action until we have everything in place to get the real tool evaluation done but a neat one.

Aside this, it would be great to your feedback on the different ranking strategies Pef worked on. I think we will have all code related to the proposals engine in the lab repository by end of the week to you can see what's there and how completion works with current solution. I'll post an update if we have a lab build up an running for this.

Thanks,
Marcel

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #688297 is a reply to message #686611]

Fri, 24 June 2011 11:46

Johannes Lerch is currently offline

Friend

Messages: 18
Registered: February 2011

Junior Member

I just checked the benchmark tool made by pef (see attachment for a screenshot). It gives a nice feeling about how the algorithms will behave on real data and in which cases they do not fit the expectations.
To give you insight of that experience i want to give a hint on how to get the code: You should have read access as anonymous user on the repository, so you can clone it by using the command: git clone ssh://vandyk.st.informatik.tu-darmstadt.de:29418/subwords
To get write access follow the steps here: http://wiki.eclipse.org/Recommenders/Labs

After you have the code, you can run the tool by launching the main in class SubwordsBenchmark. That file is located in the com.developpef.benchmarktool project.

I'm currently thinking about to start writing some unit-tests to specify what results i would like to see for different prefixes. Maybe this will be a good basis to discuss which results you expect and finally to agree on one target. And of course we already have tests then Smile

Smile

Attachment: subwords-benchmark.png
(Size: 31.27KB, Downloaded 769 times)

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #688300 is a reply to message #688297]

Fri, 24 June 2011 11:54

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Just a little precision. Looking at the screenshot I realized I forgot to specify that we are currently using a 3-gram algorithm. This seemed a pretty good choice to start with but maybe we could do more tests with 2-grams for example, or even with combinations of both. I may commit a little evolution of the tool regarding this option.

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #688345 is a reply to message #688300]

Fri, 24 June 2011 13:31

Johannes Lerch is currently offline

Friend

Messages: 18
Registered: February 2011

Junior Member

I added some tests (Class ExpectedScorings in *.tests.codecompletion.subwords.benchmark). Please see them as basis for discussions.
I added JaroWinkler as scoring algorithm just to have any algorithm in there. Feel free to try others.

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #689064 is a reply to message #688345]

Sun, 26 June 2011 17:41

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Hi,

I added the little feature on the test tool about paramaterized Ngrams. If you test it, you could see that with 2 chars length window, we have a more precise granularity on proposals. I think this could be enhanced with common prefix weight like I added on Levenshtein.

I am also testing your scoring expectations class to see how we could converge to.

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #689093 is a reply to message #689064]

Sun, 26 June 2011 20:04

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Hi Pef,

did you push your changes? Regarding your proposal: We should definitely add the prefix scoring and use 2grams. Can you make a guess when you have this ready? I would like to announce subwords as soon as possible Smile

Smile

Best,
Marcel

[Updated on: Sun, 26 June 2011 20:04]

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #689110 is a reply to message #689093]

Sun, 26 June 2011 21:14

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

I think I pushed everything. All is in Gerrit. I just added a home-maid calculation to leverage results with common prefix + NGrams + lengths. Best results shown with 2-grams. Keep me aware of what you think about it.

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #689111 is a reply to message #689110]

Sun, 26 June 2011 21:22

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

i just "found" it. I though you would push directly into head. My first impression is quite good. I'll give it a try tomorrow the whole day to see how it works out in daily use.

BTW:

When I enter "acc" as input I get an exception:

---------------- Jaro-Winckler ----------------
Prefix : acc
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.eclipse.recommenders.rcp.codecompletion.subwords.benchmark.relevance.SubwordsJaroWinklerDistance.calculateDistance(SubwordsJaroWinklerDistance.java:100)
at org.eclipse.recommenders.rcp.codecompletion.subwords.benchmark.relevance.SubwordsJaroWinklerDistance.calculateRelevanceDistance(SubwordsJaroWinklerDistance.java:37)
at org.eclipse.recommenders.tests.codecompletion.subwords.benchmark.ProposalsDistanceBenchmark.testJaroDistance(ProposalsDistanceBenchmark.java:110)
at com.developpef.SubwordsBenchmark$2.widgetSelected(SubwordsBenchmark.java:157)
at org.eclipse.swt.widgets.TypedListener.handleEvent(TypedListener.java:234)
at org.eclipse.swt.widgets.EventTable.sendEvent(EventTable.java:84)
at org.eclipse.swt.widgets.Display.sendEvent(Display.java:3783)
at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1375)
at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1398)
at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1383)
at org.eclipse.swt.widgets.Widget.notifyListeners(Widget.java:1195)
at org.eclipse.swt.widgets.Display.runDeferredEvents(Display.java:3629)
at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3284)
at com.developpef.SubwordsBenchmark.open(SubwordsBenchmark.java:66)
at com.developpef.SubwordsBenchmark.main(SubwordsBenchmark.java:50)

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #689117 is a reply to message #689111]

Sun, 26 June 2011 21:44

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Bug is now fixed

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #689129 is a reply to message #689117]

Sun, 26 June 2011 22:52

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

It works. Can you create a patch containing your current 2gram implementation with less than 200 lines and attach this to bugzilla?

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #689194 is a reply to message #689129]

Mon, 27 June 2011 06:32

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

To wich bugzilla do you want me to attach this? Bug 346947?

[Updated on: Mon, 27 June 2011 06:35]

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #689209 is a reply to message #689194]

Mon, 27 June 2011 07:21

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

exactly

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #690420 is a reply to message #689209]

Wed, 29 June 2011 13:49

Johannes Lerch is currently offline

Friend

Messages: 18
Registered: February 2011

Junior Member

I just integrated your latest contribution. You might have to search a little to find it, as i refactored a lot of stuff. I also added some tests and found some bugs. For example i think qgrams is symmetric and your implementation was not. I fixed those issues, but i would appreciate that you will take a look on that to verify that i didn't broke it.

The main classes of interest are now SubwordsRelevanceCalculator, QGramSimilarity and the corresponding test classes SubwordsRelevanceCalculatorTest, QGramSimilarityTest and ExpectedScoringsTest. As you can see your current approach matches my expectations. Nice job!

But can you give me more details on the weighting function? What was the intent of the factor 20 and why is the difference of the length of both strings relevant?

For all who want to test the current state: Get it from the head channel. I turned off the default java proposals and will check what it feels like to only have the subwords completion.

[Updated on: Thu, 30 June 2011 07:10]

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #690604 is a reply to message #690420]

Wed, 29 June 2011 20:03

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Hi,

I just installed the subwords completion. This engine is great!

During my surface tests I found a situation I'm a bit surprised of the ranking results. Please have a look at the screenshot:
index.php/fa/3182/0/

index.php/fa/3182/0/

Is there any special reason why "getDisplay" is ranked before "getLayoutData" ?

Attachment: Screen shot 2011-06-29 at 21.57.45.png
(Size: 86.90KB, Downloaded 1625 times)

[Updated on: Thu, 30 June 2011 04:59]

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #690729 is a reply to message #690604]

Thu, 30 June 2011 05:46

Johannes Lerch is currently offline

Friend

Messages: 18
Registered: February 2011

Junior Member

There is a reason. The weighting function gives a higher rank if the difference between token and completion string length is low.
In your case lay is shorter than both completions, so the shorter completion will be preferred. I asked in my previous post what the idea behind this is, as i am also surprised by the result.

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #690730 is a reply to message #690729]

Thu, 30 June 2011 05:49

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Ah, ok. Given the example I think it's a bit unexpected behavior. I think the earlier a token could be found the higher the proposal should be, right?

[Updated on: Thu, 30 June 2011 05:49]

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #690748 is a reply to message #690730]

Thu, 30 June 2011 06:56

Johannes Lerch is currently offline

Friend

Messages: 18
Registered: February 2011

Junior Member

I'm able to follow that argumentation, but i think it's more of theoretic nature. I would prefer to see any completions that have the same matchings (ignoring the position of that match if it's not a prefix) in alphabetic order, because that is what i usually will see.

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #690756 is a reply to message #690748]

Thu, 30 June 2011 07:20

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Agreed. Let's drop the string length stuff for the moment, ignore my idea about matching bigram locations, and stick to a plain alphabetical order + prefix + bigrams scoring (+ JDT base score).

Thanks,
Marcel

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #691307 is a reply to message #690756]

Fri, 01 July 2011 07:51

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Just to keep you updated:

We updated the ranking strategies and also added a highlighting to subwords:
index.php/fa/3205/0/

index.php/fa/3205/0/

You can install the latest version directly from the head update site (subwords feature). Your feedback is appreciated and always welcome!

Attachment: subwords-highlighting.png
(Size: 53.12KB, Downloaded 1601 times)

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #696040 is a reply to message #673616]

Wed, 13 July 2011 06:57

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Hi,
Sorry for my absence but I just had some hard days at work... I will update your modifications and try to improve my code regarding your latest remarks if still needed. As you infered it, my weighting strategy was to prior shortest proposal matches (as Levenshtein proposed). But your choices seem interesting. I will give you my feedback as soon as I tested it.
Thanks for your investment!

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #696053 is a reply to message #696040]

Wed, 13 July 2011 07:43

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Hi Pef,

we adapted some parts of your code. You find the latest version in the org.eclipse.recommmenders git repository. We came across an issue with relevance scoring since JDT does not support updating the relevance of a proposal. This is annoying and confusing when typing.

I'm not sure whether we can (yet) deliver the bigrams solution with that issue. I reported the issue here: https://bugs.eclipse.org/bugs/show_bug.cgi?id=350991 JDT is yet not very responsive because of the upcoming Java7 release. But I hope we'll get a solution rather soon.

In the meanwhile it would be cool if you could install the latest version from HEAD update site and give it a try?

Best,
Marcel

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #698749 is a reply to message #696053]

Wed, 20 July 2011 07:24

Paul-Emmanuel Faidherbe is currently offline

Paul-Emmanuel Faidherbe

Friend

Messages: 30
Registered: April 2011
Location: france

Member

Hi,

Just to inform you, I still had no time to take a look at the new code... So sorry. And I am now leaving for my two weeks summer holidays, so I will not be able to help you more for this time.

I hope you will do great things, but I know you all rock! Smile

Smile

Regards.

Report message to a moderator

Re: Help Wanted: Ranking Strategy for Subwords Completion Engine [message #698761 is a reply to message #698749]

Wed, 20 July 2011 07:43

Marcel Bruch is currently offline

Friend

Messages: 289
Registered: July 2009

Senior Member

Have a good time. When you are back, subwords will be part of dev - unfortunately without relevance ranking until https://bugs.eclipse.org/bugs/show_bug.cgi?id=350991 is fixed.

Best,
Marcel

Report message to a moderator

Switch to threaded view of this topic

Create a new topic

Submit Reply

Previous Topic:	Early Access: Extended Javadoc View Prototype
Next Topic:	.recommenders folder is not removed along with its nature

Goto Forum:

-=] Back to Top [=-

[ Syndicate this forum (XML) ] [

]

Current Time: Wed Sep 25 21:44:11 GMT 2024

Powered by FUDForum. Page generated in 0.08722 seconds

.:: Contact :: Home ::.

Powered by: FUDforum 3.0.2.
Copyright ©2001-2010 FUDforum Bulletin Board Software

Back to the top