Eclipse Community Forums: Platform - User Assistance (UA)

Infocenter and lucene bug [message #718304]

Tue, 23 August 2011 17:44

Eclipse User

Hi,

I've deployed an Infocenter as a WAR and run it both under Jetty, Tomcat, etc. and embedded with Jetty. We're getting ready to publish our first release, but during usability testing we recently discovered that the Lucene StandardAnalyzer has a bug whereby tokens are split on the underscore character. I verified this with some quick internet searches and found that it has apparently been fixed in Lucene 3.1. Obviously, for technical documentation this is a pretty big issue. I've been working with Helios, which has Lucene 1.9, and just checked Indigo and found that it is still only at 2.9.1. It looks to me like somewhere between 2.9.1 and 2.9.4 the Lucene jars became packaged differently (e.g. since 2.9.4 they use lucene.Analyzers instead of lucene.analysis). Is there any way to patch the Infocenter so that it can use Lucene 3.1? Alternatively, since I pre-generate the indexes, can I use the newer indexer to create them, and do you think search would still function properly?

regards,

Re: Infocenter and lucene bug [message #718518 is a reply to message #718304]

Wed, 24 August 2011 11:23

Eclipse User

Answering my own question, though someone more knowledgeable please comment.

Apparently the indexer used to create the indexes (even its specific version) must match the run-time in order for them to work. So using the newer indexer without patching the run-time is out.

Also, there was an API break at Lucene 3.0, so patching the Infocenter is probably either out of the question or just too bloody hard.

Lucene 2.9.1, which is in Indigo and was supposed to have all of the features of 3.0 except the API change, still exhibits the bug.

So my only remaining question is: is there another way to fix this? Can I tell the Infocenter to use a custom indexer that I build and include in my product plug-in? Also, I found an extension point that lets me capture the search string before it is sent to the search engine. Perhaps I can capture the text and quote any words I find containing underscores?

regards,

Re: Infocenter and lucene bug [message #718849 is a reply to message #718518]

Thu, 25 August 2011 09:57

Eclipse User

So, in the dearth of responses, I'm forging ahead and trying to override the analyzer by adding the following extension to my main plugin.xml:

<extension
      point="org.eclipse.help.base.luceneAnalyzer">
   <analyzer
         class="org.apache.lucene.analysis.standard.StandardAnalyzer"
         locale="en">
   </analyzer>
</extension>

I then added the lucene-core.3.0.2.jar to my project and referenced org.apache.lucene.analysis.standard in the imported packages, which put it in the manifest.

The subsequent build appears to ignore my pre-generated indexes and builds its own, but the ordering of the index still indicates token breaks on underscore, and the indexed_dependencies file shows we're still using Lucene 1.9.1:

#This is a generated file; do not edit.
#Thu Aug 25 09:47:41 EDT 2011
lucene=1.9.1.v20100518-1140
analyzer=org.eclipse.help.base\#3.5.2.v201009090800?locale\=en

Re: Infocenter and lucene bug [message #719068 is a reply to message #718518]

Thu, 25 August 2011 19:58

Eclipse User

As you have discovered, replacing Lucene is not so easy. While older indexes will work with newer versions of Lucene, the converse is not true: newer indexes will not work with older Lucene versions. It's also quite possible that Lucene 3.x is not backward binary compatible with 2.x, which would mean that the only way to get 3.x to work with Eclipse 3.7 would be to rebuild the help plug-ins from source, which would be quite a lot of work.

Are you noticing this problem only in certain locales? Eclipse will select the analyzer most appropriate for the locale you are in.

Re: Infocenter and lucene bug [message #719352 is a reply to message #719068]

Fri, 26 August 2011 15:20

Eclipse User

Hi Chris,

Thanks for chiming in.

I've found that the analyzer isn't the Lucene StandardAnalyzer like I thought, but one the Eclipse help guys appear to have written. In my specific test case I think it is the English analyzer, org.eclipse.help.internal.search.Analyzer_en, specified in the help base plug-in.

Looking at the code I found online (I'll download the source jar to make sure), it looks like Analyzer_en calls the LowerCaseAndDigitsTokenizer:

return new PorterStemFilter(new StopFilter(new LowerCaseAndDigitsTokenizer(reader), STOP_WORDS));

The LowerCaseAndDigitsTokenizer looks like it is true to its name:

"Tokenizer breaking words around letters or digits." The code as I read it would break on an underscore.
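To make the difference concrete, here is a minimal, self-contained sketch of a character-class tokenizer. It is not the Eclipse source and does not use the Lucene API; `TokenizerSketch`, `tokenize`, and the two predicates are hypothetical names chosen for illustration. It only shows how the rule "a token is a run of letters or digits" splits JAVA_HOME, and how also accepting the underscore keeps it whole.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-in for a character-class tokenizer; not the Eclipse source.
class TokenizerSketch {

    // Letters and digits only: an underscore ends the current token.
    static final IntPredicate LETTERS_AND_DIGITS = Character::isLetterOrDigit;

    // Same rule, but an underscore also counts as part of a word.
    static final IntPredicate WITH_UNDERSCORE =
            c -> Character.isLetterOrDigit(c) || c == '_';

    // Splits text into lower-cased tokens made of characters accepted by isTokenChar.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar.test(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A rejected character closes the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Set JAVA_HOME first";
        System.out.println(tokenize(text, LETTERS_AND_DIGITS)); // [set, java, home, first]
        System.out.println(tokenize(text, WITH_UNDERSCORE));    // [set, java_home, first]
    }
}
```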

So my current approach is to try to use the org.eclipse.help.base.luceneAnalyzer extension point and just replace Analyzer_en with my own class, based on the same libraries and code that are in Helios but with my own version of the LowerCaseAndDigitsTokenizer that allows the "_" to be part of a word. The drawback is that I'll probably have to look at doing the same for all of the locales we support: fr, es, jp, zh_CN.

What do you think?

[Updated on: Fri, 26 August 2011 15:22] by Moderator

Re: Infocenter and lucene bug [message #720097 is a reply to message #719352]

Mon, 29 August 2011 17:43

Eclipse User

OK, I got this to work. I wrote my own analyzer modeled on Analyzer_en and my own tokenizer modeled on LowerCaseAndDigitsTokenizer, the salient difference being that my tokenizer returns true if the character is an underscore. I then pointed the extension in my plug-in at my analyzer, and it didn't work. So I extracted the plugin.xml file from org.eclipse.help.base, replaced the reference to Analyzer_en with mine, updated the jar, and it worked. The problem is that I don't want to ship the modified org.eclipse.help.base jar, as I think it might be a violation of the Eclipse license. Any comment? Is there any way to work around this?

Re: Infocenter and lucene bug [message #720117 is a reply to message #720097]

Mon, 29 August 2011 19:19

Eclipse User

As you have discovered, the analyzer used at runtime needs to match the analyzer used when the index was built. This is because the search query is parsed using the Lucene analyzer: if the prebuilt index used _ as a break character but the runtime did not, a search for abc_def would not find any matches. If underscore is a break character and the same analyzer is used for the prebuilt index as at runtime, a search for abc_def will find all the correct matches but will also match any document that contains both the words abc and def, so you could get a lot of false matches.
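The two cases above can be illustrated with a toy model. This is a simplified sketch, not Lucene: it assumes a document matches when it contains every term of the analyzed query, and `AnalyzerMismatchSketch`, `analyze`, and `matches` are hypothetical names.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Toy model of index-time vs. query-time analysis; not the Lucene API.
class AnalyzerMismatchSketch {

    // Underscore is a break character (everything except letters and digits separates terms).
    static final Pattern SPLIT_ON_UNDERSCORE = Pattern.compile("[^A-Za-z0-9]+");

    // Underscore is part of a word.
    static final Pattern KEEP_UNDERSCORE = Pattern.compile("[^A-Za-z0-9_]+");

    // Lower-cases the text and returns its set of terms under the given separator rule.
    static Set<String> analyze(String text, Pattern separators) {
        Set<String> terms = new HashSet<>();
        for (String term : separators.split(text.toLowerCase(Locale.ROOT))) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Assumed matching rule: the document must contain every analyzed query term.
    static boolean matches(String doc, Pattern indexRule, String query, Pattern queryRule) {
        return analyze(doc, indexRule).containsAll(analyze(query, queryRule));
    }

    public static void main(String[] args) {
        String relevant = "Use abc_def here";
        String unrelated = "abc and def are unrelated";

        // Index split on underscore, query did not: the relevant document is missed.
        System.out.println(matches(relevant, SPLIT_ON_UNDERSCORE, "abc_def", KEEP_UNDERSCORE)); // false

        // Both split on underscore: the relevant document matches, but so does the unrelated one.
        System.out.println(matches(relevant, SPLIT_ON_UNDERSCORE, "abc_def", SPLIT_ON_UNDERSCORE)); // true
        System.out.println(matches(unrelated, SPLIT_ON_UNDERSCORE, "abc_def", SPLIT_ON_UNDERSCORE)); // true
    }
}
```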

Is your problem that you are not seeing valid matches or is it that you are getting false matches? If the former maybe you could open a bug report with an example.

I can't answer the licensing question, but I would advise against using your own analyzer for technical reasons: you will be setting yourself up for having to continue doing this for every release of your product. You also have to consider what happens to your product if it is installed alongside other third-party plug-ins that have their own help content.

If there is a specific case where Eclipse is not finding a match that it should then maybe you can submit a bug report.

Re: Infocenter and lucene bug [message #720401 is a reply to message #720117]

Tue, 30 August 2011 10:08

Eclipse User

Thanks Chris,

I'd definitely prefer to not ship my own analyzer.

The problem is that when I search for a parameter name or, say, an environment variable (e.g. index_offset or JAVA_HOME), I get a bunch of hits near the top of the list that don't have any highlighted words in them, and the hits I am looking for are scattered fairly far down the list. If I quote the search, it works correctly. Since this is technical documentation, we have many, many such items that people will search on.

Actually, I just checked and you can see this in the Eclipse Current Release help documentation. If you search for JAVA_HOME, you get three hits. The first one, "Builder Configuration", doesn't appear to have JAVA_HOME in it. The other two do have it and are correct. If you search for "JAVA_HOME" in quotes, "Builder Configuration" is left off the list.

Personally, I think it is a bug, considering Eclipse help is largely used for technical documentation. But I don't know if calling it a bug would get much attention. What do you think?

[Updated on: Tue, 30 August 2011 10:53] by Moderator

Re: Infocenter and lucene bug [message #720590 is a reply to message #720401]

Tue, 30 August 2011 18:05

Eclipse User

If you take the Bugzilla route, the earliest this could get fixed would be Eclipse 3.8, because the change would require prebuilt indexes to be regenerated, which would be unacceptable for a point release. I'm guessing that is too far out for your immediate needs.

There is one other idea I thought of, which uses a new extension point, org.eclipse.help.searchProcessor, introduced in Eclipse 3.7. It allows you to tweak the query string before a search is performed, so you could rewrite the search terms in any query that contains a term with underscores.
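As a rough sketch of what such a query tweak could look like, the helper below wraps every unquoted term containing an underscore in double quotes, since quoting the search was reported earlier in the thread to give correct results. Only the string rewriting is shown; the hookup to the extension point is omitted, and `UnderscoreQuoter` and `quoteUnderscoreTerms` are hypothetical names, not part of the Eclipse API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites a query string before it reaches the search engine.
class UnderscoreQuoter {

    // Matches either an already quoted phrase or a run of non-whitespace characters.
    private static final Pattern TERM = Pattern.compile("\"[^\"]*\"|\\S+");

    // Wraps unquoted terms that contain an underscore in double quotes.
    // Terms are re-joined with single spaces; anything containing a quote is left untouched.
    static String quoteUnderscoreTerms(String query) {
        Matcher matcher = TERM.matcher(query);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String term = matcher.group();
            if (out.length() > 0) {
                out.append(' ');
            }
            boolean hasQuote = term.indexOf('"') >= 0;
            if (!hasQuote && term.indexOf('_') >= 0) {
                out.append('"').append(term).append('"');
            } else {
                out.append(term);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(quoteUnderscoreTerms("set JAVA_HOME path"));        // set "JAVA_HOME" path
        System.out.println(quoteUnderscoreTerms("\"JAVA_HOME\" index_offset")); // "JAVA_HOME" "index_offset"
    }
}
```

Leaving terms that already contain a quote untouched avoids double-quoting a phrase the user quoted deliberately.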

I agree with you that underscores should not be treated as a break character. This was presumably a design decision made in the early days of the Eclipse help system.
