Eclipse Community Forums
Add pdf files to search index [message #484994] Thu, 10 September 2009 03:49
Mohamed Hussein
Messages: 71
Registered: July 2009
Member
Hello,

I would like to add PDF files to the help search index of my RCP product.

I understand that the Lucene library can index PDF files (and Word documents),
but it seems that Eclipse only supports HTML files.

Do I need to add a new LuceneSearchProvider and associate it with the pdf
extension?

Can I then just delegate adding to the index to the default indexer, or do I
need to use Lucene APIs directly to parse and index the documents?

Thanks in advance for your help,
Mohamed.


Best Regards,
Mohamed.
Re: Add pdf files to search index [message #485412 is a reply to message #484994] Fri, 11 September 2009 12:16
Chris Goldthorpe
Messages: 815
Registered: July 2009
Senior Member
This sounds like the sort of feature that others in the community may
have already implemented, and if so it would be great to get this
contributed to Eclipse. I'll ask around at IBM to see if we have anyone
looking into this. Meanwhile if anyone else on the newsgroup has
implemented this or thought about implementing this I'd be interested to
know what approach you used.
Re: Add pdf files to search index [message #485978 is a reply to message #485412] Tue, 15 September 2009 15:13
Lee Anne Kowalski
Messages: 54
Registered: July 2009
Member
Mohamed Hussein wrote:
> Can I then just delegate adding to the index to the default indexer,
> or do I need to use Lucene APIs directly to parse and index the documents?

I think the Eclipse help system parses the HTML files as well. I'm not
sure at which point in the indexing process. I do know that there is an
HTMLParser.java class in the org.eclipse.help.base.source JAR, and I
presume it is there because the HTML files have to be parsed at some
point in the process.

So I would imagine that to get PDF files into the Lucene index, the PDF
files would have to be parsed.
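
To make the "parse first, then index" idea concrete, here is a minimal sketch in plain Java. It does not use the Eclipse help or Lucene APIs: TextExtractor, TinyIndex and Indexer are hypothetical names made up for illustration, the in-memory index only stands in for Lucene, and the "pdf" extractor is a placeholder where a real PDF parser would be called.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ExtractThenIndex {

    /** Hypothetical hook for a format-specific parser: raw bytes in, plain text out. */
    interface TextExtractor {
        String extract(byte[] content);
    }

    /** Tiny in-memory inverted index; a stand-in for Lucene, not its API. */
    static final class TinyIndex {
        private final Map<String, Set<String>> postings = new HashMap<>();

        void add(String docId, String text) {
            // Lowercase, then split on anything that is not a letter or digit.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
                }
            }
        }

        Set<String> search(String term) {
            return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
        }
    }

    /** Picks an extractor by file extension, extracts the text, then indexes it. */
    static final class Indexer {
        private final Map<String, TextExtractor> extractors = new HashMap<>();
        private final TinyIndex index = new TinyIndex();

        void register(String extension, TextExtractor extractor) {
            extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
        }

        /** Returns false when no extractor is registered for the file's extension. */
        boolean addDocument(String fileName, byte[] content) {
            int dot = fileName.lastIndexOf('.');
            String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
            TextExtractor extractor = extractors.get(ext);
            if (extractor == null) {
                return false;
            }
            index.add(fileName, extractor.extract(content));
            return true;
        }

        Set<String> search(String term) {
            return index.search(term);
        }
    }

    /** Builds an indexer with a crude HTML extractor and a placeholder PDF extractor. */
    static Indexer demoIndexer() {
        Indexer indexer = new Indexer();
        // HTML: strip tags with a naive regex (illustration only, not a real HTML parser).
        indexer.register("html",
                bytes -> new String(bytes, StandardCharsets.UTF_8).replaceAll("<[^>]*>", " "));
        // PDF: PLACEHOLDER. A real implementation would call a PDF parser here;
        // this one just pretends the bytes are already plain text.
        indexer.register("pdf", bytes -> new String(bytes, StandardCharsets.UTF_8));
        return indexer;
    }

    public static void main(String[] args) {
        Indexer indexer = demoIndexer();
        indexer.addDocument("guide.html", "<p>Install guide</p>".getBytes(StandardCharsets.UTF_8));
        indexer.addDocument("manual.pdf", "Lucene manual".getBytes(StandardCharsets.UTF_8));
        System.out.println(indexer.search("manual"));
    }
}
```

The point of the design is that the indexing step never sees the file format. Supporting another document type only means registering another extractor, which is roughly the role a PDF parser would play in front of Lucene.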

Chris Goldthorpe wrote:
> This sounds like the sort of feature that others in the community may
> have already implemented, and if so it would be great to get this
> contributed to Eclipse. I'll ask around at IBM to see if we have anyone
> looking into this. Meanwhile if anyone else on the newsgroup has
> implemented this or thought about implementing this I'd be interested to
> know what approach you used.

One thought is that the PDF document would need to be parsed. I just
went over to lucene.apache.org and the FAQ has this about indexing PDF:
http://wiki.apache.org/lucene-java/LuceneFAQ#head-c45f8b25d786f4e384936fa93ce1137a23b7e422

"In order to index PDF documents you need to first parse them to extract
text that you want to index from them. Here are some PDF parsers that
can help you with that:

PDFBox is a Java API from Ben Litchfield that will let you access the
contents of a PDF document. It comes with integration classes for Lucene
to translate a PDF into a Lucene document.

XPDF is an open source tool that is licensed under the GPL. It's not a
Java tool, but there is a utility called pdftotext that can translate
PDF files into text files on most platforms from the command line.

Based on xpdf, there is a utility called pdftohtml that can translate
PDF files into HTML files. This is also not a Java application.

JPedal is a Java API for extracting text and images from PDF documents."
----------------------------------------
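
For the non-Java route mentioned in the FAQ, a rough sketch of calling pdftotext from Java could look like the following. It assumes pdftotext is installed and on the PATH; PdfToTextRunner and its methods are hypothetical names for illustration. Passing "-" as the output file asks pdftotext to write the extracted text to standard output, and the returned string is what would then be handed to the indexer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class PdfToTextRunner {

    /** Builds the command line: pdftotext -enc UTF-8 <pdf> - (the "-" means stdout). */
    static List<String> buildCommand(Path pdf) {
        return Arrays.asList("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    /** Runs pdftotext and returns the extracted text; assumes the tool is on the PATH. */
    static String extractText(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // drop warnings printed by the tool
                .start();
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return new String(output, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PdfToTextRunner <file.pdf>");
            return;
        }
        System.out.println(extractText(Paths.get(args[0])));
    }
}
```

Shelling out keeps GPL-licensed code out of the Java process, at the cost of needing the tool installed on every machine that builds the index.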

A link about PDFBox to extract the text from a PDF:
http://www.pdfbox.org/userguide/text_extraction.html

The PDFBox site says that it is licensed under the BSD License. I don't
know if that is compatible with the Eclipse license, such that PDFBox
would be a viable solution to ship with the Eclipse Platform itself.

XPDF and JPedal seem to be GPL or LGPL.

Hope that helps,
Lee Anne
Re: Add pdf files to search index [message #899262 is a reply to message #623554] Tue, 31 July 2012 05:12
Mao Lin
Messages: 2
Registered: July 2012
Junior Member
Hi, yes, there should be a parser that extracts text from the PDF to build the index, but do we need to code it ourselves, or can we just deploy a plug-in for that purpose?