XML Content Search in Eclipse Help

Konrad Kolosowski and Dejan Glozic, 08/08/2005

 

Background

Although Eclipse help requires HTML as the final presentation format, there is no technical reason why documentation cannot be delivered in XML form and transformed into HTML on demand. This approach allows conditional transformation that produces different results depending on context attributes such as platform or target audience. Visual attributes can also be controlled outside the content.

The purpose of this document is to address the problem of information search in the Eclipse help system when XML is used as a content format. Search indexing will have to be performed on the original XML documents, while the hits obtained from subsequent searches will point to the transformed HTML documents. This document analyses the problems that result from this separation and offers possible solutions.

This document only lists ideas and is by no means an indication of the final implementation. For this reason, please do not create dependencies on these ideas or otherwise factor them into your product plans.

In the following text, the acronym 'UA' will be used repeatedly to denote 'User Assistance'.

The problem

The Eclipse 3.1 help system uses one instance of a Lucene index per locale for indexing all documentation. Separate fields hold the title, the summary, and the content text, which is analyzed in two ways (for stemmed and for exact search). A name field relates each Lucene document to the document identifiers understood by the help system, which are URLs. More than one index instance exists only when more than one locale is in use. In an Infocenter installation, multiple indexes are in use simultaneously but are completely independent. The only two exceptions are that work on multiple locales is synchronized to prevent memory spikes, and that the shared configuration is locked to prevent index corruption by multiple Infocenters.

When indexing help content contributed as XML, keeping the 3.1 search features would be fairly easy. The only change required would be to delegate to a different parser or document producer when encountering an XML document instead of an HTML one. An enhancement would be to allow other UA components (e.g. intro, cheat sheets) to be indexed, and to allow displayable documents to be built from XML, rather than requiring a one-to-one relationship between XML files and topics.

The following thoughts assume we will allow XML as a format for Eclipse help, but will not completely unify the format of content contribution across UA components. Content and API will be reusable, thus opening the possibility of search beyond classic help system topics, across UA.

The Lucene index structure

Information from components other than the help system can be indexed using multiple indexes, one for the information from each component, or using a single index for all information. Since the information is of a similar type and will likely be searched together, with the results presented as a unified set, it makes more sense to index all types of documentation in one index shared by all components.

Since Lucene will serve multiple components, not just the help system, it needs to be separated from the tight control of the help system. The life cycle of IndexWriter, IndexReader and IndexSearcher, and the stability of the index, need to be the responsibility of the Index component: Lucene plus a manager layer with APIs. Access to the index should be through service-like APIs, with participants registered through an extension point. This will allow the index to manage concurrency and to solicit indexing from all components when a search call is initiated, when a change in contributions occurs, or when recovery from corruption is needed.

Lucene makes no assumptions about document format. A search result is a document artifact that consists of any number of text fields containing text tokens. A basic search query searches for one term within one field. Searching across fields is accomplished using complex queries built with knowledge of the relevant fields. We have the flexibility of using additional fields for stored or meta information, in text format. There is no built-in relationship between fields, no structure, and no relationship between documents, which makes indexing structured information less than straightforward if preserving structure or relationships is important.

We must keep the number of searchable fields finite and low, to preserve good performance. There are two natural approaches to indexing an XML document that use a single field for later searching. One is to extract all text from the document and index it in one field (another field is needed to store the document identifier). This is a very simple, well-performing solution, but it treats all text from the document as equal. The other approach is to treat each element's text and each element's attribute values as units, and to index every one as a dedicated Lucene document. For this approach to be beneficial, the element context (its containment within the document and within other elements) must be stored, or an identifier must be assigned that can be used to look up the corresponding XML document outside of Lucene.
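
As an illustration only, both approaches can be sketched with the SAX parser that ships with Java. The class and method names below (XmlIndexingSketch, Unit, extractUnits, extractAllText) are hypothetical, the element path is just one possible way of recording the element context, and attribute values are left out for brevity.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class XmlIndexingSketch {

    /** One indexable unit for the second approach: element path plus a run of its text. */
    static final class Unit {
        final String path;
        final String text;
        Unit(String path, String text) { this.path = path; this.text = text; }
    }

    /** Approach 1: all text of the document, to be indexed in a single field. */
    static String extractAllText(String xml) {
        StringBuilder all = new StringBuilder();
        for (Unit unit : extractUnits(xml)) {
            if (all.length() > 0) all.append(' ');
            all.append(unit.text);
        }
        return all.toString();
    }

    /** Approach 2: one unit per run of element text, tagged with its containment path. */
    static List<Unit> extractUnits(String xml) {
        List<Unit> units = new ArrayList<>();
        List<String> path = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            // Emits the buffered text (if any) as a unit of the current element.
            private void flush() {
                String text = buffer.toString().trim().replaceAll("\\s+", " ");
                if (!text.isEmpty()) units.add(new Unit("/" + String.join("/", path), text));
                buffer.setLength(0);
            }
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                flush();
                path.add(qName); // attribute values are ignored in this sketch
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                buffer.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                flush();
                path.remove(path.size() - 1);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse XML", e);
        }
        return units;
    }

    public static void main(String[] args) {
        String xml = "<topic><title>Search</title><body><p>Index the <b>XML</b> source.</p></body></topic>";
        System.out.println(extractAllText(xml));
        for (Unit unit : extractUnits(xml)) System.out.println(unit.path + " -> " + unit.text);
    }
}
```

The first method loses all structure, while the second yields many small units whose paths would have to be stored in an extra field. This is the trade-off described above.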

Abstracting the indexable document

Setting aside for a moment the format of the source document, what makes sense in Eclipse UA is to be able to retrieve document fragments that can be presented on their own, or that are reusable pieces that can be displayed as part of different pages or in different places. The fact that the content originated from an XML file should be irrelevant. Therefore XML content needs to be abstracted into a fragment that corresponds to a Lucene document that can be indexed. A sample abstract IndexableDocument can be designed with the following methods:

String getContributor() - containing the id of the index contributor, for example org.eclipse.ui.intro
String getName() - containing a unique document identifier understood by a component, for example a URL, an XPath, or anything else that will be needed to retrieve the document for displaying
Reader getContent() - containing the main text content for search
String getTitle() - containing higher importance text for search
String getKeywords() - containing optional keywords or synonyms for search, present or not in the original document
String getRawTitle() - containing a one line human readable name/title that can later be retrieved from the index
String getSummary() - containing an optional, multiple line human readable description of the content
String getConstraints() - containing ids of the constraints, like OS or roles, that this content applies to
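
Written out as Java, the abstraction might look as follows. This is only a sketch of the method list above: the wrapper class IndexableDocumentSketch and the in-memory SimpleDocument implementation are hypothetical and exist purely to make the example self-contained.

```java
import java.io.Reader;
import java.io.StringReader;

class IndexableDocumentSketch {

    /** The abstraction from the text; one instance corresponds to one Lucene document. */
    interface IndexableDocument {
        String getContributor();  // id of the index contributor, e.g. org.eclipse.ui.intro
        String getName();         // identifier the component can resolve (URL, XPath, ...)
        Reader getContent();      // main text content for search
        String getTitle();        // higher importance text for search
        String getKeywords();     // optional keywords or synonyms for search
        String getRawTitle();     // one line human readable title, retrievable from the index
        String getSummary();      // optional multiple line description
        String getConstraints();  // ids of constraints (OS, roles, ...) the content applies to
    }

    /** Hypothetical in-memory implementation, for illustration only. */
    static final class SimpleDocument implements IndexableDocument {
        private final String contributor;
        private final String name;
        private final String title;
        private final String content;

        SimpleDocument(String contributor, String name, String title, String content) {
            this.contributor = contributor;
            this.name = name;
            this.title = title;
            this.content = content;
        }

        public String getContributor() { return contributor; }
        public String getName() { return name; }
        public Reader getContent() { return new StringReader(content); }
        public String getTitle() { return title; }
        public String getKeywords() { return ""; }     // none in this sketch
        public String getRawTitle() { return title; }  // same text as the search title here
        public String getSummary() { return ""; }      // none in this sketch
        public String getConstraints() { return ""; }  // unconstrained in this sketch
    }

    public static void main(String[] args) {
        IndexableDocument doc = new SimpleDocument(
                "org.eclipse.ui.intro", "intro/overview.xml", "Overview", "Welcome to Eclipse");
        System.out.println(doc.getContributor() + " " + doc.getName() + " " + doc.getRawTitle());
    }
}
```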

More methods can be added if necessary, but allowing a variable number of fields based on XML schema would take away the benefits of using a common index between components. Fields private to a component and dependent on each XML definition would be practically equal to having separate indexes, and would require a different search query for each document type, even though all documents would nominally be kept in the same Lucene index/file. The benefits and performance of such a design are questionable.

Each index participant will register its own IndexableDocument factory (parser, text extractor, digester, whatever the name) with the index manager. Upon a request for a document, the factory will produce a document that can easily be indexed. Assuming we will not be able to unify the contribution format across all UA components (the old help, intro, and cheat sheet formats may need to be supported), each UA component may have its own factory or multiple factories, or even allow pluggable factories for undefined source document formats. A factory may exist for DITA XML, XHTML, HTML, PDF or dynamically generated content. From the implementation point of view, factories will mainly be parsers. If there is a need, the parsers can be parameterized and work off the schema that is also used for displaying documents, but the challenge is to make them fast and able to withstand long documents without large memory consumption.
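
The registration described above could be sketched as follows. In Eclipse the factories would presumably be contributed through an extension point; here a plain map stands in for that, and all names (IndexManagerSketch, Doc, DocumentFactory, register, request) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class IndexManagerSketch {

    /** Minimal stand-in for the indexable document abstraction. */
    static final class Doc {
        final String contributor;
        final String name;
        final String text;
        Doc(String contributor, String name, String text) {
            this.contributor = contributor;
            this.name = name;
            this.text = text;
        }
    }

    /** Produces a document for an identifier, or empty if the format is not handled. */
    interface DocumentFactory {
        Optional<Doc> create(String contributor, String name);
    }

    // A contributor may register several factories, e.g. one per source format.
    private final Map<String, List<DocumentFactory>> factories = new LinkedHashMap<>();

    void register(String contributor, DocumentFactory factory) {
        factories.computeIfAbsent(contributor, key -> new ArrayList<>()).add(factory);
    }

    /** Asks the contributor's factories in registration order; the first match wins. */
    Optional<Doc> request(String contributor, String name) {
        List<DocumentFactory> candidates =
                factories.getOrDefault(contributor, Collections.<DocumentFactory>emptyList());
        for (DocumentFactory factory : candidates) {
            Optional<Doc> doc = factory.create(contributor, name);
            if (doc.isPresent()) return doc;
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        IndexManagerSketch manager = new IndexManagerSketch();
        manager.register("org.eclipse.help", (contributor, name) -> name.endsWith(".xml")
                ? Optional.of(new Doc(contributor, name, "text extracted from XML"))
                : Optional.<Doc>empty());
        System.out.println(manager.request("org.eclipse.help", "topics/a.xml").isPresent());
        System.out.println(manager.request("org.eclipse.help", "topics/a.pdf").isPresent());
    }
}
```

A real factory would stream the source document rather than hold its text in memory, which is where the performance concern raised above comes in.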

Collating and filtering results

If the components remain separate, each UA component needs to keep track of its own contributions, be able to answer whether it requires indexing (either addition or deletion) of documents, and produce a list of document IDs to add or remove when indexing is triggered by any of the components. This will allow for a consistent index and consistent searches, irrespective of which components the user has activated.

Merging of search results from across components occurs automatically, as all documents exist in the same Lucene index. However, search results will need to be passed through each indexing participant for filtering and for converting each IndexableDocument identifier to a handle to a document that can be presented to the user. Each component would be responsible for filtering out search results that should be hidden from the user. For example, enabling an activity is a frequent operation that should not result in an index change. Filtering based on activities should occur post search, on the search results.
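
A minimal sketch of this post-search pass is shown below. The names (ResultCollatorSketch, Hit, Participant, resolve, collate) are hypothetical. Each participant either converts an identifier into a displayable handle or hides the hit, and the ranking order returned by the index is preserved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class ResultCollatorSketch {

    /** A raw search hit: which contributor indexed it, and under which identifier. */
    static final class Hit {
        final String contributor;
        final String name;
        Hit(String contributor, String name) {
            this.contributor = contributor;
            this.name = name;
        }
    }

    interface Participant {
        /** Returns a displayable handle, or empty if the hit must be hidden from the user. */
        Optional<String> resolve(String name);
    }

    /** Passes every hit through its own participant, keeping the original order. */
    static List<String> collate(List<Hit> hits, Map<String, Participant> participants) {
        List<String> handles = new ArrayList<>();
        for (Hit hit : hits) {
            Participant participant = participants.get(hit.contributor);
            if (participant == null) continue; // no participant available: drop the hit
            participant.resolve(hit.name).ifPresent(handles::add);
        }
        return handles;
    }
}
```

A participant that filters by activities would simply return an empty result for documents belonging to disabled activities, so the index itself never changes when activities are toggled.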

Some filtering constraints with a well defined set of values, for example OS, WS and ARCH, can be indexed with minimal overhead. If indexing occurred on the client machine, documents not satisfying the constraints could be skipped and not indexed at all, but taking advantage of prebuilt indexes requires such content to be indexed and then filtered on the client. Keeping such content identified by indexed constraints will allow automatic filtering by complementing the search with an additional boolean query on the constraints field.
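
To make the idea concrete, the sketch below appends required clauses to a user query using the textual syntax of Lucene's classic query parser (a leading + marks a required clause, and field:(a OR b) groups alternatives within a field). The field name constraints, the token scheme os_win32, and the os_any token for documents that carry no restriction are all assumptions made for this example. The user query is not escaped, and in practice the clauses would more likely be built programmatically against a field that is indexed without analysis.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ConstraintQuerySketch {

    /**
     * Wraps the user query in a required clause and adds one required clause per
     * environment entry, each accepting the client's value or the "any" marker.
     */
    static String withConstraints(String userQuery, Map<String, String> environment) {
        StringBuilder query = new StringBuilder("+(").append(userQuery).append(')');
        for (Map.Entry<String, String> entry : environment.entrySet()) {
            String key = entry.getKey();
            query.append(" +constraints:(")
                 .append(key).append('_').append(entry.getValue())
                 .append(" OR ")
                 .append(key).append("_any)");
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, String> environment = new LinkedHashMap<>(); // keeps insertion order
        environment.put("os", "win32");
        environment.put("arch", "x86");
        System.out.println(withConstraints("install plugin", environment));
    }
}
```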

Filtering based on NL should not occur. Given that UA content is human readable, almost 100% of it is translatable, and only a very small number of documents is expected to be common between languages. For this reason, an index should contain text for one locale only, as in the 3.1 help system.

Optionally, we can add an additional filtering field for the UA component's own use. The information there would not be further analyzed, but indexed as is. When performing a search, each UA component would have a chance to narrow the search to its own contributions, and to narrow it further by providing a query to apply to the additional filtering field. For example, when invoked from a wizard, help system search may provide a query requiring the additional filtering field of the results to contain “task”. The information can be partitioned this way into subsets not necessarily understood by all parts of UA.

One of the things expected in future versions of help is dynamically revealed links to related documents, shown only when the targets are present, so that broken links are not displayed all the time. Since we are not trying to support searches for references, it does not look necessary to index links or their descriptions. Omitting links from indexing will ensure that the target document is found when it exists, and not the document that refers to it.

Handling prebuilt indexes

Since the index will be shared among components, producing prebuilt indexes will be more challenging. For easiest maintenance and consistency, it is best for the Index component to be responsible for generating and merging prebuilt indexes. This will require that each component's handlers and document factories be able to work with documents in the workspace, or in another installation of Eclipse.

It is impossible to come up with a perfect ranking algorithm. The more structural differences there are between documents, the more complicated the index is and the more difficult it is to design an optimal ranking algorithm. Help system search up to 3.1 suffers from document length having too great an influence on the ranking. Shorter documents with N matches rank much higher than long documents with the same number of matches in the text. It may be necessary to implement a custom Scorer for UA searches.
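
The effect of document length can be illustrated with a small calculation. It assumes Lucene's classic default similarity, in which the term frequency factor is the square root of the number of matches and the length norm is one over the square root of the number of terms in the field; other factors (inverse document frequency, boosts, norm quantization) are ignored. The dampened variant is purely hypothetical and only shows the kind of adjustment a custom scorer might make.

```java
class LengthNormSketch {

    /** Assumed classic defaults: sqrt(matches) * (1 / sqrt(numTerms)). */
    static double classicFactor(int matches, int numTerms) {
        return Math.sqrt(matches) / Math.sqrt(numTerms);
    }

    /** Hypothetical dampened variant: length enters with exponent 1/4 instead of 1/2. */
    static double dampenedFactor(int matches, int numTerms) {
        return Math.sqrt(matches) / Math.pow(numTerms, 0.25);
    }

    public static void main(String[] args) {
        int matches = 4;
        int shortDoc = 100;    // terms in a short document
        int longDoc = 10000;   // terms in a long document
        System.out.println("classic short/long ratio:  "
                + classicFactor(matches, shortDoc) / classicFactor(matches, longDoc));
        System.out.println("dampened short/long ratio: "
                + dampenedFactor(matches, shortDoc) / dampenedFactor(matches, longDoc));
    }
}
```

Under these assumptions, a 100-term document with four matches scores about ten times higher than a 10,000-term document with the same four matches, while the dampened variant reduces the gap to roughly a factor of three.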