The Doc2Model (Document to Model) framework is a proposed open source component under the Eclipse Modeling Framework Technology project for parsing structured documents (e.g., xlsx, docx, odt, odf...) to produce EMF models. This proposal is in the Project Proposal Phase (as defined in the Eclipse Development Process document) and is written to declare the component's intent and scope, and to solicit additional participation and input from the Eclipse community. You are invited to comment on and/or join the project. Please send all feedback to the Eclipse Modeling Framework Technology (eclipse.technology.emft) newsgroup with [doc2model] as a prefix of the subject line.
The most widely used tools in many organizations continue to be text processors and spreadsheets. Often these documents describe business data that are important to manipulate in other contexts. Examples of data contained in such documents include the following:
- CRC cards
- Structure definitions
- Documentation generation
Because these kinds of tools often produce plain text documents, it's typically quite complex and time-consuming to develop a specific parser able to produce output more amenable to further manipulation.
Currently some organizations are investing effort in publishing specifications of open file formats, for example Office Open XML (e.g., docx, xlsx...) and OpenDocument (e.g., odt, odf...), to facilitate widespread adoption and easier consumption.
In practice, most business documents organize their data in a common way (top-down, for example, in text documents) that can be recognized through text styles, regular expressions, and column numbering. It is therefore possible to provide a generic solution for parsing these documents and transforming the business data into EMF models, using XML parsing and EMF's reflective capabilities.
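As an illustration of the XML parsing side, the following is a minimal sketch that reads paragraph styles and text from a WordprocessingML fragment (the main XML part of a docx file) using only the standard Java DOM API. The class `StyledParagraphReader` and its `read` method are hypothetical names chosen for this example and are not part of Doc2Model; the sketch also skips unzipping the docx container and building EMF objects.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not Doc2Model API): lists "style: text" for each paragraph. */
class StyledParagraphReader {
    // Namespace of WordprocessingML elements such as w:p, w:pStyle and w:t.
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static List<String> read(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> result = new ArrayList<>();
            NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Element p = (Element) paragraphs.item(i);
                // The paragraph style, if any, is the w:val attribute of w:pStyle.
                NodeList styles = p.getElementsByTagNameNS(W_NS, "pStyle");
                String style = styles.getLength() > 0
                        ? ((Element) styles.item(0)).getAttributeNS(W_NS, "val")
                        : "(none)";
                // The paragraph text is spread over one or more w:t elements.
                StringBuilder text = new StringBuilder();
                NodeList texts = p.getElementsByTagNameNS(W_NS, "t");
                for (int j = 0; j < texts.getLength(); j++) {
                    text.append(texts.item(j).getTextContent());
                }
                result.add(style + ": " + text);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:pPr><w:pStyle w:val=\"Heading1\"/></w:pPr>"
                + "<w:r><w:t>Requirements</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>REQ-1 The system </w:t></w:r>"
                + "<w:r><w:t>shall start.</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        for (String line : read(xml)) {
            System.out.println(line);
        }
    }
}
```

A style name recovered this way (here `Heading1`) is the kind of information a style-based matching rule could key on, while the text itself could be tested against a regular expression.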
This project will provide an extensible framework for producing EMF model instances from plain text and structured documents.
Transforming a business document into an EMF model will facilitate more opportunities to exploit the business data contained in such a document. In some cases documents represent the specification of a system. Instead of retyping information to produce the corresponding model it will be possible to generate it.
Doc2Model can be used, for example, to import requirements from text files and transform them into SysML requirements models.
The document file formats which will be managed by Doc2Model include:
- Open formats such as docx, xlsx, odt, odf;
- Common formats such as csv;
- And formats desired by the Eclipse community.
The Doc2Model API will provide an extension mechanism to allow users to add custom parsers for specific tools. These parsers could be contributed to the Doc2Model component if their license is compatible with the EPL.
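The proposal does not define the extension API, so the following is only a sketch of what a custom parser contribution could look like in plain Java. The `DocumentParser` interface, the `SimpleCsvParser` implementation, and the `ParserRegistry` lookup by file extension are all assumptions made for illustration; an actual Eclipse component might well expose this through a plug-in extension point instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical contract for a custom parser; name and signature are assumptions. */
interface DocumentParser {
    /** Splits raw document content into rows of cells. */
    List<List<String>> parse(String content);
}

/** Naive CSV parser: comma-separated values, no quoting or escaping support. */
class SimpleCsvParser implements DocumentParser {
    @Override
    public List<List<String>> parse(String content) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : content.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            List<String> cells = new ArrayList<>();
            for (String cell : line.split(",", -1)) {
                cells.add(cell.trim());
            }
            rows.add(cells);
        }
        return rows;
    }
}

/** Hypothetical registry mapping a file extension to the parser that handles it. */
class ParserRegistry {
    private final Map<String, DocumentParser> parsers = new HashMap<>();

    void register(String extension, DocumentParser parser) {
        parsers.put(extension.toLowerCase(), parser);
    }

    DocumentParser forFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase();
        DocumentParser parser = parsers.get(extension);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for: " + fileName);
        }
        return parser;
    }
}
```

With this shape, supporting a new tool would amount to implementing `DocumentParser` and calling `register` with the tool's file extension, for example `registry.register("csv", new SimpleCsvParser())`.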
The target model type is specified using a configuration model which describes how the data is identified during the parsing. This configuration is a map indicating what the generator does when a matching rule is applied. Matching rules make use of regular expressions, special styles, columns (spreadsheet), and tags. Transformation proceeds as follows:
1. Read the matching configuration.
2. Analyse (parse) the input document to identify matching data based on the rules.
3. Produce an output EMF model instance from the data recognized in the input document. Cross references between the data are supported and additional data can be injected.
Figures: input user document; doc2model mapping; result after execution.
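The three transformation steps can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the Doc2Model implementation: `Rule`, `ModelObject`, and `transform` are hypothetical names, only regular-expression rules are shown, and `ModelObject` is a simple stand-in for an EMF model object so that the example runs without the EMF libraries. Cross references and data injection are not covered.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MappingSketch {
    /** Hypothetical rule: a line matching the pattern yields an object of the given type. */
    static final class Rule {
        final Pattern pattern;
        final String type;
        final String[] attributes; // attribute i receives capture group i + 1

        Rule(String regex, String type, String... attributes) {
            this.pattern = Pattern.compile(regex);
            this.type = type;
            this.attributes = attributes;
        }
    }

    /** Plain-Java stand-in for an EMF model object: a type name plus attribute values. */
    static final class ModelObject {
        final String type;
        final Map<String, String> values = new LinkedHashMap<>();

        ModelObject(String type) {
            this.type = type;
        }

        @Override
        public String toString() {
            return type + values;
        }
    }

    /** Step 1 is the rules argument; steps 2 and 3 happen line by line below. */
    static List<ModelObject> transform(List<Rule> rules, String document) {
        List<ModelObject> model = new ArrayList<>();
        for (String line : document.split("\\R")) {
            for (Rule rule : rules) {
                // Step 2: identify matching data based on the rule.
                Matcher m = rule.pattern.matcher(line.trim());
                if (m.matches()) {
                    // Step 3: produce a model object from the recognized data.
                    ModelObject object = new ModelObject(rule.type);
                    for (int i = 0; i < rule.attributes.length; i++) {
                        object.values.put(rule.attributes[i], m.group(i + 1));
                    }
                    model.add(object);
                    break; // first matching rule wins
                }
            }
        }
        return model;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("(REQ-\\d+)\\s*:\\s*(.+)", "Requirement", "id", "text"));
        String document = "Introduction\n"
                + "REQ-1: The system shall start.\n"
                + "REQ-2: The system shall stop.";
        for (ModelObject object : transform(rules, document)) {
            System.out.println(object);
        }
    }
}
```

In this sketch, lines that match no rule (such as `Introduction`) are simply ignored. In an EMF-based version, the stand-in object would presumably be replaced by objects created reflectively from the target metamodel.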
Relationship with Other Eclipse Projects/Components
- Doc2Model will be built on top of EMF.
- Doc2Model will exploit EMF Compare to obtain differences between models.
Third party libraries
- No third-party libraries are required; parsing relies on standard Java.
- Topcased is offering doc2model as an initial codebase (see http://gforge.enseeiht.fr/projects/doc2model).
- A Flash demo of the current version is available here.