Serialization

Serialization is the process of transforming an EMF model into its textual representation. Thereby, serialization complements parsing and lexing.

In Xtext, the process of serialization is split into the following steps:

  1. Validating the semantic model. This is optional, enabled by default, done by the concrete syntax validator and can be turned off in the save options.
  2. Matching the model elements with the grammar rules and creating a stream of tokens. This is done by the parse tree constructor.
  3. Associating comments with semantic objects. This is done by the comment associator.
  4. Associating existing nodes from the node model with tokens from the token stream.
  5. Merging existing white space and line-wraps into the token stream.
  6. Adding further needed white space or replacing all white space using a formatter.

Serialization is invoked when calling XtextResource.save(..) (src). Furthermore, the Serializer (src) provides resource-independent support for serialization. Another situation that triggers serialization is applying Quick Fixes with semantic modifications. Serialization is not invoked when the contents of a textual editor are saved to disk.

The Contract

The contract of serialization says that a model which is saved (serialized) to its textual representation and then loaded (parsed) again yields a new model that is equal to the original model. Please be aware that this does not imply that loading a textual representation and serializing it back produces an identical textual representation. However, the serialization algorithm tries to restore as much information as possible. That is, if the parsed model was not modified in-memory, the serialized output will usually be equal to the previous input. Unfortunately, this cannot be ensured for each and every case. A use case where it is hardly possible is shown in the following example:

MyRule:
  (xval+=ID | yval+=INT)*;

The given MyRule reads ID- and INT-elements which may occur in an arbitrary order in the textual representation. However, when serializing the model all ID-elements will be written first and then all INT-elements. If the order is important it can be preserved by storing all elements in the same list - which may require wrapping the ID- and INT-elements into other objects.
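The loss of ordering can be illustrated with a minimal, self-contained Java sketch. It does not use Xtext or EMF: MyRuleModel, parse and serialize are hypothetical stand-ins that only mimic how two separate lists forget the original interleaving.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the object created by MyRule (assumption: not generated Xtext code).
class MyRuleModel {
    final List<String> xval = new ArrayList<>();   // filled by xval+=ID
    final List<Integer> yval = new ArrayList<>();  // filled by yval+=INT

    // Toy "parser": digit-only tokens count as INT, everything else as ID.
    static MyRuleModel parse(String text) {
        MyRuleModel model = new MyRuleModel();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (token.matches("\\d+")) {
                model.yval.add(Integer.parseInt(token));
            } else {
                model.xval.add(token);
            }
        }
        return model;
    }

    // Toy "serializer": it only sees two lists, so IDs are written first, then INTs.
    String serialize() {
        List<String> out = new ArrayList<>(xval);
        for (Integer value : yval) {
            out.add(String.valueOf(value));
        }
        return String.join(" ", out);
    }

    // Model equality as required by the contract: same values in both lists.
    boolean sameContent(MyRuleModel other) {
        return xval.equals(other.xval) && yval.equals(other.yval);
    }
}
```

In this sketch, parsing the text a 1 b 2 and serializing the result gives a b 1 2. The text differs, yet parsing the new text again yields a model with the same content, so the contract still holds.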

Roles of the Semantic Model and the Node Model During Serialization

A serialized document represents the state of the semantic model. However, if there is a node model available (i.e. the semantic model has been created by the parser), the serializer

  • preserves existing white spaces from the node model.
  • preserves existing comments from the node model.
  • preserves the representation of cross-references: If a cross-referenced object can be identified by multiple names (i.e. scoping returns multiple IEObjectDescriptions (src) for the same object), the serializer tries to keep the name that was used in the input file.
  • preserves the representation of values: For values handled by the value converter, the serializer checks whether the textual representation converted to a value equals the value from the semantic model. If that is true, the textual representation is kept.
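The last rule can be sketched in a few lines of plain Java. ValueTextChooser and its methods are hypothetical names, and the toy converter (decimal and 0x-prefixed hexadecimal integers) is an assumption made for illustration, not Xtext's value converter API.

```java
// Hypothetical helper, not part of Xtext: picks the text to write for an integer value.
class ValueTextChooser {
    // Toy value converter: accepts decimal ("16") and hexadecimal ("0x10") notation.
    static int toValue(String text) {
        return text.startsWith("0x")
                ? Integer.parseInt(text.substring(2), 16)
                : Integer.parseInt(text);
    }

    // Canonical textual form, used when the old text cannot be reused.
    static String toText(int value) {
        return Integer.toString(value);
    }

    // nodeText is the text found in the node model, or null if there is no node.
    static String chooseText(int modelValue, String nodeText) {
        if (nodeText != null && toValue(nodeText) == modelValue) {
            return nodeText;        // old text still denotes the same value: keep it
        }
        return toText(modelValue);  // value changed or no node: write the canonical form
    }
}
```

With this sketch, a model value of 16 that was parsed from 0x10 is written back as 0x10, while a value changed to 17 is written as 17.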

Parse Tree Constructor

The parse tree constructor usually does not need to be customized since it is automatically derived from the Xtext Grammar. However, it can be helpful to look into it to understand its error messages and its runtime performance.

For serialization to succeed, the parse tree constructor must be able to consume every non-transient element of the to-be-serialized EMF model. To consume means, in this context, to write the element to the textual representation of the model. This requirement can be hard to fulfill, since a grammar usually imposes implicit constraints on the EMF model, as explained for the concrete syntax validator.

If a model cannot be serialized, an XtextSerializationException (src) is thrown. Possible reasons are listed below:

  • A model element cannot be consumed. This can have the following reasons/solutions:
    • The model element should not be stored in the model.
    • The grammar needs an assignment which would consume the model element.
    • The transient value service can be used to indicate that this model element should not be consumed.
  • An assignment in the grammar has no corresponding model element. The default transient value service considers a model element to be transient if it is unset or equals its default value. However, the parse tree constructor may serialize default values if this is required by a grammar constraint to be able to serialize another model element. The following solutions may help in such a scenario:
    • A model element should be added to the model.
    • The assignment in the grammar should be made optional.
  • The type of the model element differs from the type in the grammar. The type of the model element must be identical to the return type of the grammar rule or the action's type. Subtypes are not allowed.
  • Value conversion fails. The value converter can indicate that a value is not serializable by throwing a ValueConverterException (src).
  • An enum literal is not allowed at this position. This can happen if the referenced enum rule only lists a subset of the literals of the actual enumeration.

To understand error messages and performance issues of the parse tree constructor, it is important to know that it implements a backtracking algorithm. This basically means that the grammar is used to specify the structure of a tree in which one path (from the root node to a leaf node) is a valid serialization of a specific model. The parse tree constructor's task is to find this path, with the condition that all model elements are consumed while walking this path. The parse tree constructor's strategy is to take the most promising branch first (the one that would consume the most model elements). If the branch leads to a dead end (for example, if a model element needs to be consumed that is not present in the model), the parse tree constructor backtracks along the path until a different branch can be taken. This behavior has two consequences:

  • In case of an error, the parse tree constructor has found only dead ends but no leaf. It cannot tell which dead end is actually erroneous. Therefore, the error message lists dead ends of the longest paths, a fragment of their serialization and the reason why the path could not be continued at this point. The developer has to judge which reason is the actual error.
  • For reasons of performance, it is critical that the parse tree constructor takes the most promising branch first and detects wrong branches early. One way to achieve this is to avoid having many rules which return the same type and which are called from within the same alternative in the grammar.
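The backtracking idea can be sketched with a strongly simplified model. This is not the algorithm used by Xtext: the grammar is assumed to be a flat sequence of positions, each with a list of alternatives, and every alternative is just the list of feature names its assignments would consume. BacktrackingSketch and findPath are hypothetical names, and "most promising" is approximated by the number of features an alternative consumes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class BacktrackingSketch {
    // groups.get(i) holds the alternatives at the i-th grammar position.
    // Returns the consumed features in serialization order, or null if every branch is a dead end.
    static List<String> findPath(List<List<List<String>>> groups, Set<String> model) {
        List<String> path = new ArrayList<>();
        return search(groups, 0, new LinkedHashSet<>(model), path) ? path : null;
    }

    private static boolean search(List<List<List<String>>> groups, int index,
                                  Set<String> remaining, List<String> path) {
        if (index == groups.size()) {
            return remaining.isEmpty();   // a leaf is valid only if everything was consumed
        }
        List<List<String>> alternatives = new ArrayList<>(groups.get(index));
        // Most promising first: the alternative that consumes the most features.
        alternatives.sort(Comparator.comparingInt((List<String> a) -> a.size()).reversed());
        for (List<String> alternative : alternatives) {
            if (!remaining.containsAll(alternative)) {
                continue;                 // dead end: an assignment has no model element
            }
            Set<String> rest = new LinkedHashSet<>(remaining);
            rest.removeAll(alternative);
            int mark = path.size();
            path.addAll(alternative);
            if (search(groups, index + 1, rest, path)) {
                return true;
            }
            path.subList(mark, path.size()).clear();   // backtrack, then try the next branch
        }
        return false;
    }
}
```

If no path consumes all features, the sketch returns null without saying which dead end was the real problem, which mirrors why the actual error message can only list candidate dead ends.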

Options

SaveOptions (src) can be passed to XtextResource.save(options) (src) and to Serializer.serialize(..) (src). Available options are:

  • Formatting. Default: false. If enabled, it is the formatter's job to determine all white space information during serialization. If disabled, the formatter only defines white space information for the places in which no white space information can be preserved from the node model, e.g. when new model elements are inserted or there is no node model.
  • Validating. Default: true. Run the concrete syntax validator before serializing the model.

Preserving Comments from the Node Model

The ICommentAssociater (src) associates comments with semantic objects. This is important when an element in the semantic model is moved to a different position and the model is serialized: one expects the comments to move to the new position in the document as well.

Which comment belongs to which semantic object is surely a very subjective issue. The default implementation (src) behaves as follows, but can be customized:

  • If there is a semantic token before a comment and in the same line, the comment is associated with this token's semantic object.
  • In all other cases, the comment is associated with the semantic object of the next following token.
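The two rules above can be sketched as follows. The sketch is self-contained and does not use the ICommentAssociater API: Token, ownerOf and the use of a String as the "semantic object" are assumptions made for illustration.

```java
import java.util.List;

class CommentAssociationSketch {
    // A semantic token: the line it is on and a stand-in for the semantic object that owns it.
    static class Token {
        final int line;
        final String owner;

        Token(int line, String owner) {
            this.line = line;
            this.owner = owner;
        }
    }

    // tokens are in document order; tokensBefore is the number of tokens that
    // precede the comment; commentLine is the line on which the comment starts.
    static String ownerOf(List<Token> tokens, int tokensBefore, int commentLine) {
        if (tokensBefore > 0 && tokens.get(tokensBefore - 1).line == commentLine) {
            return tokens.get(tokensBefore - 1).owner;   // rule 1: token before, same line
        }
        if (tokensBefore < tokens.size()) {
            return tokens.get(tokensBefore).owner;       // rule 2: next following token
        }
        return null;   // no following token; this case is not covered by the rules above
    }
}
```

For tokens of object A on line 1 and object B on line 3, a comment at the end of line 1 goes to A, while a comment on its own line 2 goes to B.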

Transient Values

Transient values are values or model elements which are not persisted (written to the textual representation in the serialization phase). If a model contains model elements which cannot be serialized with the current grammar, it is critical to mark them transient using the ITransientValueService (src), or serialization will fail. The default implementation treats a model element as transient if eStructuralFeature.isTransient() returns true or eObject.eIsSet(eStructuralFeature) returns false. By default, EMF returns false for eIsSet(..) if the value equals the default value.
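The default decision can be written down as a small, dependency-free sketch. FeatureValue is a hypothetical stand-in for one feature of one object, and isSet only imitates the default EMF behavior described above; it is not the EMF or Xtext implementation.

```java
import java.util.Objects;

class TransientSketch {
    // Stand-in for one structural feature of one object (assumption: not an EMF type).
    static class FeatureValue {
        final boolean declaredTransient;  // mirrors eStructuralFeature.isTransient()
        final Object defaultValue;
        final Object value;

        FeatureValue(boolean declaredTransient, Object defaultValue, Object value) {
            this.declaredTransient = declaredTransient;
            this.defaultValue = defaultValue;
            this.value = value;
        }
    }

    // Imitates the default eIsSet(..): a value equal to the default counts as unset.
    static boolean isSet(FeatureValue feature) {
        return !Objects.equals(feature.value, feature.defaultValue);
    }

    // Transient if declared transient in the metamodel, or if the value is unset.
    static boolean isTransient(FeatureValue feature) {
        return feature.declaredTransient || !isSet(feature);
    }
}
```

An int feature with default 0 and value 0 is therefore skipped during serialization, while the same feature with value 5 has to be consumed by some assignment in the grammar.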

Unassigned Text

If there are calls of data type rules or terminal rules that do not reside in an assignment, the serializer by default doesn't know which value to use for serialization.

Example:

PluralRule:
  'contents:' count=INT Plural;
  
terminal Plural: 
  'item' | 'items';

Valid models for this example are contents: 1 item or contents: 5 items. However, it is not stored in the semantic model whether the keyword item or items has been parsed. This is due to the fact that the rule call Plural is unassigned. However, the parse tree constructor needs to decide which value to write during serialization. This decision can be made by customizing the IValueSerializer.serializeUnassignedValue(EObject, RuleCall, INode) (src).
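For this example, the decision itself could look like the following sketch. PluralChooser is a hypothetical helper that contains only the decision logic; a real customization would receive the EObject, the RuleCall and the INode, and would read count from the semantic object.

```java
// Hypothetical helper: holds only the decision logic, not the Xtext service itself.
class PluralChooser {
    // Derives the text for the unassigned Plural rule call from the count feature.
    static String pluralFor(int count) {
        return count == 1 ? "item" : "items";
    }

    // Builds the full text of PluralRule for a given count, for demonstration only.
    static String serialize(int count) {
        return "contents: " + count + " " + pluralFor(count);
    }
}
```

Deriving the keyword from count keeps the output consistent even when count was changed in memory after parsing.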

Cross-Reference Serializer

The cross-reference serializer specifies which values are to be written to the textual representation for cross-references. This behavior can be customized by implementing ITokenSerializer.ICrossReferenceSerializer (src). The default implementation delegates to various other services such as the IScopeProvider (src) or the LinkingHelper (src) each of which may be the better place for customization.

Merge White Space

After the parse tree constructor has done its job to create a stream of tokens which are to be written to the textual representation, and the comment associator has done its work, existing white space from the node model is merged into the stream.

The strategy is as follows: If two tokens follow each other in the stream and the corresponding nodes in the node model follow each other as well, then the white space information in between is kept. In all other cases it is up to the formatter to calculate new white space information.
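A minimal sketch of this strategy, independent of Xtext: every token remembers the index of its node in the old node model (or -1 if it is new), and a single fixed string stands in for whatever the formatter would compute. Token, merge and formatterGap are hypothetical names.

```java
import java.util.List;

class WhitespaceMergeSketch {
    static class Token {
        final String text;
        final int nodeIndex;   // position of the node in the old node model, -1 for new tokens

        Token(String text, int nodeIndex) {
            this.text = text;
            this.nodeIndex = nodeIndex;
        }
    }

    // oldGaps.get(i) is the white space between old node i and old node i + 1.
    static String merge(List<Token> tokens, List<String> oldGaps, String formatterGap) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            Token current = tokens.get(i);
            if (i > 0) {
                Token previous = tokens.get(i - 1);
                // Keep old white space only if both nodes were neighbors before as well.
                boolean neighborsBefore = previous.nodeIndex >= 0
                        && current.nodeIndex == previous.nodeIndex + 1;
                out.append(neighborsBefore ? oldGaps.get(previous.nodeIndex) : formatterGap);
            }
            out.append(current.text);
        }
        return out.toString();
    }
}
```

If a new token x is inserted between the old tokens a and b, the white space around x comes from the formatter stand-in, while the original white space between b and its old successor is kept.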

Token Stream

The parse tree constructor and the formatter use an ITokenStream (src) for their output, and the latter for its input as well. This allows for chaining the two components. Token streams can be converted to a String using the TokenStringBuffer (src) and to a Writer using the WriterTokenStream (src).

public interface ITokenStream {

  void flush() throws IOException;
  void writeHidden(EObject grammarElement, String value);
  void writeSemantic(EObject grammarElement, String value);
}
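
The chaining of components can be illustrated with a dependency-free sketch. SimpleTokenStream is a stand-in for ITokenStream that uses Object instead of EObject so that the code compiles without EMF; BufferStream and SingleSpaceStream are hypothetical classes, not the TokenStringBuffer or the Xtext formatter.

```java
import java.io.IOException;

// Stand-in for ITokenStream (assumption: Object replaces EObject to avoid dependencies).
interface SimpleTokenStream {
    void flush() throws IOException;
    void writeHidden(Object grammarElement, String value);
    void writeSemantic(Object grammarElement, String value);
}

// Sink at the end of the chain: collects all tokens into a String.
class BufferStream implements SimpleTokenStream {
    private final StringBuilder buffer = new StringBuilder();

    public void flush() { }
    public void writeHidden(Object grammarElement, String value) { buffer.append(value); }
    public void writeSemantic(Object grammarElement, String value) { buffer.append(value); }

    @Override
    public String toString() { return buffer.toString(); }
}

// Formatter-like filter: replaces every hidden token with one space, then delegates.
class SingleSpaceStream implements SimpleTokenStream {
    private final SimpleTokenStream out;

    SingleSpaceStream(SimpleTokenStream out) { this.out = out; }

    public void flush() throws IOException { out.flush(); }
    public void writeHidden(Object grammarElement, String value) { out.writeHidden(grammarElement, " "); }
    public void writeSemantic(Object grammarElement, String value) { out.writeSemantic(grammarElement, value); }
}
```

A producer writes into the SingleSpaceStream exactly as it would write into the BufferStream, which is what makes the components interchangeable and chainable.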