Lexer/Parser issue [message #635203]
Tue, 26 October 2010 00:25
Mirko Raner Messages: 125 Registered: July 2009 Location: New York City, NY
Senior Member
Hi all,
I'm struggling with the Xtext implementation of something that could be described as an SGML grammar with some weird quirks.
Specifically, the input files contain SGML-style references, i.e. "&referenceName;", as well as "&" characters that are not part of an entity reference (for example, in URLs like "http://localhost/servlet?x=1&y=2", or in C-style expressions like "WEB && !PRO"). To build a proper semantic model, it is important to parse every correct and complete entity reference as a single element. I have no specific requirements for how individual (i.e. non-entity) ampersands are parsed, but the parser must accept both uses without reporting an error.
My first approach was to define two terminals, to make sure that entity references are returned by the lexer as a single token:
terminal SEMICOLON: ';';
terminal IDENTIFIER: ('A'..'Z'|'a'..'z') ('A'..'Z'|'a'..'z'|'0'..'9')*;
terminal ENTITY_REF: AMPERSAND IDENTIFIER SEMICOLON;
terminal AMPERSAND: '&';
My idea was that the lexer would greedily match ENTITY_REF if it could, and fall back to just matching a single AMPERSAND if not. Well, I guess it doesn't work that way, because no matter in which order I declared the terminals, I didn't get the right results. Am I correct that this cannot be solved on the lexer level?
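To make the intended behavior concrete, here is a minimal sketch in plain Java. It is a hypothetical hand-written scanner (the class and method names are made up for this post), not anything generated by Xtext or ANTLR: at each '&' it first tries to match a complete entity reference and falls back to a single ampersand token if that fails.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hand-written scanner; NOT Xtext- or ANTLR-generated code. */
final class EntityScanner {

    /** Splits input into "ENTITY_REF:...", "AMPERSAND:&" and "TEXT:..." tokens. */
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '&') {
                flush(text, tokens);
                int end = entityEnd(input, i);
                if (end > 0) {
                    // A complete "&name;" was found: emit it as one token.
                    tokens.add("ENTITY_REF:" + input.substring(i, end));
                    i = end;
                } else {
                    // Fallback: the '&' stands on its own.
                    tokens.add("AMPERSAND:&");
                    i++;
                }
            } else {
                text.append(c);
                i++;
            }
        }
        flush(text, tokens);
        return tokens;
    }

    /** Returns the index just past ';' if a complete "&name;" starts at start, else -1. */
    static int entityEnd(String s, int start) {
        int i = start + 1;
        if (i >= s.length() || !isLetter(s.charAt(i))) {
            return -1;
        }
        i++;
        while (i < s.length() && (isLetter(s.charAt(i)) || isDigit(s.charAt(i)))) {
            i++;
        }
        return (i < s.length() && s.charAt(i) == ';') ? i + 1 : -1;
    }

    // ASCII-only checks, mirroring the IDENTIFIER terminal above.
    private static boolean isLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Emits any pending plain text as a TEXT token and clears the buffer. */
    private static void flush(StringBuilder text, List<String> tokens) {
        if (text.length() > 0) {
            tokens.add("TEXT:" + text);
            text.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(scan("WEB && !PRO"));
        System.out.println(scan("http://localhost/servlet?x=1&y=2"));
        System.out.println(scan("a &referenceName; b"));
    }
}
```

With this scheme, "WEB && !PRO" yields two AMPERSAND tokens, "x=1&y=2" yields one, and "&referenceName;" comes back as a single ENTITY_REF token.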
I switched tactics to using parser rules instead, which parsed the entities just fine, but I kept having difficulties with recognizing things like the non-entity "&&" and "&y=2" uses.
As always, I can't post the grammar because it's proprietary and owned by my employer. If necessary I can probably piece together a simplified grammar that demonstrates my problems, but I was hoping for some general guidance about how to solve these sorts of issues. We're using Xtext 1.0.1 with the ANTLR generator.
Thanks in advance,
Mirko
Re: Lexer/Parser issue [message #635477 is a reply to message #635230]
Wed, 27 October 2010 01:14
Mirko Raner Messages: 125 Registered: July 2009 Location: New York City, NY
Senior Member
Thanks, Meinte. As you suggested I refocused my efforts on parser rules, and I got a little further.
I was unaware that, in an alternative, the order of the different options actually matters (i.e., Rule1|Rule2|Rule3 is not equivalent to Rule3|Rule2|Rule1, especially when backtracking is enabled and there are potentially ambiguous rules). After re-ordering some of my alternatives, I got my parser to parse &entity; as well as uses of "&" that don't match that pattern.
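The effect of alternative ordering can be modeled with a small Java sketch. It assumes first-match-wins semantics (roughly what happens when backtracking is enabled and several alternatives could match) and is a simplified illustration, not ANTLR's actual prediction algorithm; the Rule interface and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of ordered alternatives; NOT Xtext- or ANTLR-generated code. */
final class OrderedChoice {

    /** A rule returns the index just past its match at pos, or -1 if it does not match. */
    interface Rule {
        int match(String input, int pos);
    }

    /** Matches a single '&'. */
    static final Rule AMPERSAND =
        (s, p) -> (p < s.length() && s.charAt(p) == '&') ? p + 1 : -1;

    /** Matches a complete "&name;" reference (ASCII letter, then ASCII letters or digits). */
    static final Rule ENTITY_REF = (s, p) -> {
        if (p >= s.length() || s.charAt(p) != '&') {
            return -1;
        }
        int i = p + 1;
        if (i >= s.length() || !isLetter(s.charAt(i))) {
            return -1;
        }
        i++;
        while (i < s.length() && (isLetter(s.charAt(i)) || isDigit(s.charAt(i)))) {
            i++;
        }
        return (i < s.length() && s.charAt(i) == ';') ? i + 1 : -1;
    };

    private static boolean isLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Returns the text consumed by the FIRST alternative that matches at pos, or null. */
    static String choose(List<Rule> alternatives, String input, int pos) {
        for (Rule rule : alternatives) {
            int end = rule.match(input, pos);
            if (end >= 0) {
                return input.substring(pos, end);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Entity first: the whole reference is consumed.
        System.out.println(choose(Arrays.asList(ENTITY_REF, AMPERSAND), "&entity;", 0));
        // Ampersand first: the shorter alternative wins and the entity is never tried.
        System.out.println(choose(Arrays.asList(AMPERSAND, ENTITY_REF), "&entity;", 0));
    }
}
```

In this model, listing the more specific alternative (the entity reference) before the more general one (the bare ampersand) is what lets both "&entity;" and "&&" be recognized.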
Now I'm stuck with a slightly different problem. The following grammar parses everything that I need it to parse:
EntityReference: entity=EntityRef;
EntityRef: AMPERSAND IDENTIFIER SEMICOLON;
Special: text=(QUOTE|SINGLEQUOTE|DASHDASH|GT);
Text:
    (entities+=EntityReference|specials+=Special|text+=PlainText)+;
PlainText: {PlainText}
    (PCDATA|INDEX|COLON|DASH|DOT|SEMICOLON|AMPERSAND
    |NameOrKeyword|'='|'['|']'|'\\');
NameOrKeyword: IDENTIFIER|Keywords;
However, with this Text rule I'm losing easy access to the ordered token stream, because everything gets separated into entities, special characters, and other text. To just get an ordered sequence of EObjects I changed the Text rule as follows:
Text:
elements+=(EntityReference|Special|PlainText)+;
However, that version of the grammar no longer parses individual "&" characters; the parser reports: required (...)+ loop did not match anything at input '&'
Somehow, the individual "&" (part of the PlainText production) is properly recognized when EntityReferences, Specials, and PlainText are stored in separate lists, but not when they are all stored in a single list of EObjects.
Any ideas why that's the case and how I could work around it?