Eclipse Community Forums: TMF (Xtext)

Lexer/Parser issue

Mon, 25 October 2010 20:25

Eclipse User

Hi all,

I'm struggling with the Xtext implementation of something that could be described as an SGML grammar with some weird quirks.

Specifically, the input files contain SGML-style entity references, i.e. "&referenceName;", as well as "&" characters that are not part of an entity reference (for example, in URLs like "http://localhost/servlet?x=1&y=2", or in C-style expressions like "WEB && !PRO"). To build a proper semantic model, it is important to parse every correct and complete entity reference as a single element. I have no specific requirements for how individual (i.e. non-entity) ampersands are parsed, but the parser must accept both uses without issuing an error.

My first approach was to define two terminals, to make sure that entity references are returned by the lexer as a single token:

terminal SEMICOLON: ';';
terminal IDENTIFIER: ('A'..'Z'|'a'..'z') ('A'..'Z'|'a'..'z'|'0'..'9')*;
terminal ENTITY_REF: AMPERSAND IDENTIFIER SEMICOLON;
terminal AMPERSAND: '&';

My idea was that the lexer would greedily match ENTITY_REF if it could, and fall back to matching a single AMPERSAND if not. Well, I guess it doesn't work that way, because no matter in which order I declared the terminals, I didn't get the right results. Am I correct that this cannot be solved at the lexer level?
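
To make the intended behavior concrete, here is a minimal Python sketch of the tokenization described above. It is not Xtext or ANTLR code and says nothing about how the generated lexer actually decides; the pattern names simply mirror the terminals, and the OTHER catch-all is an assumption added so that every character yields a token.

```python
import re

# Hypothetical patterns mirroring the terminals above. ENTITY_REF is listed
# first so it is tried before the bare AMPERSAND; OTHER is an assumed
# catch-all for any character not covered by the other patterns.
TOKEN_RE = re.compile(
    r"(?P<ENTITY_REF>&[A-Za-z][A-Za-z0-9]*;)"
    r"|(?P<AMPERSAND>&)"
    r"|(?P<IDENTIFIER>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<SEMICOLON>;)"
    r"|(?P<OTHER>.)",
    re.DOTALL,
)


def tokenize(text):
    """Return (token_kind, lexeme) pairs, falling back to AMPERSAND
    whenever a complete entity reference cannot be matched."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for sample in ("&referenceName;", "WEB && !PRO", "x=1&y=2"):
        print(sample, "->", tokenize(sample))
```

The regex engine retries the next alternative when ENTITY_REF fails partway through, which is exactly the fallback I was hoping the generated lexer would perform.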

I switched tactics to using parser rules instead, which parsed the entities just fine, but I kept having difficulties with recognizing things like the non-entity "&&" and "&y=2" uses.

As always, I can't post the grammar because it's proprietary and owned by my employer. If necessary I can probably piece together a simplified grammar that demonstrates my problems, but I was hoping for some general guidance about how to solve these sorts of issues. We're using Xtext 1.0.1 with the ANTLR generator.

Thanks in advance,

Mirko

Re: Lexer/Parser issue [message #635230 is a reply to message #635203]

Tue, 26 October 2010 02:32

Eclipse User

Usually, this stuff is best solved at the parser level (i.e., post-lexing), possibly with the use of data types and value converters.

Could you be a bit more specific about 'I kept having difficulties with recognizing things like the non-entity "&&" and "&y=2" uses.'? A simplified/obfuscated version of your grammar which retains the problems you're facing would help us in helping you.

Re: Lexer/Parser issue [message #635477 is a reply to message #635230]

Tue, 26 October 2010 21:14

Eclipse User

Thanks, Meinte. As you suggested I refocused my efforts on parser rules, and I got a little further.
I was unaware that, in an alternative, the order of the different options actually matters (i.e., Rule1|Rule2|Rule3 is not equivalent to Rule3|Rule2|Rule1, especially when backtracking is enabled and there are potentially ambiguous rules). After re-ordering some of my alternatives, I got my parser to parse &entity; as well as uses of "&" that don't match that pattern.
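
The effect of alternative order can be illustrated with a small Python sketch of first-match-wins choice over a token list. This is only a toy model under my own assumptions (the function names and the token tuples are made up for illustration); the prediction and backtracking logic in the ANTLR-generated parser is considerably more involved.

```python
def match_entity_ref(tokens, pos):
    """Match AMPERSAND IDENTIFIER SEMICOLON; return (node, new_pos) or None."""
    kinds = [kind for kind, _ in tokens[pos:pos + 3]]
    if kinds == ["AMPERSAND", "IDENTIFIER", "SEMICOLON"]:
        return ("EntityRef", tokens[pos + 1][1]), pos + 3
    return None


def match_plain(tokens, pos):
    """Match any single token as plain text; return (node, new_pos) or None."""
    if pos < len(tokens):
        return ("Plain", tokens[pos][1]), pos + 1
    return None


def parse_text(tokens, alternatives):
    """Repeatedly apply the first alternative that matches (ordered choice)
    and collect the results in a single ordered list."""
    pos, elements = 0, []
    while pos < len(tokens):
        for alternative in alternatives:
            result = alternative(tokens, pos)
            if result is not None:
                node, pos = result
                elements.append(node)
                break
        else:
            raise SyntaxError(f"no alternative matched at token {pos}")
    return elements


# Hypothetical token stream for the input "&amp;&&".
TOKENS = [
    ("AMPERSAND", "&"), ("IDENTIFIER", "amp"), ("SEMICOLON", ";"),
    ("AMPERSAND", "&"), ("AMPERSAND", "&"),
]

if __name__ == "__main__":
    print(parse_text(TOKENS, [match_entity_ref, match_plain]))
    print(parse_text(TOKENS, [match_plain, match_entity_ref]))
```

With the entity alternative first, the leading three tokens become one EntityRef node; with the plain-text alternative first, every token is consumed as plain text and the entity is never recognized.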

Now I'm stuck with a slightly different problem. The following grammar parses everything that I need it to parse:

EntityReference: entity=EntityRef;
EntityRef: AMPERSAND IDENTIFIER SEMICOLON;
Special: text=(QUOTE|SINGLEQUOTE|DASHDASH|GT);
Text:
(entities+=EntityReference|specials+=Special|text+=PlainText )+;
PlainText: {PlainText}
(PCDATA|INDEX|COLON|DASH|DOT|SEMICOLON|AMPERSAND
|NameOrKeyword|'='|'['|']'|'\\');
NameOrKeyword: IDENTIFIER|Keywords;

However, with this Text rule I'm losing easy access to the ordered token stream, because everything gets separated into entities, special characters, and other text. To get a single ordered sequence of EObjects instead, I changed the Text rule as follows:

Text:
elements+=(EntityReference|Special|PlainText)+;

However, that version of the grammar no longer parses individual "&" characters. The parser reports: required (...)+ loop did not match anything at input '&'
Somehow, the individual "&" (part of the PlainText production) is properly recognized when EntityReferences, Specials, and PlainText are stored in separate lists, but not when they are all stored in a single list of EObjects.

Any ideas why that's the case and how I could work around it?

Re: Lexer/Parser issue [message #635495 is a reply to message #635477]

Wed, 27 October 2010 02:40

Eclipse User

elements+=(EntityReference|Special|PlainText)+

might not be the same thing as

(elements+=(EntityReference|Special|PlainText))+

so you might want to try the latter form.

Re: Lexer/Parser issue [message #635519 is a reply to message #635495]

Wed, 27 October 2010 03:32

Eclipse User

Hi Meinte,

they mean exactly the same thing.

Regards,
Sebastian
--
Need professional support for Eclipse Modeling?
Go visit: http://xtext.itemis.com


Re: Lexer/Parser issue [message #635689 is a reply to message #635519]

Wed, 27 October 2010 13:57

Eclipse User

Thanks, Meinte and Sebastian.
I tried the extra parentheses, and, in line with Sebastian's comment, it didn't make any difference. I came up with the following variation that finally works for all my test cases:

Text:
(elements+=EntityReference|elements+=Special|elements+=PlainText)+;

Intuitively, I would assume that this would have the same semantics as

Text:
elements+=(EntityReference|Special|PlainText)+;

But reality proves that that's not the case. Can someone shed some light on why the two variations generate different parsers that have different behavior?
It seems like I'm missing an important point about the Xtext grammar description language here...

Re: Lexer/Parser issue [message #635707 is a reply to message #635689]

Wed, 27 October 2010 15:08

Eclipse User

Hi Mirko,

it should make no difference. Could you please file a bug with two
grammars attached that illustrate the issue?

Regards,
Sebastian


Re: Lexer/Parser issue [message #636506 is a reply to message #635707]

Mon, 01 November 2010 11:42

Eclipse User

Hi Sebastian,

I filed https://bugs.eclipse.org/329125 and attached some projects that illustrate this behavior.

HTH,

Mirko
