Lexer/Parser issue [message #635203] Tue, 26 October 2010 00:25
Mirko Raner
Hi all,

I'm struggling with the Xtext implementation of something that could be described as an SGML grammar with some weird quirks.

Specifically, the input files contain SGML-style references, i.e. "&referenceName;", as well as "&" characters that are not part of an entity reference (for example, in URLs like "http://localhost/servlet?x=1&y=2", or in C-style expressions like "WEB && !PRO"). To build a proper semantic model, it is important to parse all correct and complete entity references as a single element. I have no specific requirements for how individual (i.e. non-entity) ampersands are parsed, but the parser must understand both uses without issuing an error.

My first approach was to define two terminals, to make sure that entity references are returned by the lexer as a single token:

terminal SEMICOLON: ';';
terminal IDENTIFIER: ('A'..'Z'|'a'..'z') ('A'..'Z'|'a'..'z'|'0'..'9')*;
terminal ENTITY_REF: AMPERSAND IDENTIFIER SEMICOLON;
terminal AMPERSAND: '&';

My idea was that the lexer would greedily match ENTITY_REF if it could, and fall back to matching a single AMPERSAND if not. Well, I guess it doesn't work that way, because no matter in which order I declared the terminals, I didn't get the right results. Am I correct that this cannot be solved at the lexer level?
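
To make the intended behavior concrete, here is a minimal sketch in plain Python (not Xtext, and not a description of how the generated ANTLR lexer works). It tries a complete entity reference first and falls back to a lone ampersand when the full pattern does not match. The token names mirror the terminals above; the OTHER catch-all is an assumption added only so that every character is tokenized.

```python
import re

# Token names mirror the terminals in the post; OTHER is a hypothetical
# catch-all so that no input character is skipped.
TOKEN_SPEC = [
    ("ENTITY_REF", r"&[A-Za-z][A-Za-z0-9]*;"),  # tried first: complete reference
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),
    ("SEMICOLON", r";"),
    ("AMPERSAND", r"&"),                         # fallback: lone ampersand
    ("OTHER", r"."),                             # any other single character
]

# Python's regex alternation is ordered and backtracks, so a failed
# ENTITY_REF attempt falls through to the later alternatives.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,
)


def tokenize(text):
    """Return a list of (token_name, lexeme) pairs covering the whole input."""
    return [(match.lastgroup, match.group()) for match in MASTER.finditer(text)]


if __name__ == "__main__":
    for sample in ["&referenceName;", "x=1&y=2", "WEB && !PRO"]:
        print(sample, "->", tokenize(sample))
```

The fallback works here only because the regex engine backtracks after a partial ENTITY_REF match fails; whether a generated lexer does the same depends on the lexer generator and its options.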

I switched tactics to using parser rules instead, which parsed the entities just fine, but I kept having difficulties with recognizing things like the non-entity "&&" and "&y=2" uses.

As always, I can't post the grammar because it's proprietary and owned by my employer. If necessary I can probably piece together a simplified grammar that demonstrates my problems, but I was hoping for some general guidance about how to solve these sorts of issues. We're using Xtext 1.0.1 with the ANTLR generator.

Thanks in advance,

Mirko
Re: Lexer/Parser issue [message #635230 is a reply to message #635203] Tue, 26 October 2010 06:32
Meinte Boersma
Usually, this stuff is best solved at the parser level (i.e., post-lexing), possibly with the use of data types and value converters.

Could you be a bit more specific about 'I kept having difficulties with recognizing things like the non-entity "&&" and "&y=2" uses.'? A simplified/obfuscated version of your grammar which retains the problems you're facing would help us in helping you.


Re: Lexer/Parser issue [message #635477 is a reply to message #635230] Wed, 27 October 2010 01:14
Mirko Raner
Thanks, Meinte. As you suggested I refocused my efforts on parser rules, and I got a little further.
I was unaware that, in an alternative, the order of the different options actually matters (i.e., Rule1|Rule2|Rule3 is not equivalent to Rule3|Rule2|Rule1, especially when backtracking is enabled and there are potentially ambiguous rules). After re-ordering some of my alternatives, I got my parser to parse &entity; as well as uses of "&" that don't match that pattern.
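
The effect of alternative ordering can be sketched with a toy backtracking parser in plain Python. This is only an illustration of ordered choice, not the parser Xtext generates; the rule functions, the token list, and the assumption that PlainText accepts any single token are all hypothetical simplifications.

```python
def entity_ref(tokens, pos):
    """EntityRef: AMPERSAND IDENTIFIER SEMICOLON (returns None if it does not match)."""
    kinds = [kind for kind, _ in tokens[pos:pos + 3]]
    if kinds == ["AMPERSAND", "IDENTIFIER", "SEMICOLON"]:
        return ("EntityReference", tokens[pos + 1][1]), pos + 3
    return None


def plain_text(tokens, pos):
    """PlainText: simplified here to accept any single token."""
    if pos < len(tokens):
        return ("PlainText", tokens[pos][1]), pos + 1
    return None


def parse(tokens, alternatives):
    """Repeatedly apply the first alternative that matches at the current position."""
    elements, pos = [], 0
    while pos < len(tokens):
        for rule in alternatives:
            result = rule(tokens, pos)
            if result is not None:
                node, pos = result
                elements.append(node)
                break
        else:
            raise ValueError(f"no alternative matched at token {pos}")
    return elements


# Token stream for the input "&amp;&y"
TOKENS = [
    ("AMPERSAND", "&"), ("IDENTIFIER", "amp"), ("SEMICOLON", ";"),
    ("AMPERSAND", "&"), ("IDENTIFIER", "y"),
]

if __name__ == "__main__":
    print(parse(TOKENS, [entity_ref, plain_text]))  # entity rule tried first
    print(parse(TOKENS, [plain_text, entity_ref]))  # plain text tried first
```

With the entity rule first, the leading three tokens become one EntityReference; with PlainText first, the entity rule is never reached and every token ends up as PlainText.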

Now I'm stuck with a slightly different problem. The following grammar parses everything that I need it to parse:

EntityReference: entity=EntityRef;
EntityRef: AMPERSAND IDENTIFIER SEMICOLON;
Special: text=(QUOTE|SINGLEQUOTE|DASHDASH|GT);
Text:
(entities+=EntityReference|specials+=Special|text+=PlainText )+;
PlainText: {PlainText}
(PCDATA|INDEX|COLON|DASH|DOT|SEMICOLON|AMPERSAND
|NameOrKeyword|'='|'['|']'|'\\');
NameOrKeyword: IDENTIFIER|Keywords;

However, for the Text rule, I'm losing easy access to the ordered token stream because everything gets separated into entities, special characters, and other text. To just get an ordered sequence of EObjects I changed the Text rule as follows:

Text:
elements+=(EntityReference|Special|PlainText)+;

However, that version of the grammar no longer parses individual "&" characters: required (...)+ loop did not match anything at input '&'
Somehow, the individual "&" (part of the PlainText production) is properly recognized when EntityReferences, Specials, and PlainText are stored in separate lists, but not when they are all stored in a single list of EObjects.

Any ideas why that's the case and how I could work around it?
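
Why the single list matters for the model can be shown with a small Python sketch. It only models the shape of the resulting data (three feature lists versus one ordered list) and says nothing about the parser behavior itself; the node tuples and function names are hypothetical.

```python
from collections import defaultdict


def separate_lists(nodes):
    """Model of entities+=... | specials+=... | text+=...: one list per kind."""
    buckets = defaultdict(list)
    for kind, value in nodes:
        buckets[kind].append(value)
    return dict(buckets)


def single_list(nodes):
    """Model of elements+=(...): one list, parse order preserved."""
    return list(nodes)


# Two different inputs that contain the same elements in a different order.
first = [("PlainText", "a"), ("EntityReference", "amp"), ("PlainText", "b")]
second = [("PlainText", "a"), ("PlainText", "b"), ("EntityReference", "amp")]

if __name__ == "__main__":
    # The per-kind lists cannot tell the two inputs apart ...
    print(separate_lists(first) == separate_lists(second))  # True
    # ... while the single ordered list can.
    print(single_list(first) == single_list(second))        # False
```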

Re: Lexer/Parser issue [message #635495 is a reply to message #635477] Wed, 27 October 2010 06:40
Meinte Boersma
elements+=(EntityReference|Special|PlainText)+

might not be the same thing as
(elements+=(EntityReference|Special|PlainText))+

so you might want to try the latter form.


Re: Lexer/Parser issue [message #635519 is a reply to message #635495] Wed, 27 October 2010 07:32
Sebastian Zarnekow
Hi Meinte,

they mean exactly the same thing.

Regards,
Sebastian
--
Need professional support for Eclipse Modeling?
Go visit: http://xtext.itemis.com

Re: Lexer/Parser issue [message #635689 is a reply to message #635519] Wed, 27 October 2010 17:57
Mirko Raner
Thanks, Meinte and Sebastian.
I tried the extra parentheses, and, in line with Sebastian's comment, it didn't make any difference. I came up with the following variation that finally works for all my test cases:

Text:
(elements+=EntityReference|elements+=Special|elements+=PlainText)+;

Intuitively, I would assume that this would have the same semantics as

Text:
elements+=(EntityReference|Special|PlainText)+;

But reality proves that that's not the case. Can someone shed some light on why the two variations generate different parsers that have different behavior?
It seems like I'm missing an important point about the Xtext grammar description language here...
Re: Lexer/Parser issue [message #635707 is a reply to message #635689] Wed, 27 October 2010 19:08
Sebastian Zarnekow
Hi Mirko,

It should make no difference. Could you please file a bug with two grammars attached that illustrate the issue?

Regards,
Sebastian

Re: Lexer/Parser issue [message #636506 is a reply to message #635707] Mon, 01 November 2010 15:42
Mirko Raner
Hi Sebastian,

I filed https://bugs.eclipse.org/329125 and attached some projects that illustrate this behavior.

HTH,

Mirko