Eclipse Community Forums: TMF (Xtext) » Xtext and the Antlr lexer hell

Xtext and the Antlr lexer hell [message #490999]

Mon, 12 October 2009 18:04

Jens Kuenzer

Hi,
I am trying to parse VHDL-style character literals like 'x' and attribute ticks like x'HIGH. I think the generated ANTLR lexer cannot handle this case, where the ' is both a literal delimiter and a token in its own right.

Here is my stripped-down Xtext grammar:

grammar org.xtext.example.ApoTest
  import "http://www.eclipse.org/emf/2002/Ecore" as ecore
  generate apotest "http://www.xtext.org/example/apotest"

Model hidden(WS, COMMENT) : (name+=Name ';')*;

Name: (value=IDENTIFIER | value=CHARACTER_LITERAL) (APOSTROPHE extent+=Name | '.' extent+=Name)*;

terminal CHARACTER_LITERAL : APOSTROPHE (GRAPHIC_CHARACTER|APOSTROPHE APOSTROPHE) APOSTROPHE;
terminal APOSTROPHE : "'";

terminal COMMENT  : '--' !('\n'|'\r')* ('\r'? '\n')? ;
terminal WS : (' '|'\t'|'\r'|'\n')+ ;
terminal IDENTIFIER : ('a'..'z'|'A'..'Z') ( ( '_' )? (('a'..'z'|'A'..'Z') | ('0'..'9')) )*;
terminal GRAPHIC_CHARACTER : ('a'..'z'|'A'..'Z') | ('0'..'9') | '&' | '_' ;

And here is a test case:

-- error with tick in this line:
bla ' error ;
-- working lines:
bla . noerror . good ;
ok . 'x' ;
'y' ;

Is it possible to support such a syntax with Xtext?


Re: Xtext and the Antlr lexer hell [message #491004 is a reply to message #490999]

Mon, 12 October 2009 18:29

Sebastian Zarnekow

Hi Jens,

This is not possible with terminal rules, but data type rules should work
fine.

Regards,
Sebastian
--
Need professional support for Eclipse Modeling?
Go visit: http://xtext.itemis.com
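
For readers following along, the idea behind this suggestion can be sketched outside Xtext. The toy Python model below is an illustration under assumptions, not Xtext or ANTLR code: the tokenizer emits the tick as one plain token, and a hand-written `parse_name` (a hypothetical helper that mirrors the `Name` rule from the question) decides from the tick's position whether it opens a character literal or separates an attribute.

```python
import re
import string

# Identifiers stay whole; every other non-space character is its own token,
# so the lexer never has to decide what a tick means.
TOKEN = re.compile(r"[A-Za-z](?:_?[A-Za-z0-9])*|\S")


def parse_name(tokens, pos=0):
    """Name: (IDENTIFIER | CHARACTER_LITERAL) (("'" | ".") Name)*"""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] == "'":
        # At the start of a Name, a tick can only open a character literal.
        if (pos + 2 >= len(tokens) or len(tokens[pos + 1]) != 1
                or tokens[pos + 2] != "'"):
            raise SyntaxError(f"bad character literal at token {pos}")
        value, pos = "'" + tokens[pos + 1] + "'", pos + 3
    elif tokens[pos][0] in string.ascii_letters:
        value, pos = tokens[pos], pos + 1
    else:
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    extent = []
    while pos < len(tokens) and tokens[pos] in ("'", "."):
        # After a value, a tick can only be the attribute separator.
        child, pos = parse_name(tokens, pos + 1)
        extent.append(child)
    return (value, extent), pos


def parse(text):
    tokens = TOKEN.findall(text)
    name, pos = parse_name(tokens)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return name


print(parse("bla ' error"))  # tick as attribute separator
print(parse("ok . 'x'"))     # tick as literal delimiter
```

Because whitespace is skipped between all tokens, this toy version would also accept `' x '` as a character literal.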


Re: Xtext and the Antlr lexer hell [message #491193 is a reply to message #491004]

Tue, 13 October 2009 15:35

Jens Kuenzer

Thanks for the guidance. I got this working, but I wonder why it is so tricky:

grammar org.xtext.example.ApoTest
  import "http://www.eclipse.org/emf/2002/Ecore" as ecore
  generate apotest "http://www.xtext.org/example/apotest"

Model hidden(SPACE, WS, COMMENT) : (name+=Name ';')*;

Name : isVar?=VAR? (value=IDENTIFIER | value=CHARACTER_LITERAL) (APOSTROPHE extent+=Name | DOT extent+=Name)*;

DOT hidden(SPACE, WS, COMMENT) : DOT_CHAR;
APOSTROPHE hidden(SPACE, WS, COMMENT) : APOSTROPHE_CHAR;

REHIDE hidden(SPACE, WS, COMMENT) : "^"?;

CHARACTER_LITERAL hidden() : APOSTROPHE_CHAR (GRAPHIC_CHARACTER | APOSTROPHE_CHAR APOSTROPHE_CHAR) APOSTROPHE_CHAR REHIDE;
IDENTIFIER hidden() : CHARACTER ( ( UNDERSCORE )? (CHARACTER | DIGIT) )* REHIDE;
GRAPHIC_CHARACTER : CHARACTER | DIGIT | DOT_CHAR | UNDERSCORE | SPACE | OTHER_CHAR;

terminal COMMENT  : '--' !('\n'|'\r')* ('\r'? '\n')? ;
terminal WS : ('\t'|'\r'|'\n')+ ;
terminal SPACE : ' ';

terminal VAR : ('V'|'v')('A'|'a')('R'|'r');

terminal UNDERSCORE : '_';
terminal APOSTROPHE_CHAR : "'";
terminal DOT_CHAR : '.';
terminal CHARACTER : ('a'..'z'|'A'..'Z');
terminal DIGIT : ('0'..'9');
terminal OTHER_CHAR : '/' | ':' | ';' | '<' | '=' | '>' | '|'
 | '\\' | '*' | '#' | '[' | ']' | '&' | '\'' | '(' | ')' | '+' | ',' | '-';

I still need to find a way to handle input like:

 invar ;

Is there a way to define an empty REHIDE rule to work around the hidden() bug?


Re: Xtext and the Antlr lexer hell [message #491331 is a reply to message #491193]

Wed, 14 October 2009 07:40

Sebastian Zarnekow

Hi Jens,

It is a bug, and local hidden tokens should not require this kind of
virtual rule. However, I'm afraid there is no other workaround.
Did you file a Bugzilla entry?

Regards,
Sebastian

Re: Xtext and the Antlr lexer hell [message #491395 is a reply to message #491331]

Wed, 14 October 2009 12:42

Jens Kuenzer

OK, the hidden() bug is now in Bugzilla.

But back to my never-ending lexer problems:
I don't think data type rules can replace a good lexer.

I now have problems parsing these identifiers:

varray ;
v2 ;

The first one is parsed as "var" followed by "ray" instead of as a single identifier.
The second one simply fails.

Here is my current version of the grammar:

grammar org.xtext.example.ApoTest
  import "http://www.eclipse.org/emf/2002/Ecore" as ecore
  generate apotest "http://www.xtext.org/example/apotest"

Model hidden(SPACE, WS, COMMENT) : (name+=Name ';')*;

Name : isVar?=VAR? (value=IDENTIFIER | value=CHARACTER_LITERAL) (APOSTROPHE extent+=Name | DOT extent+=Name)*;

DOT hidden(SPACE, WS, COMMENT) : DOT_CHAR;
APOSTROPHE hidden(SPACE, WS, COMMENT) : APOSTROPHE_CHAR;

REHIDE hidden(SPACE, WS, COMMENT) : "^"?;

CHARACTER_LITERAL hidden() : APOSTROPHE_CHAR (GRAPHIC_CHARACTER | APOSTROPHE_CHAR APOSTROPHE_CHAR) APOSTROPHE_CHAR REHIDE;
IDENTIFIER hidden() : CHARACTER ( ( UNDERSCORE )? (CHARACTER | DIGIT) )* REHIDE;
GRAPHIC_CHARACTER : CHARACTER | DIGIT | DOT_CHAR | UNDERSCORE | SPACE | OTHER_CHAR;
VAR hidden(): V A R REHIDE;
CHARACTER : (A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z);

terminal COMMENT  : '--' !('\n'|'\r')* ('\r'? '\n')? ;
terminal WS : ('\t'|'\r'|'\n')+ ;
terminal SPACE : ' ';

terminal A : ('a'|'A');
terminal B : ('b'|'B');
terminal C : ('c'|'C');
terminal D : ('d'|'D');
terminal E : ('e'|'E');
terminal F : ('f'|'F');
terminal G : ('g'|'G');
terminal H : ('h'|'H');
terminal I : ('i'|'I');
terminal J : ('j'|'J');
terminal K : ('k'|'K');
terminal L : ('l'|'L');
terminal M : ('m'|'M');
terminal N : ('n'|'N');
terminal O : ('o'|'O');
terminal P : ('p'|'P');
terminal Q : ('q'|'Q');
terminal R : ('r'|'R');
terminal S : ('s'|'S');
terminal T : ('t'|'T');
terminal U : ('u'|'U');
terminal V : ('v'|'V');
terminal W : ('w'|'W');
terminal X : ('x'|'X');
terminal Y : ('y'|'Y');
terminal Z : ('z'|'Z');

terminal UNDERSCORE : '_';
terminal APOSTROPHE_CHAR : "'";
terminal DOT_CHAR : '.';
terminal DIGIT : ('0'..'9');
terminal OTHER_CHAR : '/' | ':' | ';' | '<' | '=' | '>' | '|'
 | '\\' | '*' | '#' | '[' | ']' | '&' | '\'' | '(' | ')'
 | '+' | ',' | '-';

Is there a way to use Xtext with a better external lexer?
Or are there options such as greediness in Xtext?
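
One way to read the "varray" result is sketched below in Python. It assumes that, with one terminal per letter, the parser sees a separate token for every character and takes the optional VAR prefix of the Name rule whenever the next three letter tokens spell it. The function name `parse_name_prefix` is made up for this illustration, and the sketch does not attempt to explain why "v2" fails.

```python
def parse_name_prefix(word):
    """Model `Name : isVar?=VAR? value=IDENTIFIER` over one token per character."""
    # The per-letter terminals A..Z match both cases, so compare upper-cased.
    tokens = [c.upper() for c in word]
    # VAR is the data type rule `V A R`; as an optional prefix it is tried first.
    is_var = tokens[:3] == ["V", "A", "R"]
    # Whatever is left over is what IDENTIFIER gets to consume.
    value = word[3:] if is_var else word
    return is_var, value


for word in ("varray", "invar"):
    print(word, parse_name_prefix(word))
```

Under this reading, nothing tells the parser that the letters belong to one longer word, which is the job a lexer with longest-match behaviour would normally do.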


Re: Xtext and the Antlr lexer hell [message #491496 is a reply to message #491395]

Wed, 14 October 2009 19:31

Sebastian Zarnekow

Hi Jens,

There are plenty of possibilities to tweak Xtext and the way it
instantiates models.
The 'varSomething' example may be solved with a custom IAstFactory
implementation, for example.

Maybe you could outline the actual use case, so that we can try to match it
to the Xtext concepts.

Regards,
Sebastian

Re: Xtext and the Antlr lexer hell [message #491637 is a reply to message #491496]

Thu, 15 October 2009 10:38

Jens Kuenzer

The use case is a VHDL compiler, and my problem is rooted in the lexical analysis, which cannot handle both a character literal like 'x' and an attribute marker like name'attribute.
The ANTLR lexer tokenizes "name'attribute" as the identifier token "name" followed by a lexical error in a character literal " 'at ", instead of the tick token " ' " and the identifier "attribute". I am not sure whether this is a bug in ANTLR, in Xtext, or in my description of the lexical tokens. I do not think data type rules can help, because they introduce much more trouble in detecting normal tokens.
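
The behaviour described above can be reproduced with a small model. The Python sketch below is not ANTLR's generated code. It assumes a lexer that picks a rule from the tick plus one further character and then commits to that rule without backtracking. The names `lex_committed` and `LexError` are made up for the illustration, and the identifier rule is simplified.

```python
import string

LETTERS = set(string.ascii_letters)
IDENT_TAIL = LETTERS | set(string.digits) | {"_"}      # simplified IDENTIFIER tail
GRAPHIC = LETTERS | set(string.digits) | {"&", "_"}    # GRAPHIC_CHARACTER, first grammar


class LexError(Exception):
    pass


def lex_committed(text):
    """Toy lexer: choose a rule from limited lookahead, then commit to it."""
    tokens, i = [], 0
    while i < len(text):
        c = text[i]
        if c in " \t\r\n":
            i += 1
        elif c in LETTERS:
            j = i + 1
            while j < len(text) and text[j] in IDENT_TAIL:
                j += 1
            tokens.append(("IDENTIFIER", text[i:j]))
            i = j
        elif c == "'":
            if text[i + 1:i + 2] in GRAPHIC:
                # Tick + graphic character looks like CHARACTER_LITERAL: commit.
                if text[i + 2:i + 3] != "'":
                    raise LexError(f"bad character literal {text[i:i + 3]!r}")
                tokens.append(("CHARACTER_LITERAL", text[i:i + 3]))
                i += 3
            else:
                tokens.append(("APOSTROPHE", c))
                i += 1
        elif c in ".;":
            tokens.append((c, c))
            i += 1
        else:
            raise LexError(f"unexpected character {c!r}")
    return tokens


print(lex_committed("ok . 'x' ;"))
try:
    lex_committed("name'attribute")
except LexError as err:
    print("lexical error:", err)
```

In this model the lexer has no way to know that the tick follows an identifier, so the decision between the two readings is made on characters alone.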


Re: Xtext and the Antlr lexer hell [message #492684 is a reply to message #491637]

Wed, 21 October 2009 11:52

Jens Kuenzer

Hi, after reverting some of the changes I now have a solution:

grammar org.xtext.example.ApoTest
  import "http://www.eclipse.org/emf/2002/Ecore" as ecore
  generate apotest "http://www.xtext.org/example/apotest"

Model hidden(SPACE, WS, COMMENT) : (name+=Name ';')*;

Name : isVar?=VAR? (value=IDENTIFIER | value=CHARACTER_LITERAL) (APOSTROPHE extent+=Name | DOT extent+=Name)*;

DOT hidden(SPACE, WS, COMMENT) : DOT_CHAR;
APOSTROPHE hidden(SPACE, WS, COMMENT) : APOSTROPHE_CHAR;

REHIDE hidden(SPACE, WS, COMMENT) : "^"?;

CHARACTER_LITERAL hidden() : APOSTROPHE_CHAR (GRAPHIC_CHARACTER | APOSTROPHE_CHAR APOSTROPHE_CHAR) APOSTROPHE_CHAR REHIDE;
GRAPHIC_CHARACTER : CHARACTER | DIGIT | DOT_CHAR | UNDERSCORE | SPACE | OTHER_CHAR;
IDENTIFIER : CHARACTER | LONG_IDENTIFIER;

terminal COMMENT  : '--' !('\n'|'\r')* ('\r'? '\n')? ;
terminal WS : ('\t'|'\r'|'\n')+ ;
terminal SPACE : ' ';

terminal VAR : ("v"|"V")("a"|"A")("r"|"R");

terminal UNDERSCORE : '_';
terminal APOSTROPHE_CHAR : "'";
terminal DOT_CHAR : '.';
terminal CHARACTER : ('a'..'z')|('A'..'Z');
terminal DIGIT : ('0'..'9');
terminal OTHER_CHAR : '/' | ':' | ';' | '<' | '=' | '>' | '|'
 | '\\' | '*' | '#' | '[' | ']' | '&' | '\'' | '(' | ')'
 | '+' | ',' | '-';

terminal LONG_IDENTIFIER : CHARACTER ( ( UNDERSCORE )? (CHARACTER | DIGIT) )*;

The problem was changing the IDENTIFIER rule into a data type rule.
Once I reverted most of it back to a terminal rule, it seems to work.
The trick here is the remaining IDENTIFIER data type rule (CHARACTER | LONG_IDENTIFIER), which is needed because the CHARACTER rule wins over the LONG_IDENTIFIER rule for a single letter. Although a single character also matches LONG_IDENTIFIER, that match is hidden by the CHARACTER rule.
Maybe the Xtext documentation on terminal rules could clarify which terminal rule matches first. Another good idea would be a warning when part of a terminal rule is hidden by other terminal rules.
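
The precedence described here can be modelled as "longest match wins, ties go to the rule declared first". That model is an assumption that is consistent with the observations in this post, not a statement about ANTLR internals. The Python sketch below applies it to a subset of the terminals from the grammar above, in their declaration order. The helper `first_token` is hypothetical.

```python
import re

# Subset of the terminal rules, in the order they are declared in the grammar.
TERMINALS = [
    ("VAR", re.compile(r"[vV][aA][rR]")),
    ("CHARACTER", re.compile(r"[a-zA-Z]")),
    ("DIGIT", re.compile(r"[0-9]")),
    ("LONG_IDENTIFIER", re.compile(r"[a-zA-Z](?:_?[a-zA-Z0-9])*")),
]


def first_token(text):
    """Return (rule, lexeme) for the first token under the assumed model."""
    best = None
    for name, pattern in TERMINALS:
        match = pattern.match(text)
        # Strictly longer matches replace the current best; on a tie the
        # earlier-declared rule is kept.
        if match and (best is None or match.end() > len(best[1])):
            best = (name, match.group())
    return best


for sample in ("v", "var", "varray", "v2", "invar"):
    print(sample, first_token(sample))
```

Under this model a lone letter is always reported as CHARACTER, never as LONG_IDENTIFIER, which is why the IDENTIFIER data type rule has to accept both.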
