Eclipse Community Forums
how to deal with a non context free lexing issue [message #641325] Thu, 25 November 2010 01:47
Henrik Lindberg
Hi,
I am investigating how to support an existing language with an Xtext
based implementation (for tooling).

Unfortunately, the language design is not completely context free - it
supports literal regular expressions that clash with other
terminals - though they can only appear in certain contexts.

The regular expression is written within / / requiring a \ before a / if
an (otherwise terminating) slash should be included in the regex.

This clashes with the multiline comments /* */ and single line comments
(# ... to end of line).

How do you recommend handling an issue like this?

Regards
- henrik
Re: how to deal with a non context free lexing issue [message #641371 is a reply to message #641325] Thu, 25 November 2010 08:53
Meinte Boersma
I've used a terminal definition of
terminal PATTERN_STRING: '/' ( ( '\\' '/' ) | ( !( '/' ) ) )* '/';

in combination with
TerminalRule: 'terminal' name=ID ':' regexp=PATTERN_STRING;

and that didn't give me any problems whatsoever. (And yes, this looks a bit like a grammar for Xtext ;))


Re: how to deal with a non context free lexing issue [message #641440 is a reply to message #641371] Thu, 25 November 2010 13:26
Henrik Lindberg
I knew that :)
The issue is to also handle division, strings, and multi-line (ml) and
single-line (sl) comments.

If pattern is a terminal it will eat the "/ 2 /" in:
10 / 2 / 5
(which is two divisions producing 1)

If pattern is not a terminal, the sl comment rule will eat half the
regex:
/abc#de/

If sl comment is included in a non terminal pattern, non commented
tokens after the regex are eaten:
matches(/a#/, "I am eaten by sl terminal")

And so on...
- henrik
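
For readers following along, the first clash can be reproduced outside Xtext. The sketch below is only an approximation: it translates the PATTERN_STRING terminal into a java.util.regex pattern (the class and method names are made up for illustration) and shows what a context-unaware matcher claims in the division example. A real ANTLR lexer picks tokens differently in detail, so treat this as a demonstration of the ambiguity rather than of the generated lexer's exact behaviour.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Xtext or ANTLR.
class GreedyPatternDemo {
    // Regex counterpart of: '/' ( ( '\\' '/' ) | ( !( '/' ) ) )* '/'
    static final Pattern PATTERN_STRING = Pattern.compile("/(?:\\\\/|[^/])*/");

    /** Returns the first substring the pattern would claim, or null if there is none. */
    static String firstMatch(String input) {
        Matcher m = PATTERN_STRING.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The two division operators are swallowed as one "regex".
        System.out.println(firstMatch("10 / 2 / 5")); // prints "/ 2 /"
    }
}
```

The pattern has no way of knowing that a number precedes the first slash, which is exactly the context the lexer is missing.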

Re: how to deal with a non context free lexing issue [message #641512 is a reply to message #641440] Thu, 25 November 2010 18:20
Henrik Lindberg
So,
How can I make this work with Xtext?

- henrik
Re: how to deal with a non context free lexing issue [message #641529 is a reply to message #641512] Thu, 25 November 2010 20:58
Knut Wannheden
Hi Henrik

Maybe you can find an ANTLR grammar solving the same problem (JavaScript may be a good candidate) using syntactic predicates. You could then add an Xpand post processor to modify the Xtext-generated ANTLR grammar as required before it's written to disk.

This is of course a bit of a hack :)

Cheers,

--knut
Re: how to deal with a non context free lexing issue [message #641539 is a reply to message #641529] Thu, 25 November 2010 23:30
Henrik Lindberg
Thanks Knut,
I started down that path - learning about antlr syntactic predicates now...

- henrik

Re: how to deal with a non context free lexing issue [message #641547 is a reply to message #641529] Fri, 26 November 2010 02:02
Henrik Lindberg
Found an EcmaScript antlr grammar here:
http://www.antlr.org/grammar/1153976512034/ecmascriptA3.g

It is interesting to see that the solution seems to be to use very
simple terminals and define the notion of strings, single and multi line
comments etc. as parser rules.

Also, with "simple terminals" the grammar uses a 'positive' list of all
possible "letters" etc. instead of using the not-operator.

My initial instinct was to try to find some way to tell the lexer to
shift its "input mode" (essentially forget certain terminal rules
depending on context) - but after reading both the EcmaScript grammar
and (the relevant parts of) the ANTLR book, I concluded that this is an
idea that I should abandon.

Any thoughts on my reflections?
- henrik

Re: how to deal with a non context free lexing issue [message #644103 is a reply to message #641547] Thu, 09 December 2010 14:10
Jonathan
Hello,

I had the same problem as you. To solve it, I just turned off regexp recognition where there cannot be any (after an ID, for instance). I did that manually in my lexer (I inherit from the InternalLexer in src-gen). This is the solution proposed in the ES3.g grammar on the ANTLR website.
Consequently, my regular expressions are scanned by the lexer, with this rule:

terminal REGEXP:
    '/'
    (
        !('/'|'\\'|'\n'|'\r'|'*') | ('\\' !('\n'|'\r'))
    )
    (
        !('/'|'\\'|'\n'|'\r') | ('\\' !('\n'|'\r'))
    )*
    '/'
    ('a'..'z' | 'A'..'Z' | '0'..'9')*
;
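
To get a feel for what this terminal accepts, the rule can be transcribed into a java.util.regex pattern and tried on a few inputs. This is a hand translation made for illustration (the class and method names are hypothetical), and it checks whole strings only; it says nothing about how the generated lexer resolves the clash with '/' as division.

```java
import java.util.regex.Pattern;

// Hypothetical helper; a hand translation of the REGEXP terminal above.
class RegexpTerminalDemo {
    static final Pattern REGEXP = Pattern.compile(
        "/"
        + "(?:[^/\\\\\\n\\r*]|\\\\[^\\n\\r])"   // first char: not '/', '\', newline or '*', or an escape
        + "(?:[^/\\\\\\n\\r]|\\\\[^\\n\\r])*"   // remaining chars: same, but '*' is allowed
        + "/"
        + "[a-zA-Z0-9]*");                      // trailing flags

    /** True if the whole string is one REGEXP literal according to the pattern. */
    static boolean isRegexpLiteral(String s) {
        return REGEXP.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isRegexpLiteral("/abc#de/"));  // '#' is fine inside the literal
        System.out.println(isRegexpLiteral("/* c */"));   // starts like a ml comment, rejected
    }
}
```

Excluding '*' as the first character is what keeps the rule from competing with the opening of a /* */ comment.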

Cheers,

Jonathan
Re: how to deal with a non context free lexing issue [message #644187 is a reply to message #644103] Thu, 09 December 2010 19:26
Henrik Lindberg
Thanks for the info. I don't know if my rule for turning it on/off is as
simple as "following an ID", but it is certainly worth looking into.

I have not looked deep into how to manually override something like this
in the lexer (as I took a different approach) - is your code available
somewhere?

Regards
- henrik

Re: how to deal with a non context free lexing issue [message #647108 is a reply to message #644187] Tue, 04 January 2011 16:12
Jonathan
The code is huge because I have to copy the whole mTokens() method from the lexer, as rule-specific methods are marked 'final'. To summarize, I created a subclass of the internal lexer (myPackage.parser.antlr.internal.InternalMyDslLexer.java) and I wrote :

    import org.antlr.runtime.CharStream;
    import org.antlr.runtime.Token;

    // Subclass of the generated lexer; the T.. constants are grammar-specific.
    public class MyDslLexer extends InternalMyDslLexer {

        public MyDslLexer() {
            super();
        }

        public MyDslLexer(CharStream input) {
            super(input);
        }

        // Last non-hidden token returned by nextToken()
        private Token last;

        // A regex literal cannot directly follow a token that ends an expression.
        private final boolean areRegexEnabled() {
            if (last == null)
                return true;

            switch (last.getType()) {

            // identifier
            case RULE_ID:

            // literals
            case T45: // 'this'
            case T89: // 'true'
            case T90: // 'false'
            case T91: // 'null'
            case RULE_NUMBER:
            case RULE_HEX_NUMBER:
            case RULE_STRING:

            // member access ending
            case T21: // ']'

            // function call or nested expression ending
            case T15: // ')'
                return false;

            // otherwise OK
            default:
                return true;
            }
        }

        @Override
        public Token nextToken() {
            Token result = super.nextToken();
            // Whitespace and comments must not change the decision
            if (!isHiddenToken(result))
                last = result;
            return result;
        }

        public boolean isHiddenToken(Token t) {
            int type = t.getType();
            return type == RULE_WS || type == RULE_ML_COMMENT
                    || type == RULE_SL_COMMENT;
        }
    }


The constraint is to know which number identifies which token (this can be found in myPackage.parser.antlr.internal.InternalMyDsl.tokens). So this hack must be updated at each grammar modification...
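
One way to soften that maintenance burden might be to look the numbers up at startup instead of hard-coding them. The sketch below assumes the generated .tokens file consists of lines of the form NAME=number or 'literal'=number; that format, the sample content in the usage example, and the names TokenTypeLookup and parse are assumptions for illustration, not something taken from the posts above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; assumes one "key=number" entry per line.
class TokenTypeLookup {
    /** Maps each key (rule name or quoted literal) to its token type number. */
    static Map<String, Integer> parse(String tokensFileContent) {
        Map<String, Integer> types = new HashMap<>();
        for (String line : tokensFileContent.split("\\R")) {
            line = line.trim();
            // Use the LAST '=' so that a quoted literal such as '=' still parses.
            int eq = line.lastIndexOf('=');
            if (eq <= 0)
                continue; // blank or malformed line
            types.put(line.substring(0, eq), Integer.parseInt(line.substring(eq + 1)));
        }
        return types;
    }

    public static void main(String[] args) {
        // Made-up sample content, not a real generated file.
        Map<String, Integer> types = parse("RULE_ID=4\n'this'=45\n'/'=69\n");
        System.out.println(types.get("'/'"));
    }
}
```

With such a map the switch in areRegexEnabled() would have to become a set lookup, since case labels need compile-time constants.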

Then, you override mTokens(). Every time the ambiguous rule is called, use the methods above to decide which rule to call. I wrote :

    if (areRegexEnabled())
        mRULE_REGEX(); // rule for regular expressions
    else
        mT69(); // rule for divide operator '/'


I know this is not a proper way to do that, but it works...
When I have more time, I will write an Xpand postprocessor to modify the generated lexer automatically, as suggested in another post. So I would just have to change mRULE_REGEX() instead of mTokens().
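
The decision logic above can be tried in isolation. The following is a minimal, self-contained sketch of the same idea (remember the last non-hidden token and use it to decide what a '/' means). It is a toy hand-written tokenizer, not the Xtext or ANTLR generated lexer, and all names in it (SlashAwareLexer, Kind, regexAllowedAfter, regexEnd) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy tokenizer illustrating the "last significant token" trick.
class SlashAwareLexer {
    enum Kind { ID, NUMBER, LPAREN, RPAREN, COMMA, DIV, REGEX }

    // After a token that can end an expression, '/' must be division.
    static boolean regexAllowedAfter(Kind last) {
        if (last == null)
            return true;
        switch (last) {
            case ID: case NUMBER: case RPAREN:
                return false;
            default:
                return true;
        }
    }

    // Index just past the closing '/', or -1 if the line ends first.
    static int regexEnd(String in, int start) {
        for (int i = start + 1; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c == '\\') { i++; continue; }          // skip the escaped character
            if (c == '\n' || c == '\r') return -1;
            if (c == '/') return i + 1;
        }
        return -1;
    }

    static List<String> tokenize(String in) {
        List<String> out = new ArrayList<>();
        Kind last = null;
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; } // hidden: 'last' is untouched
            int start = i;
            Kind kind;
            if (Character.isDigit(c)) {
                while (i < in.length() && Character.isDigit(in.charAt(i))) i++;
                kind = Kind.NUMBER;
            } else if (Character.isLetter(c)) {
                while (i < in.length() && Character.isLetterOrDigit(in.charAt(i))) i++;
                kind = Kind.ID;
            } else if (c == '(') { i++; kind = Kind.LPAREN;
            } else if (c == ')') { i++; kind = Kind.RPAREN;
            } else if (c == ',') { i++; kind = Kind.COMMA;
            } else if (c == '/') {
                int end = regexAllowedAfter(last) ? regexEnd(in, i) : -1;
                if (end >= 0) { i = end; kind = Kind.REGEX; }
                else { i++; kind = Kind.DIV; }          // fall back to division
            } else {
                throw new IllegalArgumentException("unexpected '" + c + "' at " + i);
            }
            out.add(kind + ":" + in.substring(start, i));
            last = kind;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("10 / 2 / 5"));
        System.out.println(tokenize("matches(/a#/, x)"));
    }
}
```

In this sketch "10 / 2 / 5" yields two DIV tokens because each slash follows a NUMBER, while the slash after "(" starts a REGEX token. The set of tokens that disable regexes is deliberately tiny here; a real grammar needs the fuller list shown in areRegexEnabled().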

Cheers,

[Updated on: Tue, 04 January 2011 16:14]
