Eclipse Community Forums: TMF (Xtext) » how to deal with a non context free lexing issue

how to deal with a non context free lexing issue [message #641325]

Wed, 24 November 2010 20:47

Eclipse User

Hi,
I am investigating how to support an existing language with an Xtext
based implementation (for tooling).

Unfortunately, the language's lexical structure is not context free - it
supports literal regular expressions that clash with other terminals.
They can, however, only appear in certain contexts.

The regular expression is written within / / requiring a \ before a / if
an (otherwise terminating) slash should be included in the regex.

This clashes with the multiline comments /* */ and single line comments
#...to end of line.

How do you recommend handling an issue like this?

Regards
- henrik

Re: how to deal with a non context free lexing issue [message #641371 is a reply to message #641325]

Thu, 25 November 2010 03:53

Eclipse User

I've used a terminal def. of

terminal PATTERN_STRING: '/' ( ( '\\' '/' ) | ( !( '/' ) ) )* '/';

in combination with

TerminalRule: 'terminal' name=ID ':' regexp=PATTERN_STRING;

and that didn't give me any problems whatsoever. (And yes, this looks a bit like a grammar for Xtext ;))

Re: how to deal with a non context free lexing issue [message #641440 is a reply to message #641371]

Thu, 25 November 2010 08:26

Eclipse User

I knew that :)
The issue is to also handle division, string, ml and sl comments.

If pattern is a terminal it will eat:
10 / 2 / 5
(which is two divisions producing 1)

If pattern is not a terminal, the sl comment rule will eat half the
regex:
/abc#de/

If sl comment is included in a non terminal pattern, non commented
tokens after the regex are eaten:
matches(/a#/, "I am eaten by sl terminal")

And so on...
- henrik
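
The first two clashes described above can be reproduced outside Xtext. The following is only a sketch: it approximates the PATTERN_STRING terminal and a '#' single-line comment terminal with java.util.regex, and the class and method names are made up for illustration. It shows what a lexer that has no knowledge of the parser context would claim as a token.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyPatternDemo {
    // Approximation of: terminal PATTERN_STRING: '/' ( ( '\\' '/' ) | ( !( '/' ) ) )* '/';
    static final Pattern PATTERN_STRING = Pattern.compile("/(?:\\\\/|[^/])*/");

    // Approximation of a single-line comment terminal: '#' up to the end of the line
    static final Pattern SL_COMMENT = Pattern.compile("#[^\\n\\r]*");

    /** Returns the first stretch of input the given terminal would claim, or null. */
    static String firstMatch(Pattern terminal, String input) {
        Matcher m = terminal.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Two divisions are swallowed as one "regex" token.
        System.out.println(firstMatch(PATTERN_STRING, "10 / 2 / 5")); // prints "/ 2 /"
        // The comment terminal claims the tail of a regex literal.
        System.out.println(firstMatch(SL_COMMENT, "/abc#de/"));       // prints "#de/"
    }
}
```

Neither terminal is wrong on its own; the problem is that the choice between them depends on what came before, which a context-free lexer does not track.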


Re: how to deal with a non context free lexing issue [message #641512 is a reply to message #641440]

Thu, 25 November 2010 13:20

Eclipse User

So,
How can I make this work with Xtext?

- henrik

Re: how to deal with a non context free lexing issue [message #641529 is a reply to message #641512]

Thu, 25 November 2010 15:58

Eclipse User

Hi Henrik

Maybe you can find an Antlr grammar solving the same problem (JavaScript may be a good candidate) using syntactic predicates. You could then add an Xpand post processor to modify the Xtext generated Antlr grammar as required before it's written to disk.

This is of course a bit of a hack :)

Cheers,

--knut

Re: how to deal with a non context free lexing issue [message #641539 is a reply to message #641529]

Thu, 25 November 2010 18:30

Eclipse User

Thanks Knut,
I started down that path - learning about antlr syntactic predicates now...

- henrik


Re: how to deal with a non context free lexing issue [message #641547 is a reply to message #641529]

Thu, 25 November 2010 21:02

Eclipse User

Found an EcmaScript antlr grammar here:
http://www.antlr.org/grammar/1153976512034/ecmascriptA3.g

It is interesting to see that the solution seems to be to use very
simple terminals and define the notion of strings, single and multi line
comments etc. as parser rules.

Also, with "simple terminals" the grammar uses a 'positive' list of all
possible "letters" etc. instead of using the not-operator.

My initial instinct was to try to find some way to tell the lexer to
shift its "input mode" (essentially forget certain terminal rules
depending on context) - but after reading both the EcmaScript grammar
and (the relevant parts of) the ANTLR book, I concluded that this is an
idea that I should abandon.

Any thoughts on my reflections?
- henrik


Re: how to deal with a non context free lexing issue [message #644103 is a reply to message #641547]

Thu, 09 December 2010 09:10

Eclipse User

Hello,

I had the same problem as you. To solve it, I just turned off regexp recognition wherever a regexp cannot occur (after an ID, for instance). I did that manually in my lexer (I inherit from the InternalLexer in src-gen). This is the solution proposed in the ES3.g grammar on the ANTLR website.
Consequently, my regular expressions are scanned by the lexer with this rule:

terminal REGEXP:
    '/'
    (
        !('/'|'\\'|'\n'|'\r'|'*') | ('\\' !('\n'|'\r'))
    )
    (
        !('/'|'\\'|'\n'|'\r') | ('\\' !('\n'|'\r'))
    )*
    '/'
    ('a'..'z' | 'A'..'Z' | '0'..'9')*
;

Cheers,

Jonathan
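
To see what the REGEXP terminal above accepts, it can be transcribed into java.util.regex. This is a rough, hand-made translation for experimenting, not something Xtext generates, and the class name is hypothetical. The first body character may not be '*', which keeps the rule from matching the start of a /* */ comment, and at least one body character is required.

```java
import java.util.regex.Pattern;

class RegexpTerminalDemo {
    // First body character: anything but '/', '\', newline, CR or '*', or an escape pair
    static final String FIRST = "(?:[^/\\\\\\n\\r*]|\\\\[^\\n\\r])";
    // Remaining body characters: same as above, but '*' is allowed
    static final String REST = "(?:[^/\\\\\\n\\r]|\\\\[^\\n\\r])*";
    // Closing slash followed by optional flags such as "gi"
    static final Pattern REGEXP = Pattern.compile("/" + FIRST + REST + "/[a-zA-Z0-9]*");

    /** True if the whole text is a single REGEXP token according to the rule above. */
    static boolean isRegexpLiteral(String text) {
        return REGEXP.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isRegexpLiteral("/ab+c/gi")); // prints "true"
        System.out.println(isRegexpLiteral("/*x/"));     // prints "false"
        System.out.println(isRegexpLiteral("//"));       // prints "false"
    }
}
```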

Re: how to deal with a non context free lexing issue [message #644187 is a reply to message #644103]

Thu, 09 December 2010 14:26

Eclipse User

Thanks for the info. I don't know if my rule for turning it on/off is as
simple as "following an ID", but it is certainly worth looking into.

I have not looked deep into how to manually override something like this
in the lexer (as I took a different approach) - is your code available
somewhere?

Regards
- henrik


Re: how to deal with a non context free lexing issue [message #647108 is a reply to message #644187]

Tue, 04 January 2011 11:12

Eclipse User

The code is huge because I have to copy the whole mTokens() method from the generated lexer, as the rule-specific methods are marked 'final'. To summarize, I created a subclass of the internal lexer (myPackage.parser.antlr.internal.InternalMyDslLexer.java) and wrote:

    public MyDslLexer() {
        super();
    }

    public MyDslLexer(CharStream input) {
        super(input);
    }

    // Last non-hidden token returned by nextToken()
    private Token last;

    // A regex literal may only start where a division operator cannot occur
    private final boolean areRegexEnabled() {
        if (last == null)
            return true;

        switch (last.getType()) {

        // identifier
        case RULE_ID:

        // literals
        case T45: // 'this'
        case T89: // 'true'
        case T90: // 'false'
        case T91: // 'null'
        case RULE_NUMBER:
        case RULE_HEX_NUMBER:
        case RULE_STRING:

        // member access ending
        case T21: // ']'

        // function call or nested expression ending
        case T15: // ')'
            return false;

        // otherwise OK
        default:
            return true;
        }
    }

    @Override
    public Token nextToken() {
        Token result = super.nextToken();
        // Whitespace and comments must not change the regex/division decision
        if (!isHiddenToken(result))
            last = result;
        return result;
    }

    public boolean isHiddenToken(Token t) {
        int type = t.getType();
        return type == RULE_WS || type == RULE_ML_COMMENT
                || type == RULE_SL_COMMENT;
    }

The catch is that you need to know which number identifies which token (this can be found in myPackage.parser.antlr.internal.InternalMyDsl.tokens), so this hack must be updated after each grammar modification...
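
One way to make the token numbers less fragile might be to read them from the .tokens file instead of hard-coding T45, T89 and so on. The sketch below assumes the file consists of name=number lines (for example RULE_ID=4 or 'this'=45), which is how ANTLR 3 usually writes it; please check the file your generator actually produces. The class name and the sample content are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class TokenTypeLookup {
    /** Parses name=number lines into a map; lines that do not fit are skipped. */
    static Map<String, Integer> parse(String tokensFileContent) {
        Map<String, Integer> types = new HashMap<>();
        for (String line : tokensFileContent.split("\\R")) {
            // Use the last '=' so that a literal such as '=' can itself be a name
            int eq = line.lastIndexOf('=');
            if (eq <= 0)
                continue;
            String name = line.substring(0, eq).trim();
            try {
                types.put(name, Integer.parseInt(line.substring(eq + 1).trim()));
            } catch (NumberFormatException e) {
                // not a token definition, ignore
            }
        }
        return types;
    }

    public static void main(String[] args) {
        // Hypothetical sample content, not taken from a real generated file
        String sample = "RULE_ID=4\n'this'=45\n'/'=69\n";
        System.out.println(parse(sample).get("'this'")); // prints "45"
    }
}
```

Since case labels must be compile-time constants, values looked up this way cannot be used in the switch shown above; the check would have to be rewritten, for instance against a Set<Integer> built once at startup.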

Then you override mTokens(). Wherever the ambiguous rule would be called, use the methods above to decide which rule to call instead. I wrote:

    if (areRegexEnabled())
        mRULE_REGEX(); // rule for regular expressions
    else
        mT69(); // rule for divide operator '/'

I know this is not the proper way to do it, but it works...
When I have more time, I will write an Xpand postprocessor to modify the generated lexer automatically, as suggested in another post. Then I would only have to change mRULE_REGEX() instead of mTokens().

Cheers,
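
For readers who want to try the idea without a generated Xtext lexer, here is a minimal stand-alone sketch of the same technique: remember the last non-hidden token and use it to decide whether '/' starts a regex or is a division. It is a toy hand-written lexer, not ANTLR or Xtext code; all names (ContextLexerSketch, Kind, lex) are hypothetical, and the token rules are simplified assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContextLexerSketch {
    enum Kind { WS, SL_COMMENT, ID, NUMBER, STRING, REGEX, DIVIDE, OTHER }

    static final Pattern WS = Pattern.compile("\\s+");
    static final Pattern SL_COMMENT = Pattern.compile("#[^\\n\\r]*");
    static final Pattern ID = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");
    static final Pattern NUMBER = Pattern.compile("[0-9]+");
    static final Pattern STRING = Pattern.compile("\"[^\"]*\"");
    static final Pattern REGEX = Pattern.compile("/(?:\\\\.|[^/\\\\\\n\\r])+/");

    // Same decision as areRegexEnabled(): no regex right after an operand or a closing bracket
    static boolean regexAllowedAfter(Kind last, String lastText) {
        if (last == null) return true;
        switch (last) {
        case ID: case NUMBER: case STRING:
            return false;
        case OTHER:
            return !(lastText.equals(")") || lastText.equals("]"));
        default:
            return true;
        }
    }

    // Tries to match the pattern exactly at position pos
    static Matcher at(Pattern p, String input, int pos) {
        Matcher m = p.matcher(input);
        m.region(pos, input.length());
        return m.lookingAt() ? m : null;
    }

    /** Returns the non-hidden tokens as "KIND:text" strings. */
    static List<String> lex(String input) {
        List<String> out = new ArrayList<>();
        Kind last = null;
        String lastText = null;
        int pos = 0;
        while (pos < input.length()) {
            Matcher m;
            Kind kind;
            if ((m = at(WS, input, pos)) != null) kind = Kind.WS;
            else if ((m = at(SL_COMMENT, input, pos)) != null) kind = Kind.SL_COMMENT;
            else if ((m = at(ID, input, pos)) != null) kind = Kind.ID;
            else if ((m = at(NUMBER, input, pos)) != null) kind = Kind.NUMBER;
            else if ((m = at(STRING, input, pos)) != null) kind = Kind.STRING;
            else if (input.charAt(pos) == '/') {
                // The context-dependent choice between the two '/' rules
                m = regexAllowedAfter(last, lastText) ? at(REGEX, input, pos) : null;
                kind = (m != null) ? Kind.REGEX : Kind.DIVIDE;
            } else kind = Kind.OTHER;
            String text = (m != null) ? m.group() : input.substring(pos, pos + 1);
            pos += text.length();
            if (kind != Kind.WS && kind != Kind.SL_COMMENT) { // hidden tokens do not update 'last'
                out.add(kind + ":" + text);
                last = kind;
                lastText = text;
            }
        }
        return out;
    }
}
```

With this sketch, lex("10 / 2 / 5") yields two DIVIDE tokens, while lex("matches(/a#/, \"x\") # note") yields a single REGEX token for /a#/ and hides the trailing comment.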

[Updated on: Tue, 04 January 2011 11:14] by Moderator
