Eclipse Community Forums: TMF (Xtext) » How to define a greedy version of Xtext until (->)?

How to define a greedy version of Xtext until (->)? [message #1781919]

Wed, 14 February 2018 15:57

Nicolas Rouquette

Xtext's until operator is useful for consuming everything between two tokens, for example, Scala-like raw strings:

terminal RAW_STRING_VALUE returns RawStringDataType: '"""' -> '"""';

Unfortunately, this is a non-greedy lexer rule, as the generated ANTLR rule shows:

RULE_RAW_STRING_VALUE : '"""' ( options {greedy=false;} : . )*'"""';

The generated lexer Java code is non-greedy as well:

    // $ANTLR start "RULE_RAW_STRING_VALUE"
    public final void mRULE_RAW_STRING_VALUE() throws RecognitionException {
        try {
            int _type = RULE_RAW_STRING_VALUE;
            int _channel = DEFAULT_TOKEN_CHANNEL;
            // InternalOML.g:9081:23: ( '\"\"\"' ( options {greedy=false; } : . )* '\"\"\"' )
            // InternalOML.g:9081:25: '\"\"\"' ( options {greedy=false; } : . )* '\"\"\"'
            {
            match("\"\"\""); 

            // InternalOML.g:9081:31: ( options {greedy=false; } : . )*
            loop36:
            do {
                int alt36=2;
                int LA36_0 = input.LA(1);

                if ( (LA36_0=='\"') ) {
                    int LA36_1 = input.LA(2);

                    if ( (LA36_1=='\"') ) {
                        int LA36_3 = input.LA(3);

                        if ( (LA36_3=='\"') ) {
                            alt36=2;
                        }
                        else if ( ((LA36_3>='\u0000' && LA36_3<='!')||(LA36_3>='#' && LA36_3<='\uFFFF')) ) {
                            alt36=1;
                        }


                    }
                    else if ( ((LA36_1>='\u0000' && LA36_1<='!')||(LA36_1>='#' && LA36_1<='\uFFFF')) ) {
                        alt36=1;
                    }


                }
                else if ( ((LA36_0>='\u0000' && LA36_0<='!')||(LA36_0>='#' && LA36_0<='\uFFFF')) ) {
                    alt36=1;
                }


                switch (alt36) {
            	case 1 :
            	    // InternalOML.g:9081:59: .
            	    {
            	    matchAny(); 

            	    }
            	    break;

            	default :
            	    break loop36;
                }
            } while (true);

            match("\"\"\""); 


            }

            state.type = _type;
            state.channel = _channel;
        }
        finally {
        }
    }
    // $ANTLR end "RULE_RAW_STRING_VALUE"

Non-greedy is unfortunately not what's intuitively expected.
For example, the following ought to be legal raw strings:

"""1""""
"""2"""""
"""3""""""
"""4"""""""

These ought to lex in such a way that the contents of the raw strings are:

1"
2""
3"""
4""""

That's not what happens; in fact, the whole thing fails to lex properly.
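
To illustrate the non-greedy behaviour outside of the lexer, here is a minimal sketch that approximates the generated rule with a reluctant regular expression. The class name `NonGreedyDemo` and the `leftover` helper are made up for this illustration, and the regex only mimics the "stop at the first closing delimiter" behaviour; it is not the generated lexer itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonGreedyDemo {
    // Reluctant (.*?) stops at the FIRST closing """, similar in spirit to
    // ( options {greedy=false;} : . )* in the generated rule (an approximation).
    static final Pattern NON_GREEDY = Pattern.compile("\"{3}(.*?)\"{3}", Pattern.DOTALL);

    /** Returns the characters left unconsumed after the match, or null if the input does not match at its start. */
    static String leftover(String input) {
        Matcher m = NON_GREEDY.matcher(input);
        return m.lookingAt() ? input.substring(m.end()) : null;
    }

    public static void main(String[] args) {
        String[] samples = { "\"\"\"1\"\"\"\"", "\"\"\"4\"\"\"\"\"\"\"" };
        for (String s : samples) {
            // The trailing quotes are not part of the token and remain in the input.
            System.out.println(s + " leaves [" + leftover(s) + "]");
        }
    }
}
```

The quotes left over after the first closing delimiter then have to be lexed as something else, which is presumably why the examples above are rejected.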
I would like a greedy version of the until operator, perhaps something like this:

terminal RAW_STRING_VALUE returns RawStringDataType: '"""' ->* '"""';

The idea would be to generate a greedy version of the Antlr grammar rule
somehow such that the lexer java code produced would be functionally
equivalent to the following:


    // $ANTLR start "RULE_RAW_STRING_VALUE"
    public final void mRULE_RAW_STRING_VALUE() throws RecognitionException {
        try {
            int _type = RULE_RAW_STRING_VALUE;
            int _channel = DEFAULT_TOKEN_CHANNEL;
            // InternalOML.g:9081:23: ( '\"\"\"' ( options {greedy=true; } : . )* '\"\"\"' )
            // InternalOML.g:9081:25: '\"\"\"' ( options {greedy=true; } : . )* '\"\"\"'
            {
            match("\"\"\""); 

            // InternalOML.g:9081:31: ( options {greedy=true; } : . )*
            loop36:
            do {
                int alt36=2;
                int LA36_0 = input.LA(1);

                if ( (LA36_0=='\"') ) {
                    int LA36_1 = input.LA(2);

                    if ( (LA36_1=='\"') ) {
                        int LA36_3 = input.LA(3);

                        if ( (LA36_3=='\"') ) {
                            // greedy version of ->
                            int LA36_4 = input.LA(4);
                            if ( CharStream.EOF == LA36_4 || LA36_4 != '\"') {
                                alt36=2;
                            } else {
                                alt36=1;
                            }
                        }
                        else if ( ((LA36_3>='\u0000' && LA36_3<='!')||(LA36_3>='#' && LA36_3<='\uFFFF')) ) {
                            alt36=1;
                        }


                    }
                    else if ( ((LA36_1>='\u0000' && LA36_1<='!')||(LA36_1>='#' && LA36_1<='\uFFFF')) ) {
                        alt36=1;
                    }


                }
                else if ( ((LA36_0>='\u0000' && LA36_0<='!')||(LA36_0>='#' && LA36_0<='\uFFFF')) ) {
                    alt36=1;
                }


                switch (alt36) {
            	case 1 :
            	    // InternalOML.g:9081:59: .
            	    {
            	    matchAny(); 

            	    }
            	    break;

            	default :
            	    break loop36;
                }
            } while (true);

            match("\"\"\""); 


            }

            state.type = _type;
            state.channel = _channel;
        }
        finally {
        }
    }
    // $ANTLR end "RULE_RAW_STRING_VALUE"
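
The lookahead decision above can also be sketched as a standalone scanner, independent of ANTLR. This is only an illustration under my own naming (`GreedyRawString`, `content`); it assumes the token starts at index 0 and applies the same test: three quotes close the string only if a fourth quote does not follow.

```java
final class GreedyRawString {
    /**
     * Returns the content of the raw string that starts at index 0 of input,
     * or null if input does not start with a complete raw string.
     */
    static String content(String input) {
        if (!input.startsWith("\"\"\"")) {
            return null;
        }
        for (int i = 3; i < input.length(); i++) {
            if (input.startsWith("\"\"\"", i)) {
                // Same test as the LA(4) check: exit only if no fourth quote follows.
                boolean fourthIsQuote = i + 3 < input.length() && input.charAt(i + 3) == '"';
                if (!fourthIsQuote) {
                    return input.substring(3, i);
                }
            }
        }
        return null; // no closing delimiter found
    }

    public static void main(String[] args) {
        System.out.println(content("\"\"\"1\"\"\"\""));          // prints 1"
        System.out.println(content("\"\"\"2\"\"\"\"\""));        // prints 2""
        System.out.println(content("\"\"\"3\"\"\"\"\"\""));      // prints 3"""
        System.out.println(content("\"\"\"4\"\"\"\"\"\"\""));    // prints 4""""
    }
}
```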

This problem isn't unique to my language.
In fact, it affects Xtend as shown below:

// Xtend code fails to parse!
class raw {
	
  static val foo1 = '''1''''
  static val foo2 = '''2'''''
  static val foo3 = '''3''''''
  static val foo4 = '''4'''''''
	
  def static void main(String[] args) {
  	println(foo1)
  	println(foo2)
  	println(foo3)
  	println(foo4)
  }
}

Scala has a similar syntax for raw strings; however, unlike Xtend's non-greedy raw strings, Scala's raw strings are greedy, e.g.:

object raw {

  val foo1 = """1""""
  val foo2 = """2"""""
  val foo3 = """3""""""
  val foo4 = """4"""""""

  def main(args: Array[String]): Unit = {
    System.out.println(foo1)
    System.out.println(foo2)
    System.out.println(foo3)
    System.out.println(foo4)
  }
}

When run, this produces:

1"
2""
3"""
4""""

Unless I missed something, neither Xtext 2.12 nor 2.13 provides a way to define a greedy until lexer rule as shown above. Am I correct about this?

If so, is it reasonable to ask for a new feature to support greedy until in Xtext?

- Nicolas.


Re: How to define a greedy version of Xtext until (->)? [message #1781921 is a reply to message #1781919]

Wed, 14 February 2018 16:08

Christian Dietrich

Sounds like a use case for a manually written lexer, e.g. https://typefox.io/taming-the-lexer

Twitter : @chrdietrich
Blog : https://www.dietrich-it.de


Re: How to define a greedy version of Xtext until (->)? [message #1781926 is a reply to message #1781921]

Wed, 14 February 2018 16:32

Christian Dietrich

Or you can try to tame the lexer generation:

Workflow {
	
	component = XtextGenerator {
		configuration = CustomGeneratorModule {
			project = StandardProjectConfig {
......
		language = StandardLanguage {
			name = "org.xtext.example.mydsl2.MyDsl"
			fileExtensions = "mydsl2"

			serializer = {
				generateStub = false
			}
			validator = {
				// composedCheck = "org.eclipse.xtext.validation.NamesAreUniqueValidator"
			}
			parserGenerator = {
			    combinedGrammar = false
			}.....

import org.eclipse.xtext.xtext.generator.DefaultGeneratorModule;
import org.eclipse.xtext.xtext.generator.parser.antlr.AntlrGrammarGenerator;

public class CustomGeneratorModule extends DefaultGeneratorModule {
	
	public Class<? extends AntlrGrammarGenerator> bindAntlrGrammarGenerator() {
		return AntlrGrammarGenerator2.class;
	}

}

import org.eclipse.xtext.Grammar
import org.eclipse.xtext.xtext.generator.parser.antlr.AntlrGrammarGenerator
import org.eclipse.xtext.xtext.generator.parser.antlr.AntlrOptions

class AntlrGrammarGenerator2 extends AntlrGrammarGenerator {
    override protected CharSequence compileLexer(Grammar it, AntlrOptions options) {
        '''
        // custom
        «super.compileLexer(it, options)»
        '''
    }
}
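
As a rough sketch of what such an override could do with the text returned by `super.compileLexer(it, options)`, the helper below swaps one rule definition for a hand-written one. Everything here is an assumption for illustration: the class `LexerRulePatcher` and method `replaceRule` are hypothetical, the rule is assumed to sit on a single line in the generated grammar (as in the rule quoted in the first message), and what the replacement rule should look like is left open.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LexerRulePatcher {
    /**
     * Replaces the single-line definition of ruleName in grammarText with newRule.
     * Returns grammarText unchanged if no such line is found.
     */
    static String replaceRule(String grammarText, String ruleName, String newRule) {
        // (?m): ^ and $ match at line boundaries; '.' does not cross line breaks,
        // so ".*;" extends to the last ';' on the rule's own line.
        Pattern rule = Pattern.compile("(?m)^" + Pattern.quote(ruleName) + "\\s*:.*;[ \\t]*$");
        return rule.matcher(grammarText).replaceFirst(Matcher.quoteReplacement(newRule));
    }

    public static void main(String[] args) {
        String grammar =
            "RULE_RAW_STRING_VALUE : '\"\"\"' ( options {greedy=false;} : . )*'\"\"\"';\n"
            + "RULE_WS : ' '+;\n";
        // The replacement body is a placeholder, not a tested ANTLR rule.
        System.out.print(replaceRule(grammar, "RULE_RAW_STRING_VALUE",
            "RULE_RAW_STRING_VALUE : /* hand-written rule goes here */ ;"));
    }
}
```

A purely textual patch like this is brittle if the generator changes how it lays out rules, so it would need checking against the actual generated grammar.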



Re: How to define a greedy version of Xtext until (->)? [message #1781927 is a reply to message #1781921]

Wed, 14 February 2018 16:34

Nicolas Rouquette

Thanks for the suggestion Christian! A better lexer machinery is indeed something I had thought of.
