Eclipse Community Forums: TMF (Xtext) » Parser rule for floating points with scientific notation

Parser rule for floating points with scientific notation [message #1856600]

Fri, 16 December 2022 17:41

Simon Cockx

Messages: 69
Registered: October 2021

Member

I have been struggling to get my parser rule for scientific numbers exactly right.

It should support numbers such as

3.14
-3.14
+3.14
3.
.14
3.14e5
3.14E5
3.14e+5
3.14e-5

Extra conditions:
A) It should not allow spaces at any point (i.e., 3. 14 is not valid).

B) Strings such as E3 and e should be valid identifiers, i.e., they should not conflict with this rule in some way.

C) It should not conflict with a rule for integer ranges of the form '(' INT '..' INT ')'. Example: (5..42)
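The accepted number forms can be pinned down as a plain regular expression, which is handy as a reference when checking a grammar against the examples. This is only a sketch: the pattern is my reading of the list above together with condition A, and the names SCIENTIFIC_FLOAT and is_scientific_float are made up for illustration.

```python
import re

# Assumed reading of the requirements: optional sign, digits with a mandatory
# '.', then an optional exponent. Whitespace is not allowed anywhere.
SCIENTIFIC_FLOAT = re.compile(r"[+-]?(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?")

def is_scientific_float(text: str) -> bool:
    """True if the whole string is a float literal in the sense of the list above."""
    return SCIENTIFIC_FLOAT.fullmatch(text) is not None

# The examples from the post, plus strings that must not be floats.
VALID = ["3.14", "-3.14", "+3.14", "3.", ".14",
         "3.14e5", "3.14E5", "3.14e+5", "3.14e-5"]
INVALID = ["3. 14", "E3", "e", "5", "."]
```

Every entry of VALID matches and no entry of INVALID does, so E3 and e stay free for use as identifiers at this level. The hard part is getting a generated lexer to agree with that.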

Attempt 1:

ScientificFloat hidden():
	('+' | '-')? ('.' INT | INT '.' | INT '.' INT) (('e' | 'E') ('+' | '-')? INT)?
;
==>Failing case: 3.14e5
I think 'e5' is lexed as a single ID token instead of the keyword 'e' followed by an INT, hence it fails.
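The suspected cause can be reproduced with a toy tokenizer. This is a simplified model, not the ANTLR lexer that Xtext generates: it assumes the longest match wins and that keywords beat terminal rules on a tie, and the ID and INT patterns only approximate the default Xtext terminals.

```python
import re

# Simplified model of the lexer for attempt 1 (assumption: longest match wins,
# and on a tie the earlier rule, here the keywords, is kept).
RULES = [
    ("KEYWORD", re.compile(r"\.\.|[.+\-()]|[eE]")),
    ("ID", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
    ("INT", re.compile(r"[0-9]+")),
]

def tokenize(text):
    """Split text into (rule name, lexeme) pairs; whitespace is not handled."""
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no viable alternative at {text[pos]!r}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens
```

In this model tokenize("3.14e5") ends with ("ID", "e5"), which the parser rule cannot use, while tokenize("3.14e+5") ends with the keyword 'e', the keyword '+' and an INT, which it can.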

Attempt 2:

ScientificFloat hidden():
	('+' | '-')? ('.' INT | INT '.' | INT '.' INT) SCIENTIFIC?
;

terminal SCIENTIFIC:
	('e' | 'E') ('+' | '-')? INT
;
==>Condition B fails: E3 is not a valid identifier because the lexer behaves differently.
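A small model shows why E3 stops being an identifier here. It assumes that when two terminals match the same text, the one declared first wins, with SCIENTIFIC placed before the ID terminal. The function name first_token is hypothetical.

```python
import re

# Terminals in declaration order. SCIENTIFIC is the rule from attempt 2; ID
# approximates the usual Xtext ID terminal. Tie-breaking by order is an assumption.
TERMINALS = [
    ("SCIENTIFIC", re.compile(r"[eE][+-]?[0-9]+")),
    ("ID", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
]

def first_token(text):
    """Return (rule name, lexeme) for the token that starts at offset 0."""
    best = None
    for name, pattern in TERMINALS:
        m = pattern.match(text)
        # A later rule only wins if its match is strictly longer.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best
```

Under these assumptions first_token("E3") yields SCIENTIFIC, whereas "e" and "E3x" still come out as ID. Only names of the shape e or E followed by digits are lost.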

Attempt 3:

terminal SCIENTIFIC_FLOAT:
	('+' | '-')? ('.' INT | INT '.' | INT '.' INT) (('e' | 'E') ('+' | '-')? INT)?
;
==>Condition C fails: the integer range (5..42) is not valid anymore.

I'm out of guesses... Is there any way I can get it to behave exactly as I want?

[Updated on: Fri, 16 December 2022 17:42]

Re: Parser rule for floating points with scientific notation [message #1856602 is a reply to message #1856600]

Fri, 16 December 2022 19:18

Christian Dietrich

Messages: 14665
Registered: July 2009

Senior Member

Maybe you can check what others have done, e.g. https://github.com/eclipse/n4js/blob/master/plugins/org.eclipse.n4js/src/org/eclipse/n4js/TypeExpressions.xtext

Twitter : @chrdietrich
Blog : https://www.dietrich-it.de

Re: Parser rule for floating points with scientific notation [message #1856603 is a reply to message #1856602]

Fri, 16 December 2022 20:20

Ed Willink

Messages: 7655
Registered: July 2009

Senior Member

Hi

What you report should be no problem since '..' is a distinct token and so should be resolved by the lexer. If you really care about gratuitous spaces, you may need to play games with hiding.

But more likely you have the same problem as OCL where "." is also a binary navigation operator. This is a hard to handle syntactic ambiguity, but is relatively easy to handle lexically. The OCL parser therefore inserts a RetokenizingTokenSource between the standard Xtext lexer and parser to resolve the ambiguity lexically and so make the grammar easy.

See https://git.eclipse.org/r/plugins/gitiles/ocl/org.eclipse.ocl/+/refs/heads/master/plugins/org.eclipse.ocl.xtext.base/src/org/eclipse/ocl/xtext/base/services/RetokenizingTokenSource.java
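The retokenizing idea can be sketched as a pass that runs between lexer and parser and merges touching tokens that together spell a float. This is not the OCL implementation, which is Java and works on ANTLR tokens. Token, base_lex and retokenize are hypothetical names, and the base lexer is a stand-in with no float terminal at all.

```python
import re
from typing import List, NamedTuple

class Token(NamedTuple):
    kind: str   # "INT", "ID", "FLOAT", or the keyword text itself
    text: str
    start: int  # input offset, used to tell whether two tokens touch

BASE = re.compile(r"(?P<INT>[0-9]+)|(?P<ID>[a-zA-Z_][a-zA-Z_0-9]*)"
                  r"|(?P<KW>\.\.|[.+\-()])|(?P<WS>\s+)")
FLOAT = re.compile(r"(?:[0-9]+\.[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?")
MAX_PARTS = 6  # longest case: INT '.' INT ID('e') sign INT

def base_lex(text: str) -> List[Token]:
    """Stand-in for the generated lexer: no float terminal, whitespace dropped."""
    tokens, pos = [], 0
    while pos < len(text):
        m = BASE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r}")
        if m.lastgroup != "WS":
            kind = m.group() if m.lastgroup == "KW" else m.lastgroup
            tokens.append(Token(kind, m.group(), pos))
        pos = m.end()
    return tokens

def retokenize(tokens: List[Token]) -> List[Token]:
    """Merge the longest run of touching tokens that spells a float literal."""
    out, i = [], 0
    while i < len(tokens):
        merged = None
        for j in range(min(len(tokens), i + MAX_PARTS), i, -1):
            run = tokens[i:j]
            touching = all(a.start + len(a.text) == b.start
                           for a, b in zip(run, run[1:]))
            joined = "".join(t.text for t in run)
            if touching and FLOAT.fullmatch(joined):
                merged, i = Token("FLOAT", joined, run[0].start), j
                break
        if merged is None:
            merged, i = tokens[i], i + 1
        out.append(merged)
    return out
```

Because '..' is already a single token when the merge pass runs, (5..42) is left alone, while 3.14e5 and 3.14e-5 each collapse into one FLOAT token. The sign is deliberately left to the parser rule.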

Regards

Ed Willink

Re: Parser rule for floating points with scientific notation [message #1856620 is a reply to message #1856600]

Sun, 18 December 2022 16:43

Simon Cockx

Messages: 69
Registered: October 2021

Member

Hi Ed

'..' is actually not resolved correctly by the lexer. In my "attempt 3", the new terminal rule is conflicting with the (INT..INT) rule. Example: with the new terminal rule, a string (5..42) is tokenized as
'(', '5.', '.42', and ')'
instead of
'(', '5', '..', '42', and ')'
so the '..' keyword is never recognized.
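The tokenization described above falls out of a plain longest-match model. The sketch assumes that the longest match wins and that ties go to the rule listed first. The real ANTLR-generated lexer decides differently in detail, so this only illustrates the effect.

```python
import re

# Attempt 3's terminal plus what an integer range needs. Longest match with
# ties going to the earlier rule is an assumed simplification.
RULES = [
    ("SCIENTIFIC_FLOAT", re.compile(
        r"[+-]?(?:[0-9]+\.[0-9]+|[0-9]+\.|\.[0-9]+)(?:[eE][+-]?[0-9]+)?")),
    ("KEYWORD", re.compile(r"\.\.|[()]")),
    ("INT", re.compile(r"[0-9]+")),
]

def lexemes(text):
    """Return the token texts in order; whitespace is not handled."""
    out, pos = [], 0
    while pos < len(text):
        matches = [m for _, p in RULES if (m := p.match(text, pos))]
        if not matches:
            raise ValueError(f"no viable alternative at {text[pos]!r}")
        # max() keeps the first of several equally long matches.
        longest = max(matches, key=lambda m: m.end())
        out.append(longest.group())
        pos = longest.end()
    return out
```

At the '5' the float terminal matches '5.' and beats INT's '5'. At the next character it matches '.42', which is longer than the keyword '..' could ever be, so the keyword never gets a chance.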

Thanks for the link, that looks interesting. I'll take a look at it on Monday.

Regards

Simon

[Updated on: Sun, 18 December 2022 23:05]

Re: Parser rule for floating points with scientific notation [message #1856622 is a reply to message #1856620]

Sun, 18 December 2022 22:53

Simon Cockx

Messages: 69
Registered: October 2021

Member

@Christian Dietrich, thanks for the link. Their code is basically this:

terminal DOUBLE returns ecore::EBigDecimal:
	'.' DECIMAL_DIGIT_FRAGMENT+ EXPONENT_PART?
	| DECIMAL_INTEGER_LITERAL_FRAGMENT '.' DECIMAL_DIGIT_FRAGMENT* EXPONENT_PART?
;

terminal fragment EXPONENT_PART:
	  ('e' | 'E') SIGNED_INT
;

terminal fragment SIGNED_INT:
	('+' | '-') DECIMAL_DIGIT_FRAGMENT+
;

terminal fragment DECIMAL_INTEGER_LITERAL_FRAGMENT:
	'0'
	| '1'..'9' DECIMAL_DIGIT_FRAGMENT*
;
terminal fragment DECIMAL_DIGIT_FRAGMENT:
	'0'..'9'
;

Observations:
1. The sign of their exponent is mandatory, and therefore they do not have the problem that I had in my "attempt 1". I would like to make it optional, just like in Java.
2. Even more important: I think they suffer from the same problem as I have in attempt 3, namely that the '..' keyword in an integer range conflicts with the terminal. The rule makes the lexer tokenize the range differently: instead of lexing (5..42) into '(', '5', '..', '42' and ')', it is now lexed into '(', '5.', '.42' and ')', so the '..' keyword effectively disappears.

So... still the same problem. I wonder how Java does it (although Java has no syntax similar to my integer ranges, so it may simply not need to care about that).

[Updated on: Mon, 19 December 2022 23:21]

Re: Parser rule for floating points with scientific notation [message #1856623 is a reply to message #1856622]

Sun, 18 December 2022 23:18

Simon Cockx

Messages: 69
Registered: October 2021

Member

Damn, I thought I cracked it with this rule:

ScientificFloat hidden():
	('+' | '-')? (
		('.' INT | INT '.' | INT '.' INT)
		| FLOAT_WITH_EXPONENT
	)
;

terminal FLOAT_WITH_EXPONENT:
    ('.' INT | INT '.' | INT '.' INT) ('e' | 'E') ('+' | '-')? ('0'..'9')+
;

Notice that in the terminal rule the exponent is mandatory, so my theory was that the lexer would not tokenize '5..42' into '5.' and '.42'.

But this apparently clashes with my integer ranges again somehow. I'm getting a "no viable alternative at character '..'" error.

I have no idea why.

[Updated on: Sun, 18 December 2022 23:22]

Re: Parser rule for floating points with scientific notation [message #1856624 is a reply to message #1856623]

Mon, 19 December 2022 06:21

Christian Dietrich

Messages: 14665
Registered: July 2009

Senior Member

I can't do this for you. Maybe it's time to play around with a lexer replacement such as JFlex.

Please also note that the order of terminals matters too: '..' is a keyword and thus a terminal as well.

Twitter : @chrdietrich
Blog : https://www.dietrich-it.de

[Updated on: Mon, 19 December 2022 06:33]

Re: Parser rule for floating points with scientific notation [message #1856628 is a reply to message #1856624]

Mon, 19 December 2022 13:49

Ed Willink

Messages: 7655
Registered: July 2009

Senior Member

Hi

It could well be that there is an ANTLR backtracking bug. Certainly I was a bit baffled as to why I couldn't get it working for OCL.

Switching to JFlex might help since its greedy regex should eat the floating point literal in exactly the same way as OCL's RetokenizingTokenSource does.
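What a strictly longest-match scanner would do with the FLOAT_WITH_EXPONENT terminal from earlier in the thread can be sketched as follows. The model takes the longest match and breaks ties by rule order, which is how JFlex resolves competing rules. It makes no claim about the ANTLR-generated lexer, where the thread reports an error for the same terminal. The name scan is hypothetical.

```python
import re

# Rules in priority order. The exponent is mandatory in the float terminal, so
# plain "3.14" is left to the parser as INT '.' INT.
RULES = [
    ("FLOAT_WITH_EXPONENT", re.compile(
        r"(?:[0-9]+\.[0-9]+|[0-9]+\.|\.[0-9]+)[eE][+-]?[0-9]+")),
    ("KEYWORD", re.compile(r"\.\.|[.+\-()]")),
    ("ID", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
    ("INT", re.compile(r"[0-9]+")),
]

def scan(text):
    """Return (rule name, lexeme) pairs; whitespace is not handled."""
    tokens, pos = [], 0
    while pos < len(text):
        candidates = [(name, m) for name, p in RULES if (m := p.match(text, pos))]
        if not candidates:
            raise ValueError(f"no rule matches at offset {pos}")
        # Longest match wins; max() keeps the earlier rule on a tie.
        name, m = max(candidates, key=lambda c: c[1].end())
        tokens.append((name, m.group()))
        pos = m.end()
    return tokens
```

In this model (5..42) keeps its '..' keyword, 3.14e5 becomes a single token, and E3 is still an ID, because the float terminal only ever matches when a complete exponent follows.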

Switching to another underlying technology such as LPG has been on my to-investigate list for a long time; I expect to see a ten-fold improvement in parse speed and an ability for incremental update in the editor. I'm unclear how well an alternative will integrate. You will surely find that another technology's Token is different to Xtext/ANTLR's Token and so you will need to create an Xtext-compatible Token to wrap the JFlex Token. Might well be a week's work, whereas the RetokenizingTokenSource is probably only a day's work.

Regards

Ed Willink

Re: Parser rule for floating points with scientific notation [message #1856631 is a reply to message #1856628]

Mon, 19 December 2022 20:14

Simon Cockx

Messages: 69
Registered: October 2021

Member

Ed, Christian

Thank you for the links and info. It has been really helpful to understand what's going on.

Since I'm not permitted to spend too much time on this, I will make a compromise and disallow using 'e' and 'E' as identifiers for now. Hopefully I can get back to this when I have the opportunity to dive deeper into lexer replacements.

Regards

Simon

[Updated on: Mon, 19 December 2022 23:20]
