Matching Unicode category in terminal rules [message #651440] Sun, 30 January 2011 22:12
Dénes Harmath
Hi all,

my DSL accepts all Unicode letters (i.e. those with the Alphabetic property) in IDs. How can I formulate that as a lexer rule without enumerating all the relevant character ranges spread throughout the Unicode planes? I know regexps aren't possible, but I would like to achieve what the \p{Alpha} regular expression does.
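Editor's note: one workaround for the enumeration problem is to generate the ranges instead of writing them by hand. The following is a minimal sketch, not an Xtext feature. It assumes that the Alphabetic property as reported by the running JDK's `Character.isAlphabetic` is acceptable, that only the Basic Multilingual Plane is needed, and that the target Xtext version accepts Unicode escapes in terminal rules. The class and method names (`UnicodeRanges`, `ranges`, `toTerminalBody`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class UnicodeRanges {

    /** Collects maximal runs [start, end] of code points in [from, to] accepted by the predicate. */
    static List<int[]> ranges(IntPredicate property, int from, int to) {
        List<int[]> result = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a run"
        for (int cp = from; cp <= to; cp++) {
            boolean accepted = property.test(cp);
            if (accepted && start < 0) {
                start = cp;
            } else if (!accepted && start >= 0) {
                result.add(new int[] {start, cp - 1});
                start = -1;
            }
        }
        if (start >= 0) {
            result.add(new int[] {start, to});
        }
        return result;
    }

    /** Renders the runs as alternatives of escaped characters and character ranges. */
    static String toTerminalBody(List<int[]> ranges) {
        StringBuilder sb = new StringBuilder();
        for (int[] r : ranges) {
            if (sb.length() > 0) {
                sb.append(" | ");
            }
            if (r[0] == r[1]) {
                sb.append(String.format("'\\u%04X'", r[0]));
            } else {
                sb.append(String.format("'\\u%04X'..'\\u%04X'", r[0], r[1]));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: BMP only (0x0000 to 0xFFFF); surrogates are not alphabetic and are skipped.
        List<int[]> alphabetic = ranges(Character::isAlphabetic, 0x0000, 0xFFFF);
        System.out.println("terminal ID: (" + toTerminalBody(alphabetic) + ")+;");
    }
}
```

The printed rule can be pasted into the grammar. The result depends on the Unicode version bundled with the JDK that runs the generator, so it may need to be regenerated when that changes.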

Thanks in advance,
thSoft
Re: Matching Unicode category in terminal rules [message #651443 is a reply to message #651440] Sun, 30 January 2011 23:36
Dénes Harmath
Hmm, my DSL's specification wasn't correct: it accepts everything above \uA1, so simply
terminal ID: ("a".."z" | "A".."Z" | "¡".."ᅵ")+;
does the trick. Unfortunately, it seems characters above \uFFFF aren't supported; the generator signals the following error:

error(100): ../org.elysium/src-gen/org/elysium/parser/antlr/lexer/InternalLilyPond.g:205:40: syntax error: antlr: ../org.elysium/src-gen/org/elysium/parser/antlr/lexer/InternalLilyPond.g:205:40: expecting CHAR_LITERAL, found ''\uD800\uDC00''

But I think this is not a practically significant issue. :)
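Editor's note: the ''\uD800\uDC00'' in the error message is the UTF-16 surrogate pair for U+10000, the first code point beyond the Basic Multilingual Plane. A plausible reading, offered as an assumption, is that the range bound was written out as two 16-bit code units where a single character literal was expected. The sketch below only illustrates the encoding; `SurrogateDemo` and `escape` are hypothetical names.

```java
public class SurrogateDemo {

    /** Renders a code point as the escaped UTF-16 code units Java uses to store it. */
    static String escape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Last BMP code point: a single 16-bit unit.
        System.out.println(escape(0xFFFF));
        // First supplementary code point: a high and a low surrogate.
        System.out.println(escape(0x10000));
        System.out.println(Character.charCount(0x10000) + " code units");
    }
}
```

Any code point above U+FFFF needs two such units, which is why a range bound in that region cannot be expressed as one 16-bit character.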
Re: Matching Unicode category in terminal rules [message #651468 is a reply to message #651443] Mon, 31 January 2011 09:24
Jan Koehnlein
In Xtext 2.0, we now support Unicode escapes in STRINGs, and thereby in
terminal rules and keywords. We also ship a value converter that checks
for valid characters using Character helper methods.

I don't quite get your problem: there are no characters above \uffff in
16-bit Unicode, and AFAIK, even the last ones are invalid.
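Editor's note: the idea of checking characters with `java.lang.Character` helper methods can be sketched in plain Java as follows. This is not the Xtext value converter itself, whose API is not shown in this thread. It assumes the lexer accepts a permissive ID and that a later step rejects anything that is not alphabetic. `IdCharacterCheck` and `firstInvalidIndex` are hypothetical names.

```java
public class IdCharacterCheck {

    /**
     * Returns the index of the first code point that is not alphabetic,
     * or -1 if every code point in the text is alphabetic.
     */
    static int firstInvalidIndex(String text) {
        int i = 0;
        while (i < text.length()) {
            // Read a full code point so that surrogate pairs are treated as one character.
            int cp = text.codePointAt(i);
            if (!Character.isAlphabetic(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidIndex("abc"));
        System.out.println(firstInvalidIndex("ab1"));
    }
}
```

Iterating by code point rather than by `char` keeps the check correct for characters stored as surrogate pairs.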

On 31.01.11 00:36, Dennis Harmath wrote:
> Hmm, my DSL's specification wasn't correct: it accepts everything above
> \uA1, so simply
> terminal ID: ("a".."z" | "A".."Z" | "¡".."ᅵ")+;
> does the trick. Unfortunately, it seems characters above \uFFFF aren't
> supported, the generator signals the following error:
>
> error(100): ../org.elysium/src-gen/org/elysium/parser/antlr/lexer/InternalLilyPond.g:205:40: syntax error: antlr:
> ../org.elysium/src-gen/org/elysium/parser/antlr/lexer/InternalLilyPond.g:205:40: expecting CHAR_LITERAL, found ''\uD800\uDC00''
>
> But I think this is not a practically significant issue. :)


Re: Matching Unicode category in terminal rules [message #1854324 is a reply to message #651468] Sat, 13 August 2022 21:01
Mirko Raner
I think the OP's question was meant to be about avoiding code like this.

I am struggling with a similar issue where certain identifiers and operators need to comprise entire Unicode character categories (like Letter, Number, Math Symbol, etc.).

Currently, it appears that the only way to accommodate this in Xtext is by listing each individual character (or character range) in the category (as shown in the linked source example). Most regex implementations, on the other hand, have long supported shorthand forms for identifying groups of characters by their Unicode category (e.g., \p{Lu} for uppercase letters, or \p{N} for numbers).
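Editor's note: for comparison, this is what the regex shorthand looks like in `java.util.regex`, which supports Unicode general categories such as `Lu`, `N` and `Sm`. It is a small illustration of the regex feature only and says nothing about Xtext terminal rule syntax. The class name `CategoryRegexDemo` and the sample strings are made up for the example.

```java
import java.util.regex.Pattern;

public class CategoryRegexDemo {

    // One or more uppercase letters (general category Lu).
    static final Pattern UPPER = Pattern.compile("\\p{Lu}+");
    // One or more numbers (general category N, which includes Nd, Nl and No).
    static final Pattern NUMBER = Pattern.compile("\\p{N}+");
    // One or more math symbols (general category Sm).
    static final Pattern MATH = Pattern.compile("\\p{Sm}+");

    static boolean matches(Pattern pattern, String text) {
        return pattern.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(UPPER, "ABC"));
        System.out.println(matches(NUMBER, "2022"));
        System.out.println(matches(MATH, "+<="));
        System.out.println(matches(UPPER, "abc"));
    }
}
```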

Is a shorthand syntax that refers to character groups by their Unicode category something that would be considered for a future version of Xtext? (I might be interested in contributing some code for this.)
Re: Matching Unicode category in terminal rules [message #1854326 is a reply to message #1854324] Sun, 14 August 2022 07:33
Christian Dietrich
@mirko, can you please open an issue at github.com/eclipse/xtext-core?
I have doubts that anyone is reading here.