Eclipse Community Forums: TMF (Xtext) » Matching Unicode category in terminal rules

Matching Unicode category in terminal rules [message #651440]

Sun, 30 January 2011 17:12

Eclipse User

Hi all,

my DSL accepts all Unicode letters (i.e. with the Alphabetic property) in IDs. How could I formulate that as a lexer rule without enumerating all the proper character ranges spread throughout the Unicode planes? I know regexps aren't possible, but I thought of achieving what the \p{Alpha} regular expression does.

Thanks in advance,
thSoft

Re: Matching Unicode category in terminal rules [message #651443 is a reply to message #651440]

Sun, 30 January 2011 18:36

Eclipse User

Hmm, my DSL's specification wasn't correct: it accepts everything above \uA1, so simply
terminal ID: ("a".."z" | "A".."Z" | "¡".."ￜ")+;
does the trick. Unfortunately, it seems characters above \uFFFF aren't supported; the generator signals the following error:

error(100): ../org.elysium/src-gen/org/elysium/parser/antlr/lexer/InternalLilyPond.g:205:40: syntax error: antlr: ../org.elysium/src-gen/org/elysium/parser/antlr/lexer/InternalLilyPond.g:205:40: expecting CHAR_LITERAL, found ''\uD800\uDC00''

But I think this is not a practically significant issue. :)
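
For context, the literal in the error message, '\uD800\uDC00', is the UTF-16 surrogate pair for U+10000, the first code point above \uFFFF. The following is only a minimal Java sketch of that encoding; the class and method names are made up for illustration and nothing here is part of Xtext or ANTLR.

```java
public class SurrogateDemo {
    // Hypothetical helper: renders the UTF-16 code units of one code point
    // as backslash-u escapes with four hex digits each.
    static String utf16Escapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars yields one char for the BMP, two for supplementary code points.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+10000 needs two code units (a high and a low surrogate).
        System.out.println(utf16Escapes(0x10000));
        // U+00A1, the lower bound of the range in the rule above, needs only one.
        System.out.println(utf16Escapes(0xA1));
        System.out.println(Character.isSupplementaryCodePoint(0x10000));
    }
}
```

A lexer that works on single 16-bit characters sees such a pair as two separate units, which would explain why a range bound above \uFFFF is rejected as a CHAR_LITERAL.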

Re: Matching Unicode category in terminal rules [message #651468 is a reply to message #651443]

Mon, 31 January 2011 04:24

Eclipse User

In Xtext 2.0, we now support Unicode escapes in STRINGs, and thereby in
terminal rules and keywords. We also ship a value converter that checks
for valid characters using Character helper methods.
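
A rough idea of what such a check could look like with java.lang.Character is sketched below. This is an assumption for illustration only, not the converter that ships with Xtext: the class name, the method name and the choice of Character.isAlphabetic are all hypothetical.

```java
public class IdCharCheck {
    // Hypothetical check: true if the text is non-empty and every code point
    // has the Unicode Alphabetic property.
    static boolean isAlphabeticId(String text) {
        if (text.isEmpty()) {
            return false;
        }
        // codePoints() iterates by code point, so surrogate pairs count as one character.
        return text.codePoints().allMatch(Character::isAlphabetic);
    }

    public static void main(String[] args) {
        System.out.println(isAlphabeticId("\u00E9t\u00E9")); // accented Latin letters
        System.out.println(isAlphabeticId("abc1"));          // contains a decimal digit
    }
}
```

Iterating over code points rather than chars keeps the check usable for characters outside the Basic Multilingual Plane as well.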

I don't quite get your problem: there are no characters above \uffff in
16-bit Unicode and, AFAIK, even the last ones are invalid.

On 31.01.11 00:36, Dennis Harmath wrote:
> Hmm, my DSL's specification wasn't correct: it accepts everything above
> \uA1, so simply
> terminal ID: ("a".."z" | "A".."Z" | "¡".."ￜ")+;
> does the trick. Unfortunately, it seems characters above \uFFFF aren't
> supported, the generator signals the following error:
>
> error(100): ../org.elysium/src-gen/org/elysium/parser/antlr/lexer/InternalLilyPond.g:205:40: syntax error: antlr:
> ../org.elysium/src-gen/org/elysium/parser/antlr/lexer/InternalLilyPond.g:205:40: expecting CHAR_LITERAL, found ''\uD800\uDC00''
>
> But I think this is not a practically significant issue. :)

Re: Matching Unicode category in terminal rules [message #1854324 is a reply to message #651468]

Sat, 13 August 2022 17:01

Eclipse User

I think the OP's question was meant to be about avoiding code like this.

I am struggling with a similar issue where certain identifiers and operators need to comprise entire Unicode character categories (like Letter, Number, Math Symbol, etc.).

Currently, it appears that the only way to accommodate this in Xtext is by listing each individual character (or character range) in the category (as shown in the linked source example). Most regex implementations, on the other hand, have long supported shorthand forms for identifying groups of characters by their Unicode category (e.g., \p{Lu} for uppercase letters, or \p{N} for numbers).
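
To illustrate the kind of shorthand I mean, here is a small sketch using java.util.regex, which accepts Unicode general categories inside \p{...}. The two patterns and the class are hypothetical examples and not the rules of my actual language.

```java
import java.util.regex.Pattern;

public class CategoryRegexDemo {
    // Hypothetical identifier shape: one Letter (L), then Letters or Numbers (N).
    static final Pattern IDENTIFIER = Pattern.compile("\\p{L}[\\p{L}\\p{N}]*");
    // Hypothetical operator shape: one or more Math Symbols (Sm).
    static final Pattern MATH_OPERATOR = Pattern.compile("\\p{Sm}+");

    static boolean isIdentifier(String s) {
        return IDENTIFIER.matcher(s).matches();
    }

    static boolean isMathOperator(String s) {
        return MATH_OPERATOR.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("x1"));         // letter followed by a digit
        System.out.println(isIdentifier("1x"));         // starts with a digit
        System.out.println(isMathOperator("\u2264"));   // LESS-THAN OR EQUAL TO
    }
}
```

Each category takes a handful of characters to write here, whereas the equivalent terminal rule has to spell out every range by hand.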

Is a shorthand syntax that refers to character groups by their Unicode category something that would be considered for a future version of Xtext? (I might be interested in contributing some code for this)

Re: Matching Unicode category in terminal rules [message #1854326 is a reply to message #1854324]

Sun, 14 August 2022 03:33

Eclipse User

@mirko, can you please open an issue at github.com/eclipse/xtext-core?
I doubt anyone is reading here.
