Eclipse Community Forums: TMF (Xtext) » Matching Unicode category in terminal rules

Matching Unicode category in terminal rules [message #651440]

Sun, 30 January 2011 22:12

Dénes Harmath

Hi all,

my DSL accepts all Unicode letters (i.e. with the Alphabetic property) in IDs. How could I formulate that as a lexer rule without enumerating all the proper character ranges spread throughout the Unicode planes? I know regexps aren't possible, but I thought of achieving what the \p{Alpha} regular expression does.

Thanks in advance,
thSoft

Re: Matching Unicode category in terminal rules [message #651443 is a reply to message #651440]

Sun, 30 January 2011 23:36

Dénes Harmath

Hmm, my DSL's specification wasn't correct: it accepts everything above \uA1, so simply
terminal ID: ("a".."z" | "A".."Z" | "¡".."ￜ")+;
does the trick. Unfortunately, it seems characters above \uFFFF aren't supported; the generator signals the following error:

error(100): ../org.elysium/src-gen/org/elysium/parser/antlr/lexer/InternalLilyPond.g:205:40: syntax error: antlr: ../org.elysium/src-gen/org/elysium/parser/antlr/lexer/InternalLilyPond.g:205:40: expecting CHAR_LITERAL, found ''\uD800\uDC00''

But I think this is not a practically significant issue. :)
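
For context, the literal in the error message, \uD800\uDC00, is the UTF-16 surrogate pair for U+10000, the first code point beyond \uFFFF. A minimal Java sketch of that mapping (the class and method names are made up for illustration):

```java
final class SupplementaryDemo {

    /**
     * Returns the UTF-16 code units of a code point, each written as a
     * backslash-u escape with four uppercase hex digits.
     */
    static String utf16Escapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars yields one char for BMP code points and a
        // high/low surrogate pair for supplementary ones.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }
}
```

Calling SupplementaryDemo.utf16Escapes(0x10000) gives the same two escapes that appear in the ANTLR error, which suggests the generated lexer grammar received a two-unit literal where a single CHAR_LITERAL was expected.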

Re: Matching Unicode category in terminal rules [message #651468 is a reply to message #651443]

Mon, 31 January 2011 09:24

Jan Koehnlein

In Xtext 2.0, we now support Unicode escapes in STRINGs, and thereby in
terminal rules and keywords. We also ship a value converter that checks
for valid characters using Character helper methods.
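
As a rough illustration of the kind of check such a converter could perform, here is a minimal Java sketch built on the JDK's Character helpers. It is not Xtext's shipped converter: the class and method names are hypothetical, and it assumes "valid" means that every code point has the Unicode Alphabetic property, as in the original question.

```java
final class IdCheck {

    /**
     * Returns true if the text is non-empty and consists only of code points
     * with the Unicode Alphabetic property (hypothetical validity rule).
     */
    static boolean isAlphabeticId(String text) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        // codePoints() iterates by code point, so surrogate pairs are
        // checked as one character rather than as two separate chars.
        return text.codePoints().allMatch(Character::isAlphabetic);
    }
}
```

Because the check works on code points, it also handles characters above \uFFFF, which a lexer rule limited to single char literals cannot express.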

I don't quite get your problem: there are no characters above \uffff in
16-bit Unicode and, AFAIK, even the last ones are invalid.

On 31.01.11 00:36, Dennis Harmath wrote:
> Hmm, my DSL's specification wasn't correct: it accepts everything above
> \uA1, so simply
> terminal ID: ("a".."z" | "A".."Z" | "¡".."ￜ")+;
> does the trick. Unfortunately, it seems characters above \uFFFF aren't
> supported, the generator signals the following error:
>
> error(100): ../org.elysium/src-gen/org/elysium/parser/antlr/lexer/InternalLilyPond.g:205:40: syntax error: antlr: ../org.elysium/src-gen/org/elysium/parser/antlr/lexer/InternalLilyPond.g:205:40: expecting CHAR_LITERAL, found ''\uD800\uDC00''
>
> But I think this is not a practically significant issue. :)


Re: Matching Unicode category in terminal rules [message #1854324 is a reply to message #651468]

Sat, 13 August 2022 21:01

Mirko Raner

I think the OP's question was meant to be about avoiding code like this.

I am struggling with a similar issue where certain identifiers and operators need to comprise entire Unicode character categories (like Letter, Number, Math Symbol, etc.).

Currently, it appears that the only way to accommodate this in Xtext is by listing each individual character (or character range) in the category (as shown in the linked source example). Most regex implementations, on the other hand, have long supported shorthand forms for identifying groups of characters by their Unicode category (e.g., \p{Lu} for uppercase letters, or \p{N} for numbers).
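
Until such a shorthand exists, one workaround is to generate the enumeration instead of writing it by hand. The following Java sketch is only an illustration under stated assumptions: the class and method names are hypothetical, it covers only the BMP (up to \uFFFF, given the generator limitation reported earlier in this thread), and it assumes the grammar accepts \uXXXX escapes in character literals, as described above for Xtext 2.0. The resulting ranges reflect the Unicode version of the JDK that runs the code.

```java
import java.util.ArrayList;
import java.util.List;

final class CategoryRanges {

    /**
     * Collects maximal runs {start, end} of BMP code points whose general
     * category equals the given Character type constant,
     * e.g. Character.UPPERCASE_LETTER.
     */
    static List<int[]> bmpRanges(int category) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a run"
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            boolean match = Character.getType(cp) == category;
            if (match && start < 0) {
                start = cp;
            } else if (!match && start >= 0) {
                ranges.add(new int[] {start, cp - 1});
                start = -1;
            }
        }
        if (start >= 0) {
            ranges.add(new int[] {start, 0xFFFF});
        }
        return ranges;
    }

    /** Renders ranges as terminal-rule alternatives joined by " | ". */
    static String toTerminalAlternatives(List<int[]> ranges) {
        StringBuilder sb = new StringBuilder();
        for (int[] r : ranges) {
            if (sb.length() > 0) {
                sb.append(" | ");
            }
            sb.append(literal(r[0]));
            if (r[1] != r[0]) {
                sb.append("..").append(literal(r[1]));
            }
        }
        return sb.toString();
    }

    // Formats one code point as a quoted backslash-u escape (four hex digits).
    private static String literal(int cp) {
        return String.format("'\\u%04X'", cp);
    }
}
```

For example, CategoryRanges.toTerminalAlternatives(CategoryRanges.bmpRanges(Character.UPPERCASE_LETTER)) yields a string that begins with '\u0041'..'\u005A' and can be pasted into a terminal rule body. This keeps the grammar verbose, but the list no longer has to be maintained by hand.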

Is a shorthand syntax that refers to character groups by their Unicode category something that would be considered for a future version of Xtext? (I might be interested in contributing some code for this.)
