Eclipse Community Forums
Unicode beyond U+FFFF accepted in DSL - how to? [message #1767100] Fri, 30 June 2017 22:26
David Black
Messages: 33
Registered: June 2017
Member
Hi,

I need certain Unicode chars to be admitted in identifiers of my DSL. Stuff like mathematical symbols and the like.

Xtext does support Unicode terminals, but they seem to be restricted to the Unicode Basic Multilingual Plane (up to U+FFFF, if I remember correctly). Unfortunately, the characters needed in my problem domain are beyond that, most of them being U+XXXXX (five-hex-digit code points).

Can anyone please suggest how I can overcome this? I will consider any type of hack, because unfortunately I can't implement my DSL without this feature.

Thanks!

David
Re: Unicode beyond U+FFFF accepted in DSL - how to? [message #1767107 is a reply to message #1767100] Sat, 01 July 2017 07:37
Ed Willink
Messages: 7655
Registered: July 2009
Senior Member
Hi

Most Java applications just use char and String mindlessly and so support full Unicode automatically. Problems arise when external multi-byte encodings are decoded to chars. The Xtext Lexer uses short, but probably only for transition ids. The ANTLR parser uses tokens and so should be blind to encoding. I therefore suggest you review the properties of your source file to see whether your characters ever get read correctly. Then debug the initial conversion to confirm that your correct file type is respected. Probably your file is wrong, perhaps a simple bug needs fixing in the initial Xtext reader.
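
A minimal sketch of the kind of check suggested above, assuming the source file is meant to be UTF-8 (the class and method names are made up for illustration, they are not part of Xtext). It decodes the bytes of one supplementary character, U+1D54F, once with the right charset and once with a wrong one, and looks at whether a code point above U+FFFF survives. Note that a Java String stores such a character as two chars (a surrogate pair).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: checks whether supplementary characters survive decoding.
class DecodeCheck {
    // Decodes raw file bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // True if the text contains at least one code point above U+FFFF.
    static boolean hasSupplementary(String text) {
        return text.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        // U+1D54F (MATHEMATICAL DOUBLE-STRUCK CAPITAL X) encoded as UTF-8.
        byte[] utf8 = {(byte) 0xF0, (byte) 0x9D, (byte) 0x95, (byte) 0x8F};

        String good = decode(utf8, StandardCharsets.UTF_8);
        System.out.println(good.length());                         // UTF-16 code units
        System.out.println(good.codePointCount(0, good.length())); // code points
        System.out.println(hasSupplementary(good));

        // Same bytes read with the wrong charset: the character is lost.
        String bad = decode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(bad.length());
        System.out.println(hasSupplementary(bad));
    }
}
```

If the correctly decoded string still contains the supplementary code point, the file reading step is fine and the problem lies further down, in the grammar or the lexer.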

Regards

Ed Willink
Re: Unicode beyond U+FFFF accepted in DSL - how to? [message #1767116 is a reply to message #1767107] Sat, 01 July 2017 13:01
David Black
Messages: 33
Registered: June 2017
Member
The symptom is that Xtext's ANTLR rejects constants such as 'u\000FFF' in my .xtext file.

I've found the cause. It's that ANTLR versions prior to 4.7 only support a part of the Unicode standard, as they state here:

https://github.com/antlr/antlr4/blob/master/doc/unicode.md

Since it looks like Xtext and ANTLR 4 don't play well together, it's game over for me and Xtext.

As a side note, I cannot say I regret it much. Xtext looks powerful, but I find the documentation severely flawed and I feel it's not the easiest toolkit for people with little or no experience in this area to begin playing with parsers.

@Ed, thanks for your help so far!

David
Re: Unicode beyond U+FFFF accepted in DSL - how to? [message #1767120 is a reply to message #1767116] Sat, 01 July 2017 13:32
Ed Willink
Messages: 7655
Registered: July 2009
Senior Member
Hi

I wouldn't give up that quickly. In comparison to many Open Source tools, Xtext documentation is pretty good. There are many examples and a community of shared grammars. There are many users who have been successful with Xtext.

It seems that you've found the problem quite quickly. If there is a slightly different way in which the constants could be passed to ANTLR then it may just require a minor Xtext2ANTLR serialization tweak.

But you mention ANTLR >= 4.7 solving the problem. Currently Xtext still uses ANTLR 3.2. ANTLR 4.5.1 is available from Orbit, so Xtext should really upgrade to 4.5.1. Going beyond that requires a little admin effort to get ANTLR 4.7 approved for inclusion in Orbit.

A quick Google suggests that the CodePointCharStream is the magic fix. Perhaps you could contrive to use a clone in the current code.
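
To illustrate the idea only: a rough sketch of what a code-point-based stream does, assuming nothing about ANTLR's internals. The class below is hypothetical. It is neither ANTLR's CodePointCharStream nor an implementation of the ANTLR 3 CharStream interface; it just shows that lookahead and consume can work on whole code points instead of 16-bit chars, so a character beyond U+FFFF is seen as one symbol.

```java
// Hypothetical stand-in for a code-point-based character stream (not ANTLR's class).
class CodePointStreamSketch {
    static final int EOF = -1;

    private final int[] codePoints;
    private int position = 0;

    CodePointStreamSketch(String input) {
        // String.codePoints() merges each surrogate pair into a single int.
        this.codePoints = input.codePoints().toArray();
    }

    // Look ahead by offset symbols; offset 1 is the current symbol.
    int lookAhead(int offset) {
        int i = position + offset - 1;
        return (i >= 0 && i < codePoints.length) ? codePoints[i] : EOF;
    }

    // Advance by one code point, regardless of how many chars it occupies.
    void consume() {
        if (position < codePoints.length) {
            position++;
        }
    }

    int size() {
        return codePoints.length;
    }

    public static void main(String[] args) {
        String text = "x" + new String(Character.toChars(0x1D54F)) + "y";
        CodePointStreamSketch stream = new CodePointStreamSketch(text);
        while (stream.lookAhead(1) != EOF) {
            System.out.printf("U+%04X%n", stream.lookAhead(1));
            stream.consume();
        }
    }
}
```

Whether something like this can be wired into the lexer that Xtext generates is a separate question; the sketch only shows the data structure.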

Regards

Ed Willink
Re: Unicode beyond U+FFFF accepted in DSL - how to? [message #1767124 is a reply to message #1767120] Sat, 01 July 2017 14:07
Christian Dietrich
Messages: 14661
Registered: July 2009
Senior Member
the problem with antlr4: big ton of work.
problem with big ton of work: nobody to do it.
if you want to take over ...


Twitter : @chrdietrich
Blog : https://www.dietrich-it.de
Re: Unicode beyond U+FFFF accepted in DSL - how to? [message #1767126 is a reply to message #1767124] Sat, 01 July 2017 14:49
Ed Willink
Messages: 7655
Registered: July 2009
Senior Member
Hi

My intuitive comment was that ANTLR should be using tokens and so should not care about Unicode at all. However you observed that 'u\000FFF' doesn't work. I don't understand ANTLR at all, but I wonder if the problems you are seeing are in some debug support text strings rather than the primary functionality. You might be able to tweak what is presented for debugging.

Regards

Ed Willink
