Eclipse Community Forums: IMP

Unicode [message #537737]

Thu, 03 June 2010 14:05

Magisterion

Messages: 10
Registered: June 2010

Junior Member

How do I add support for UTF-8/Unicode symbols?

Re: Unicode [message #537816 is a reply to message #537737]

Thu, 03 June 2010 17:05

Robert M. Fuhrer

Messages: 294
Registered: July 2009

Senior Member

On 6/3/10 10:05 AM, Magisterion wrote:
> How do I add support for UTF-8/Unicode symbols?

This is really a scanner issue. If you're using LPG to build your scanner,
then you basically have to write your scanner rules so as to permit UTF-8
characters in the right places, which is probably in literal strings and
perhaps identifiers. I can't really speak for other kinds of lexers, since
I don't have experience using them for UTF-8.
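
As a rough illustration of what permitting such characters "in the right places" can mean for identifiers, here is a minimal Python sketch. It is not LPG syntax and not part of IMP; the function names and the rule itself (any code point above the ASCII range counts as an identifier character) are assumptions made for the example. A real language would probably restrict identifiers to Unicode letters.

```python
def is_ident_start(ch: str) -> bool:
    """ASCII letter, underscore, or any code point above the ASCII range."""
    return (
        ch == "_"
        or "a" <= ch <= "z"
        or "A" <= ch <= "Z"
        or ord(ch) > 0x7F  # assumption: treat every non-ASCII character as a letter
    )


def is_ident_part(ch: str) -> bool:
    """Characters allowed after the first one: start characters plus ASCII digits."""
    return is_ident_start(ch) or "0" <= ch <= "9"


def scan_identifier(text: str, start: int = 0) -> str:
    """Return the longest identifier beginning at `start`, or '' if there is none."""
    if start >= len(text) or not is_ident_start(text[start]):
        return ""
    end = start + 1
    while end < len(text) and is_ident_part(text[end]):
        end += 1
    return text[start:end]
```

With this rule, scanning `größe = 1` yields the whole word `größe` as one identifier instead of stopping at the first non-ASCII letter.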

Note that at this point, the LPG scanner driver doesn't understand the
"Byte Order Mark" (BOM) that can appear at the beginning of a Unicode byte
stream (it identifies the stream's encoding, e.g., UTF-8, UTF-16
little-endian, etc.). As a result, the caller of the scanner has to be
sure it's passing a kind of stream that the scanner is prepared to handle.
There's work in progress to enhance the LPG driver and the runtime API in
this area, so that (a) the driver recognizes self-identifying streams that
start with a BOM, and (b) the client can tell the scanner programmatically
what kind of encoding is being used in the stream so that it knows how
many bytes to read for each symbol, and with what endianness.
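
Until the driver handles this itself, the caller can inspect the first bytes and decode the stream before handing text to the scanner. The following Python sketch shows the idea; it is not LPG or IMP code, and the helper name `decode_for_scanner` and the UTF-8 default are assumptions made for the example.

```python
import codecs

# Check the longer marks first: the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_for_scanner(data: bytes, default: str = "utf-8") -> str:
    """Strip a leading BOM, if present, and decode with the encoding it names.

    Without a BOM, fall back to `default`, which the caller has to choose.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(default)
```

The fallback parameter corresponds to point (b) above: when the stream does not identify itself, the client still has to say which encoding it is passing in.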

--
Cheers,
-- Bob

--------------------------------
Robert M. Fuhrer
Research Staff Member
Programming Technologies Dept.
IBM T.J. Watson Research Center

IDE Meta-tooling Platform Project Lead (http://www.eclipse.org/imp)
X10: Productive High-Performance Parallel Programming (http://x10.sf.net)

Re: Unicode [message #537957 is a reply to message #537816]

Fri, 04 June 2010 10:24

Magisterion

Messages: 10
Registered: June 2010

Junior Member

[Updated on: Fri, 04 June 2010 10:24]

Re: Unicode [message #538056 is a reply to message #537957]

Fri, 04 June 2010 17:17

Robert M. Fuhrer

Messages: 294
Registered: July 2009

Senior Member

On 6/4/10 6:24 AM, Magisterion wrote:
> Thanks, Robert! =)
> But I need Unicode just for comments. With the default rules, when I
> type a comment like "//...unicode symbols", the comment ends at those
> symbols. I solved this problem with the following rule:
> notEOL ::= letter | digit | special | Space | HT | FF | AfterASCII

Yes, that looks reasonable, and much like what I've seen in a couple of
other LPG scanner specs.
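
To show why the extra alternative matters, here is a small Python sketch of a line-comment scanner with and without an "AfterASCII" class. It is only an illustration, not LPG output: the names are made up, and the ASCII set below (printable characters plus tab and form feed) is an assumed approximation of letter | digit | special | Space | HT | FF.

```python
# Assumed stand-in for: letter | digit | special | Space | HT | FF
ASCII_NOT_EOL = {chr(c) for c in range(0x20, 0x7F)} | {"\t", "\f"}


def is_not_eol(ch: str, allow_after_ascii: bool) -> bool:
    """True if `ch` may appear inside a line comment."""
    if ch in ASCII_NOT_EOL:
        return True
    # "AfterASCII": any code point above the 7-bit ASCII range.
    return allow_after_ascii and ord(ch) > 0x7F


def scan_line_comment(text: str, allow_after_ascii: bool = True) -> str:
    """Return the '//' comment at the start of `text`, or '' if there is none."""
    if not text.startswith("//"):
        return ""
    end = 2
    while end < len(text) and is_not_eol(text[end], allow_after_ascii):
        end += 1
    return text[:end]
```

For the input `// привет` followed by a newline, the scanner returns the whole comment when AfterASCII is allowed, but stops right after `// ` when it is not, which matches the behaviour described in the quoted message.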

Glad to help!

--
Cheers,
-- Bob

Re: Unicode [message #548719 is a reply to message #537737]

Thu, 22 July 2010 17:33

Werner Keil

Messages: 1087
Registered: July 2009

Senior Member

Try Eclipse Babel (eclipse.technology.babel); otherwise ICU4J, the
Unicode framework by IBM and others that Eclipse uses, is also a good
place to look.

Not many projects use it directly; our effort with UOMo is probably the
only notable exception. Feel free to get in touch via "eclipse.uomo"; it is
probably best to highlight "Unicode" in the subject.

HTH,
Werner