Eclipse Community Forums
Unicode [message #537737] Thu, 03 June 2010 14:05
Magisterion
Messages: 10
Registered: June 2010
Junior Member
How do I add support for UTF-8/Unicode symbols?
Re: Unicode [message #537816 is a reply to message #537737] Thu, 03 June 2010 17:05
Robert M. Fuhrer
Messages: 294
Registered: July 2009
Senior Member
On 6/3/10 10:05 AM, Magisterion wrote:
> How do I add support for UTF-8/Unicode symbols?

This is really a scanner issue. If you're using LPG to build your scanner,
then you basically have to write your scanner rules so as to permit UTF-8
characters in the right places, which is probably in literal strings and
perhaps identifiers. I can't really speak for other kinds of lexers, since
I don't have experience using them for UTF-8.

Note that at this point, the LPG scanner driver doesn't understand the
"Byte Order Mark" (BOM) that can appear at the beginning of Unicode byte
streams (it identifies the encoding of the stream, e.g., UTF-8 or UTF-16
little-endian). As a result, the caller of the scanner has to be sure
it's passing a kind of stream that the scanner is prepared to handle.
There's work in progress to enhance the LPG driver and the runtime API in
this area, so that (a) the driver recognizes self-identifying streams that
start with a BOM, and (b) the client can tell the scanner programmatically
what kind of encoding is being used in the stream, so that it knows how
many bytes to read for each symbol, and with what endianness.
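
For illustration, here is a minimal sketch of the kind of BOM check a caller
might do itself before handing characters to a scanner. BomSniffer and its
methods are hypothetical names, not part of LPG or its runtime. The sketch
assumes the caller already holds the raw bytes in memory, and it only covers
UTF-8 and UTF-16 (UTF-32 marks are deliberately left out).

```java
// Hypothetical helper, not part of LPG: looks at the first bytes of a buffer
// and reports which encoding a leading byte order mark indicates.
final class BomSniffer {
    private BomSniffer() {}

    /** Returns the encoding signalled by a leading BOM, or null if there is none. */
    static String detect(byte[] b) {
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE"; // UTF-32LE not handled here
        return null;
    }

    /** Number of bytes occupied by the BOM for the detected encoding (0 if none). */
    static int bomLength(String encoding) {
        if (encoding == null) return 0;
        return encoding.equals("UTF-8") ? 3 : 2;
    }

    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    /** Decodes the buffer to chars, skipping any BOM; uses defaultEncoding if no BOM is found. */
    static char[] decode(byte[] b, String defaultEncoding) {
        String detected = detect(b);
        int skip = bomLength(detected);
        String name = (detected != null) ? detected : defaultEncoding;
        java.nio.charset.Charset cs = java.nio.charset.Charset.forName(name);
        return new String(b, skip, b.length - skip, cs).toCharArray();
    }
}
```

The resulting char[] no longer contains the BOM, so the scanner only ever
sees ordinary characters and never has to know how the file was encoded.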

--
Cheers,
-- Bob

--------------------------------
Robert M. Fuhrer
Research Staff Member
Programming Technologies Dept.
IBM T.J. Watson Research Center

IDE Meta-tooling Platform Project Lead (http://www.eclipse.org/imp)
X10: Productive High-Performance Parallel Programming (http://x10.sf.net)
Re: Unicode [message #537957 is a reply to message #537816] Fri, 04 June 2010 10:24
Magisterion
Messages: 10
Registered: June 2010
Junior Member
Thanks, Robert! =)
But I need Unicode only in comments. With the default rules, when I type a comment like "//...unicode symbols", the comment ends at those symbols.
I solved this problem with the following rule:
notEOL ::= letter | digit | special | Space | HT | FF | AfterASCII
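
To show why adding AfterASCII to notEOL helps, here is a rough sketch in
plain Java of the same idea. It is not LPG-generated code: CharKinds, its
constants, and the exact membership of each class (for example, counting
'_' as a letter) are assumptions made for the example. The point is only
that every character above the ASCII range falls into one class, and a
"//" comment keeps going as long as that class is allowed.

```java
// Illustrative sketch only; CharKinds and its constants are hypothetical,
// not what LPG emits for a scanner specification.
final class CharKinds {
    static final int LETTER = 0, DIGIT = 1, SPECIAL = 2, SPACE = 3, HT = 4, FF = 5,
                     EOL = 6, AFTER_ASCII = 7, OTHER = 8;

    private CharKinds() {}

    /** Maps a character to a coarse class, much like a scanner's character map. */
    static int kindOf(char c) {
        if (c >= 0x80) return AFTER_ASCII;          // all non-ASCII chars share one class
        if (c == '\n' || c == '\r') return EOL;
        if (c == ' ') return SPACE;
        if (c == '\t') return HT;
        if (c == '\f') return FF;
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_') return LETTER;
        if (c >= '0' && c <= '9') return DIGIT;
        if (c > ' ' && c < 0x7F) return SPECIAL;    // remaining printable ASCII
        return OTHER;                               // other control characters
    }

    /** Mirrors: notEOL ::= letter | digit | special | Space | HT | FF | AfterASCII */
    static boolean isNotEol(char c, boolean allowAfterAscii) {
        int k = kindOf(c);
        if (k == AFTER_ASCII) return allowAfterAscii;
        return k == LETTER || k == DIGIT || k == SPECIAL
            || k == SPACE || k == HT || k == FF;
    }

    /** Returns the end index (exclusive) of a "//" comment that begins at start. */
    static int scanLineComment(String src, int start, boolean allowAfterAscii) {
        int i = start + 2; // skip the leading "//"
        while (i < src.length() && isNotEol(src.charAt(i), allowAfterAscii)) {
            i++;
        }
        return i;
    }
}
```

With allowAfterAscii set to false the scan stops at the first non-ASCII
character, which is the behaviour described above; with it set to true the
comment runs to the end of the line.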

Re: Unicode [message #538056 is a reply to message #537957] Fri, 04 June 2010 17:17
Robert M. Fuhrer
Messages: 294
Registered: July 2009
Senior Member
On 6/4/10 6:24 AM, Magisterion wrote:
> Thanks, Robert! =)
> But I need Unicode only in comments. With the default rules, when I type
> a comment like "//...unicode symbols", the comment ends at those symbols.
> I solved this problem with the following rule:
> notEOL ::= letter | digit | special | Space | HT | FF | AfterASCII

Yes, that looks reasonable, and much like what I've seen in a couple of
other LPG scanner specs.

Glad to help!

--
Cheers,
-- Bob

Re: Unicode [message #548719 is a reply to message #537737] Thu, 22 July 2010 17:33
Werner Keil
Messages: 1087
Registered: July 2009
Senior Member
Try Eclipse Babel (eclipse.technology.babel). Otherwise ICU4J, the Unicode
framework by IBM and others that Eclipse itself uses, is also a good place
to look.

Not many projects use it directly; our effort with UOMo is probably the
only notable exception. Feel free to get in touch via "eclipse.uomo"; it is
probably best to highlight "Unicode" in the subject.

HTH,
Werner