|
Re: Unicode [message #537816 is a reply to message #537737] |
Thu, 03 June 2010 17:05 |
Robert M. Fuhrer Messages: 294 Registered: July 2009 |
Senior Member |
|
|
On 6/3/10 10:05 AM, Magisterion wrote:
> How to add UTF8/Unicode symbols support?
This is really a scanner issue. If you're using LPG to build your scanner,
then you basically have to write your scanner rules so as to permit UTF-8
characters in the right places, which is probably in literal strings and
perhaps identifiers. I can't really speak for other kinds of lexers, since
I don't have experience using them for UTF-8.
Note that at this point, the LPG scanner driver doesn't understand the
"Byte Order Mark" (BOM) that can appear at the beginning of Unicode byte
streams (which identifies the encoding of the stream, e.g., UTF-8 or
UTF-16 little-endian). As a result, the caller of the
scanner has to be sure it's passing a kind of stream that the scanner is
prepared to handle. There's work in progress to enhance the LPG driver
and the runtime API in this area, so that (a) the driver recognizes
self-identifying streams with a BOM, and (b) the client can tell the
scanner programmatically what kind of encoding is being used in the
stream so that it knows how many bytes to read for each symbol, and with
what endianness.
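Until the driver handles this itself, the caller can check for a BOM and decode the bytes before the scanner sees them. Below is a minimal sketch in plain Java. It does not use any LPG API. The class name BomSniffer and its methods are hypothetical, and falling back to UTF-8 when no BOM is present is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of LPG: detects a leading BOM and decodes
// the remaining bytes into a char[] that could be handed to a scanner.
class BomSniffer {
    /** Charset chosen for the stream and the number of BOM bytes to skip. */
    static final class Result {
        final Charset charset;
        final int bomLength;
        Result(Charset charset, int bomLength) {
            this.charset = charset;
            this.bomLength = bomLength;
        }
    }

    /** Recognizes the UTF-8, UTF-16BE and UTF-16LE BOMs (UTF-32 is not handled). */
    static Result detect(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return new Result(StandardCharsets.UTF_8, 3);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return new Result(StandardCharsets.UTF_16BE, 2);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return new Result(StandardCharsets.UTF_16LE, 2);
        }
        // Assumption: with no BOM, treat the input as UTF-8.
        return new Result(StandardCharsets.UTF_8, 0);
    }

    /** Skips the BOM, if any, and decodes the rest with the detected charset. */
    static char[] decode(byte[] b) {
        Result r = detect(b);
        return new String(b, r.bomLength, b.length - r.bomLength, r.charset)
                .toCharArray();
    }
}
```

The resulting char[] contains no BOM, so the scanner only ever sees decoded characters and never has to care about byte width or endianness.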
--
Cheers,
-- Bob
--------------------------------
Robert M. Fuhrer
Research Staff Member
Programming Technologies Dept.
IBM T.J. Watson Research Center
IDE Meta-tooling Platform Project Lead (http://www.eclipse.org/imp)
X10: Productive High-Performance Parallel Programming (http://x10.sf.net)
|
|
|
Re: Unicode [message #537957 is a reply to message #537816] |
Fri, 04 June 2010 10:24 |
Magisterion Messages: 10 Registered: June 2010 |
Junior Member |
|
|
Thanks, Robert! =)
But I need Unicode only in comments. With the default rules, when I type a comment like "//...unicode symbols", the comment ends at the first such symbol.
I solved this problem with the following rule:
notEOL ::= letter | digit | special | Space | HT | FF | AfterASCII
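To illustrate why adding AfterASCII to notEOL helps, here is a rough sketch in plain Java of the kind of character classification such a rule relies on. It is not LPG's generated code. The class, the enum, and the assumption that every character above 127 falls into a single AfterASCII kind are for illustration only.

```java
// Hypothetical illustration, not LPG's generated lexer: each character is
// mapped to a "kind", and a line comment keeps consuming characters as long
// as their kind is one of the notEOL alternatives.
class CommentCharKinds {
    enum Kind { LETTER, DIGIT, SPECIAL, SPACE, HT, FF, EOL, AFTER_ASCII }

    static Kind kindOf(char c) {
        if (c >= 128) return Kind.AFTER_ASCII;   // assumption: all non-ASCII share one kind
        if (c == '\n' || c == '\r') return Kind.EOL;
        if (c == ' ') return Kind.SPACE;
        if (c == '\t') return Kind.HT;
        if (c == '\f') return Kind.FF;
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return Kind.LETTER;
        if (c >= '0' && c <= '9') return Kind.DIGIT;
        return Kind.SPECIAL;                     // simplification: any other ASCII character
    }

    /** Mirrors the notEOL rule; allowAfterAscii toggles the extra alternative. */
    static boolean isNotEOL(char c, boolean allowAfterAscii) {
        Kind k = kindOf(c);
        if (k == Kind.EOL) return false;
        return k != Kind.AFTER_ASCII || allowAfterAscii;
    }

    /** Returns the index just past a "//" comment that begins at start. */
    static int commentEnd(String src, int start, boolean allowAfterAscii) {
        int i = start + 2;                       // skip the leading "//"
        while (i < src.length() && isNotEOL(src.charAt(i), allowAfterAscii)) {
            i++;
        }
        return i;
    }
}
```

With allowAfterAscii set to false, the comment stops at the first non-ASCII character, which matches the behaviour described above. With it set to true, the comment runs to the end of the line.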
[Updated on: Fri, 04 June 2010 10:24]