Invalid XML character (Unicode: 0x0) exception with xmlVersion 1.1 in Xtext2 [message #694077] |
Thu, 07 July 2011 18:54 |
Ajit Dingankar Messages: 12 Registered: July 2011 |
Junior Member |
|
|
I tried to modify the "Greetings" example by adding a single Unicode character
rule between 'Hello' and name, similar to posting 651440 in this forum
(can't post links since this is my first post ;)
I got the Java runtime exception "Invalid XML character (Unicode: 0x0)".
I found a similar bug (id=319822) for character 0x8.
The first comment there says: "The easiest way to fix this is to save the
grammar's XMI in XML 1.1." So I tried that in my MWE2 work-flow, even though
Comment 4 mentions problems on "Linux/Sun-JDK-1.6.0_17, reading in
XML-1.1" since I'm using OpenJDK 1.6.0_22 (on 32-bit Linux i386/i686). I've
stepped through GrammarAccessFragment.generate() to make sure that the XML
version for the XMLResource is set to "1.1" before save() is called.
Digging a bit deeper, it looks like after converting the special characters in
XMLSaveImpl, the checks for isValid and isHighSurrogate (for range D800-DBFF)
fail, triggering the exception.
The error disappears when I change the rule to something like:
terminal WORD: '\uFFC0'..'\uFFDF';
I can push the higher end of the range to FFFD, but it fails on FFFE (and FFFF).
The initialization of the character flag array CHARS explains this (the last
sub-range to be initialized to non-zero values is E000-FFDF).
Suggestion for a possible solution: in XMLSaveImpl, remove the check for invalid
XML chars at line 3367 (keeping only the check for high surrogates at lines
3369-3397) and also the exception at line 3400. The else clause would then merge
into the new code that already handles proper encoding support.
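For reference, the kind of check described above can be sketched roughly as
follows. This is not the EMF XMLSaveImpl code; the class and method names are
made up for illustration, and it assumes the XML 1.1 Char production is the
rule being enforced.

```java
public class XmlCharScan {

    // XML 1.1 Char production:
    // [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml11Char(int codePoint) {
        return (codePoint >= 0x1 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Index of the first UTF-16 unit that cannot be written as XML 1.1,
    // or -1 if every character in the string is allowed.
    static int firstInvalidIndex(String s) {
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate (D800-DBFF) is only legal as the first
                // half of a pair that encodes a supplementary code point.
                if (i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i += 2;
                    continue;
                }
                return i; // unpaired high surrogate
            }
            if (!isXml11Char(c)) {
                return i; // 0x0, a lone low surrogate, FFFE or FFFF
            }
            i++;
        }
        return -1;
    }

    public static void main(String[] args) {
        String withNul = "Hello " + (char) 0x0 + " world";
        String withFffd = "Hello " + (char) 0xFFFD;
        String withFffe = "Hello " + (char) 0xFFFE;
        System.out.println(firstInvalidIndex(withNul));
        System.out.println(firstInvalidIndex(withFffd));
        System.out.println(firstInvalidIndex(withFffe));
    }
}
```

With a check like this, a WORD range ending at FFFD passes, while 0x0, FFFE
and FFFF are rejected regardless of the XML version setting.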
Any help will be greatly appreciated!
Thanks,
Ajit
====
--- MWE2 workflow snippet ---
fragment = grammarAccess.GrammarAccessFragment {
    xmlVersion = "1.1"
}
--- Grammar ---
grammar org.xtext.example.mydsl.MyDsl // with org.eclipse.xtext.common.Terminals
// Have to replicate the common.Terminals grammar except the ANY_OTHER rule
// since it hides WORD matching rule
hidden(WS, ML_COMMENT, SL_COMMENT)
generate myDsl ...
import ... as ecore
Model:
    greetings+=Greeting*;
Greeting:
    'Hello' badword=WORD name=ID '!';
terminal WORD: '\u0000'..'\uFFFF';
terminal ID : '^'?('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
terminal INT returns ecore::EInt: ('0'..'9')+;
terminal STRING :
'"' ( '\\' ('b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\') | !('\\'|'"') )* '"' |
"'" ( '\\' ('b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\') | !('\\'|"'") )* "'"
;
terminal ML_COMMENT : '/*' -> '*/';
terminal SL_COMMENT : '//' !('\n'|'\r')* ('\r'? '\n')?;
terminal WS : (' '|'\t'|'\r'|'\n')+;
|
|
|
Re: Invalid XML character (Unicode: 0x0) exception with xmlVersion 1.1 in Xtext2 [message #694142 is a reply to message #694077] |
Thu, 07 July 2011 22:03 |
Ed Merks Messages: 33140 Registered: July 2009 |
Senior Member |
|
|
If you look at http://www.w3.org/TR/xml/#charsets you'll see it specifies
these as valid XML characters:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Note that it says nothing about how they're encoded. As such, something like
#x0 is invalid even if you encode it as an entity. Of course XML 1.1 extends
this to allow some additional ASCII characters in the lower range, but it
appears to me that FFFE and FFFF are not valid XML characters ever, even for
XML 1.1, and under no circumstance should we be serializing that into an XML
file because an XML processor will not be able to read it.

It all makes me wonder what you're trying to accomplish by specifying WORD
the way you have...
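To make the difference between the two XML versions concrete, here is a small
sketch that encodes the Char production quoted above next to the XML 1.1 one
([#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). The class and method
names are hypothetical and this is not how EMF implements the check; it only
illustrates which code points each version admits.

```java
public class XmlCharRanges {

    // XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    //          | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // XML 1.1: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // (most control characters in this set may only appear as
    // character references, but they are still part of Char)
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        int[] samples = {0x0, 0x8, 0x9, 0xD800, 0xFFFD, 0xFFFE, 0xFFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X  xml1.0=%b  xml1.1=%b%n",
                    cp, isXml10Char(cp), isXml11Char(cp));
        }
    }
}
```

Switching to XML 1.1 only changes the answer for the control characters below
0x20 other than tab, LF and CR (for example 0x8 from the bug report); 0x0, the
surrogate blocks, FFFE and FFFF stay invalid in both versions.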
Ed Merks
Professional Support: https://www.macromodeling.com/