Eclipse Community Forums: TMF (Xtext) » Invalid XML character (Unicode: 0x0) exception with xmlVersion 1.1 in Xtext2

Invalid XML character (Unicode: 0x0) exception with xmlVersion 1.1 in Xtext2 [message #694077]

Thu, 07 July 2011 18:54

Ajit Dingankar

Messages: 12
Registered: July 2011

Junior Member

I tried to modify the "Greetings" example by adding a single Unicode character
rule between 'Hello' and name, similar to posting 651440 in this forum
(I can't post links since this is my first post ;) ).

I got the Java runtime exception "Invalid XML character (Unicode: 0x0)".
I found a similar bug (id=319822) for character 0x8.
The first comment there says: "The easiest way to fix this is to save the
grammar's XMI in XML 1.1." So I tried that in my MWE2 workflow, even though
Comment 4 mentions problems on "Linux/Sun-JDK-1.6.0_17, reading in
XML-1.1"; I'm using OpenJDK 1.6.0_22 (on 32-bit Linux i386/i686). I've
stepped through GrammarAccessFragment.generate() to make sure that the XML
version for the XMLResource is set to "1.1" before save() is called.

Digging a bit deeper, it looks like after converting the special characters in
XMLSaveImpl, the checks for isValid and isHighSurrogate (for range D800-DBFF)
fail, triggering the exception.

The error disappears when I change the rule to something like:
terminal WORD: '\uFFC0'..'\uFFDF';
I can push the upper end of the range to FFFD, but it fails on FFFE (and FFFF).
The initialization of the character flag array CHARS explains this (the last
sub-range to be initialized to non-zero values is E000-FFFD).

Suggestion for a possible solution: in XMLSaveImpl, remove the check for invalid
XML chars at line 3367 (keeping only the check for high surrogates at lines
3369-3397) and also the exception at line 3400. The else clause will then merge
into the new code that already handles proper encoding support.

Any help will be greatly appreciated!

Thanks,
Ajit
====

--- MWE2 workflow snippet ---
fragment = grammarAccess.GrammarAccessFragment {
    xmlVersion = "1.1"
}
--- Grammar ---
grammar org.xtext.example.mydsl.MyDsl // with org.eclipse.xtext.common.Terminals
// Have to replicate the common.Terminals grammar except the ANY_OTHER rule
// since it hides WORD matching rule
hidden(WS, ML_COMMENT, SL_COMMENT)

generate myDsl ...

import ... as ecore

Model:
greetings+=Greeting*;

Greeting:
'Hello' badword=WORD name=ID '!';

terminal WORD: '\u0000'..'\uFFFF';

terminal ID : '^'?('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
terminal INT returns ecore::EInt: ('0'..'9')+;
terminal STRING :
'"' ( '\\' ('b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\') | !('\\'|'"') )* '"' |
"'" ( '\\' ('b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\') | !('\\'|"'") )* "'"
;
terminal ML_COMMENT : '/*' -> '*/';
terminal SL_COMMENT : '//' !('\n'|'\r')* ('\r'? '\n')?;

terminal WS : (' '|'\t'|'\r'|'\n')+;

Re: Invalid XML character (Unicode: 0x0) exception with xmlVersion 1.1 in Xtext2 [message #694142 is a reply to message #694077]

Thu, 07 July 2011 22:03

Ed Merks

Messages: 33141
Registered: July 2009

Senior Member

If you look at http://www.w3.org/TR/xml/#charsets you'll see it specifies these
as valid XML characters:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Note that it says nothing about how they're encoded. As such, something like
#x0 is invalid even if you encode it as an entity. Of course XML 1.1 extends
this to allow some additional ASCII characters in the lower range, but it
appears to me that FFFE and FFFF are not valid XML characters ever, even for
XML 1.1, and under no circumstance should we be serializing that into an XML
file because an XML processor will not be able to read it.

It all makes me wonder what you're trying to accomplish by specifying WORD the
way you have...
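
For reference, the Char production quoted above can be written as a small
predicate. The following is only a sketch and is not the code in EMF's
XMLSaveImpl; the class and method names are made up for illustration. The
XML 1.1 variant follows the Char production of the XML 1.1 specification,
which adds #x1-#x1F but still excludes #x0, the surrogate blocks, FFFE and FFFF.

```java
final class XmlCharCheck {
    /** XML 1.0 Char production, as quoted from the specification. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** XML 1.1 Char production: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. */
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        // Code points mentioned in this thread, plus an ordinary letter.
        int[] samples = {0x0, 0x8, 0x41, 0xD800, 0xFFFD, 0xFFFE, 0xFFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X  xml1.0=%b  xml1.1=%b%n",
                    cp, isXml10Char(cp), isXml11Char(cp));
        }
    }
}
```

Under these two productions, 0x8 is accepted only by XML 1.1, FFFD is accepted
by both, and 0x0, FFFE and FFFF are rejected by both, which matches the
behaviour reported for the WORD rule.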

Ed Merks
Professional Support: https://www.macromodeling.com/

Re: Invalid XML character (Unicode: 0x0) exception with xmlVersion 1.1 in Xtext2 [message #694160 is a reply to message #694142]

Thu, 07 July 2011 23:52

Ajit Dingankar

Messages: 12
Registered: July 2011

Junior Member

Hi Ed!
Thanks for your reply. I understand the situation better now, but I still think
that the expressive power of an Xtext language (the range of input it can match)
should not be limited by the choice of serialization, viz. XML. Instead of
getting into the details of Unicode, XML etc., let me explain the original
motivation for trying the WORD rule.

I want to generate a parser to process a binary stream of data (which seems to
be possible with ANTLR, though I haven't tried it), hence I need to match the
input as BYTE (an arbitrary 8-bit character) or WORD (two bytes) at a time.

Thanks,
Ajit
====
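
One common way to present binary data to a character-based lexer is to decode
the bytes as ISO-8859-1, which maps every byte value 0x00-0xFF to the character
with the same number. The sketch below only shows that mapping with the JDK;
the class and method names are hypothetical, it is not an Xtext feature, and it
says nothing about whether a generated lexer handles such input correctly. A
grammar that mentions '\u0000' would also still hit the XML restriction
discussed above.

```java
import java.nio.charset.StandardCharsets;

final class ByteStreamAsChars {
    /** Decodes each byte to the char with the same numeric value (U+0000..U+00FF). */
    static String toLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    /** Inverse mapping; only lossless for strings whose chars are all <= U+00FF. */
    static byte[] fromLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = {0x00, 0x7F, (byte) 0x80, (byte) 0xFF};
        String chars = toLatin1(raw);
        for (int i = 0; i < chars.length(); i++) {
            // Mask with 0xFF because Java bytes are signed.
            System.out.printf("byte 0x%02X -> char U+%04X%n",
                    raw[i] & 0xFF, (int) chars.charAt(i));
        }
    }
}
```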

Re: Invalid XML character (Unicode: 0x0) exception with xmlVersion 1.1 in Xtext2 [message #704433 is a reply to message #694160]

Thu, 28 July 2011 16:11

Ajit Dingankar

Messages: 12
Registered: July 2011

Junior Member

OK, I tried ANTLR and it doesn't seem to handle characters with codes above 127
correctly, but Flex works with the "-8" switch to generate an 8-bit lexer. So
I'll be generating a Flex specification from my Xtext model.

If 8-bit characters were supported on input, I'd generate another Xtext model
and stay completely within Xtext. ;)

Regards,
Ajit
====
