Eclipse Community Forums
Ignoring whitespace before XML header [message #1800487] Sun, 30 December 2018 10:48
Andreas Graf
Hi all,

we have been creating an EMF model for an existing XML from a commercial tool.
Everything works fine, except that the commercial tool creates whitespace in front of the XML header, which seems to violate the XML spec, and we see the following exception:

Exception in thread "main" org.eclipse.emf.ecore.resource.impl.ResourceSetImpl$1DiagnosticWrappedException: org.xml.sax.SAXParseExceptionpublicId: file:/C:/Z03_P-I/arwebinar/runtime-EclipseApplication/adtf/src/small.ddl.description; systemId: file:/C:/Z03_P-I/arwebinar/runtime-EclipseApplication/adtf/src/small.ddl.description; lineNumber: 2; columnNumber: 6; The processing instruction target matching "[xX][mM][lL]" is not allowed.
at org.eclipse.emf.ecore.resource.impl.ResourceSetImpl.handleDemandLoadException(ResourceSetImpl.java:319)

If we remove the whitespace so that the file immediately starts with <?xml version="1.0" encoding="iso-8859-1" standalone="no"?> it works fine.

However, since the number of XML files is quite large, it would be convenient for the users if we could avoid any (manual) preprocessing scripts.

Is anyone aware of a possible solution that we could integrate into our resource loading? (We do have custom resource and resource set implementations.)

Thanks,

Andreas
Re: Ignoring whitespace before XML header [message #1800488 is a reply to message #1800487] Sun, 30 December 2018 11:04
Ed Merks
I'm surprised that any tool would create invalid XML. No SAX processor will/can process such a file so I'm not sure how such a tool itself can read such a corrupt thing.

I can't imagine a clean way to deal with such a thing. While you might use an input stream that supports mark and reset to skip leading white space before forwarding the stream for regular loading, you don't know the encoding being used so you won't be able to properly decode the leading white space. That's why an XML file must start with < so that the XML processor can determine the encoding (e.g., UTF-8/16/32) while processing the <?xml processing instruction itself.
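The mark-and-reset idea can be sketched as follows. This is only a minimal illustration that assumes an ASCII-compatible single-byte encoding (such as the iso-8859-1 declared in the original post), which is exactly the assumption that cannot be verified in general. The class and method names are hypothetical and not part of EMF.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

final class LeadingWhitespaceSkipper {

    /**
     * Returns a stream positioned at the first non-whitespace byte.
     * Assumption: whitespace is encoded as single ASCII bytes.
     */
    static InputStream skipLeadingWhitespace(InputStream in) throws IOException {
        // Wrap the stream if it does not support mark/reset itself.
        InputStream buffered = in.markSupported() ? in : new BufferedInputStream(in);
        while (true) {
            buffered.mark(1);              // remember the position before the next byte
            int b = buffered.read();
            if (b == -1) {
                return buffered;           // only whitespace (or nothing) in the stream
            }
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                buffered.reset();          // un-read the first significant byte
                return buffered;
            }
        }
    }
}
```

The returned stream could then be handed on for regular loading; for UTF-16 or UTF-32 input this sketch leaves the stream effectively unchanged, because the first byte is either a zero byte or is followed by one.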
Re: Ignoring whitespace before XML header [message #1800491 is a reply to message #1800488] Sun, 30 December 2018 12:19
Ed Willink
Hi

Since you have a custom Resource and a known bad source, you can probably arrange for your custom resource load to start by reading the bytes, trimming the prefix, and saving the file before any Eclipse tooling gets to see the corrupt file. You might even try registering a platform-level resource-change listener so that as soon as Eclipse refreshes the project, it runs the corrector. Or you might register a FixUpNature and FixUpBuilder that do similar corrections. Then you will see good files in text editors too.
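A corrector of this kind might look roughly like the sketch below. It uses only java.nio and assumes a single-byte, ASCII-compatible encoding; the class name XmlPrefixFixer and the method fixInPlace are made up for illustration, and how it would be called from a listener or builder is left open.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class XmlPrefixFixer {

    /**
     * Strips leading whitespace bytes from the file and saves it.
     * Returns true if the file was rewritten, false if it was already clean.
     * Assumption: whitespace is encoded as single ASCII bytes.
     */
    static boolean fixInPlace(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        int start = 0;
        while (start < bytes.length && isSpace(bytes[start])) {
            start++;
        }
        if (start == 0) {
            return false;                  // nothing to trim, leave the file untouched
        }
        Files.write(file, Arrays.copyOfRange(bytes, start, bytes.length));
        return true;
    }

    private static boolean isSpace(byte b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n';
    }
}
```

Because the method reports whether it changed anything, a caller can skip clean files and avoid triggering needless refreshes.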

But much simpler to not generate bad files in the first place. Can you apply pressure to the vendor?

Regards

Ed Willink
Re: Ignoring whitespace before XML header [message #1800495 is a reply to message #1800491] Sun, 30 December 2018 16:11
Ed Merks
That's the problem though: how to read the bytes and interpret them as whitespace (spaces, tabs, line feeds), when the first byte is not actually a < as it should be, without knowing the encoding, which is specified later in the document. I assume that's possible in principle; after all, the entire processing instruction is processed by SAX without prior knowledge of the encoding in order to parse the encoding. That no doubt benefits from the fact that only ASCII characters can occur, so it should be possible to recognize whether the bytes are a one-byte space (UTF-8 and various ISO Latin encodings), a two-byte space (UTF-16), or a four-byte space (UTF-32). There might be a BOM as well; that's actually allowed and ignored. An override of ResourceImpl.doLoad could trim such things in principle.
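The width-based recognition described above could be sketched like this. It is an untested-in-EMF illustration under stated assumptions: the data is already in a byte array, the code-unit width is guessed from the zero bytes around the first ASCII character, and a BOM (if present) is kept. The class EncodingAwareTrim and its methods are hypothetical names.

```java
final class EncodingAwareTrim {

    /** Length of a leading UTF-8/16/32 byte order mark, or 0 if there is none. */
    static int bomLength(byte[] d) {
        // The UTF-32 BOMs must be checked first: FF FE 00 00 starts with the UTF-16LE BOM.
        if (startsWith(d, 0x00, 0x00, 0xFE, 0xFF) || startsWith(d, 0xFF, 0xFE, 0x00, 0x00)) return 4;
        if (startsWith(d, 0xEF, 0xBB, 0xBF)) return 3;
        if (startsWith(d, 0xFE, 0xFF) || startsWith(d, 0xFF, 0xFE)) return 2;
        return 0;
    }

    /** Removes whitespace between the optional BOM and the first other character. */
    static byte[] trimLeadingWhitespace(byte[] d) {
        int bom = bomLength(d);
        int b0 = at(d, bom), b1 = at(d, bom + 1), b2 = at(d, bom + 2), b3 = at(d, bom + 3);
        int width;        // bytes per code unit: 1, 2 or 4
        int significant;  // offset of the non-zero byte within an ASCII code unit
        if (b0 == 0 && b1 == 0 && b2 == 0)      { width = 4; significant = 3; } // UTF-32BE
        else if (b1 == 0 && b2 == 0 && b3 == 0) { width = 4; significant = 0; } // UTF-32LE
        else if (b0 == 0)                       { width = 2; significant = 1; } // UTF-16BE
        else if (b1 == 0)                       { width = 2; significant = 0; } // UTF-16LE
        else                                    { width = 1; significant = 0; } // UTF-8, ISO Latin
        int pos = bom;
        while (pos + width <= d.length && isWhitespaceUnit(d, pos, width, significant)) {
            pos += width;
        }
        if (pos == bom) return d;          // nothing to trim
        byte[] out = new byte[bom + d.length - pos];
        System.arraycopy(d, 0, out, 0, bom);                 // keep the BOM
        System.arraycopy(d, pos, out, bom, d.length - pos);  // keep everything after the whitespace
        return out;
    }

    private static boolean isWhitespaceUnit(byte[] d, int pos, int width, int significant) {
        for (int i = 0; i < width; i++) {
            int b = d[pos + i] & 0xFF;
            if (i == significant) {
                if (b != 0x20 && b != 0x09 && b != 0x0A && b != 0x0D) return false;
            } else if (b != 0) {
                return false;
            }
        }
        return true;
    }

    private static boolean startsWith(byte[] d, int... prefix) {
        if (d.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((d[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    private static int at(byte[] d, int i) {
        return i < d.length ? d[i] & 0xFF : -1;
    }
}
```

A doLoad override could, in principle, read the stream into a byte array, pass it through trimLeadingWhitespace, and delegate to the normal loading with a ByteArrayInputStream over the result. Encodings that are not ASCII-compatible, such as EBCDIC, are not covered by this sketch.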