Ignoring whitespace before XML header [message #1800487] |
Sun, 30 December 2018 10:48 |
Andreas Graf Messages: 211 Registered: July 2009 |
Senior Member |
Hi all,
we have been creating an EMF model for an existing XML format produced by a commercial tool.
Everything works fine, except that the commercial tool creates whitespace in front of the XML header, which seems to violate the XML spec, and we see the following exception:
Exception in thread "main" org.eclipse.emf.ecore.resource.impl.ResourceSetImpl$1DiagnosticWrappedException: org.xml.sax.SAXParseExceptionpublicId: file:/C:/Z03_P-I/arwebinar/runtime-EclipseApplication/adtf/src/small.ddl.description; systemId: file:/C:/Z03_P-I/arwebinar/runtime-EclipseApplication/adtf/src/small.ddl.description; lineNumber: 2; columnNumber: 6; Verarbeitungsanweisungsziel, das "[xX][mM][lL]" entspricht, ist nicht zulässig. [English: The processing instruction target matching "[xX][mM][lL]" is not allowed.]
at org.eclipse.emf.ecore.resource.impl.ResourceSetImpl.handleDemandLoadException(ResourceSetImpl.java:319)
If we remove the whitespace so that the file immediately starts with <?xml version="1.0" encoding="iso-8859-1" standalone="no"?> it works fine.
However, since the number of XML files is quite large, it would be convenient for the users if we could avoid any (manual) preprocessing scripts.
Is anyone aware of a possible solution that we could integrate into our resource loading? (We do have custom resource and resource set implementations.)
Thanks,
Andreas
Re: Ignoring whitespace before XML header [message #1800495 is a reply to message #1800491] |
Sun, 30 December 2018 16:11 |
Ed Merks Messages: 33217 Registered: July 2009 |
Senior Member |
That's the problem though: how do you read the bytes and interpret them as spaces (or tabs, or line feeds) when the first character is not a < as it should be, without knowing the encoding, which is only specified later in the document? I assume that's possible in principle. After all, SAX processes the entire processing instruction without prior knowledge of the encoding in order to parse the encoding. That no doubt benefits from the fact that only ASCII characters can occur there, so it should be possible to recognize whether the bytes are a one-byte space (UTF-8 and the various ISO Latin encodings), a two-byte space (UTF-16), or a four-byte space (UTF-32). There might be a BOM as well; that's actually allowed and ignored. In principle, an override of ResourceImpl.doLoad could trim such things.
Ed Merks
Professional Support: https://www.macromodeling.com/
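
For illustration, the idea described above could be sketched roughly as follows. This is only a minimal sketch, not a tested EMF integration: the class name LeadingWhitespaceSkipper and its two methods are hypothetical, it uses only the JDK (Java 9 or later for readNBytes and readAllBytes), and it assumes the file uses UTF-8, an ISO Latin encoding, UTF-16 or UTF-32 (EBCDIC is not handled). It guesses the code unit width from a BOM or from the zero bytes around the first ASCII character, skips leading whitespace code units, and puts the BOM back in front if there was one.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;

/** Hypothetical helper: drops whitespace that precedes the first real character of an XML stream. */
public class LeadingWhitespaceSkipper {

    public static InputStream skipLeadingWhitespace(InputStream in) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in);
        buffered.mark(4);
        byte[] head = new byte[4];
        int n = buffered.readNBytes(head, 0, 4);
        buffered.reset();
        int h0 = head[0] & 0xFF, h1 = head[1] & 0xFF, h2 = head[2] & 0xFF, h3 = head[3] & 0xFF;

        int bomLength = 0;        // number of BOM bytes, 0 if there is no BOM
        int unit = 1;             // bytes per code unit: 1, 2 or 4
        boolean bigEndian = true; // only meaningful when unit > 1
        if (n == 4 && h0 == 0x00 && h1 == 0x00 && h2 == 0xFE && h3 == 0xFF) {
            bomLength = 4; unit = 4;                      // UTF-32BE BOM
        } else if (n == 4 && h0 == 0xFF && h1 == 0xFE && h2 == 0x00 && h3 == 0x00) {
            bomLength = 4; unit = 4; bigEndian = false;   // UTF-32LE BOM
        } else if (n >= 3 && h0 == 0xEF && h1 == 0xBB && h2 == 0xBF) {
            bomLength = 3;                                // UTF-8 BOM
        } else if (n >= 2 && h0 == 0xFE && h1 == 0xFF) {
            bomLength = 2; unit = 2;                      // UTF-16BE BOM
        } else if (n >= 2 && h0 == 0xFF && h1 == 0xFE) {
            bomLength = 2; unit = 2; bigEndian = false;   // UTF-16LE BOM
        } else if (n == 4 && h0 == 0 && h1 == 0 && h2 == 0 && h3 != 0) {
            unit = 4;                                     // ASCII character as UTF-32BE, no BOM
        } else if (n == 4 && h0 != 0 && h1 == 0 && h2 == 0 && h3 == 0) {
            unit = 4; bigEndian = false;                  // ASCII character as UTF-32LE, no BOM
        } else if (n >= 2 && h0 == 0 && h1 != 0) {
            unit = 2;                                     // ASCII character as UTF-16BE, no BOM
        } else if (n >= 2 && h0 != 0 && h1 == 0) {
            unit = 2; bigEndian = false;                  // ASCII character as UTF-16LE, no BOM
        }

        // Consume the BOM bytes; they are re-emitted in front of the result below.
        for (int i = 0; i < bomLength; i++) {
            buffered.read();
        }

        // Skip whole code units for as long as they encode space, tab, CR or LF.
        byte[] cu = new byte[unit];
        while (true) {
            buffered.mark(unit);
            int read = buffered.readNBytes(cu, 0, unit);
            if (read < unit || !isWhitespace(cu, bigEndian)) {
                buffered.reset(); // un-read the first non-whitespace code unit
                break;
            }
        }
        if (bomLength == 0) {
            return buffered;
        }
        return new SequenceInputStream(new ByteArrayInputStream(head, 0, bomLength), buffered);
    }

    /** True if the code unit is an XML whitespace character padded with zero bytes. */
    private static boolean isWhitespace(byte[] cu, boolean bigEndian) {
        int significant = bigEndian ? cu.length - 1 : 0;
        for (int i = 0; i < cu.length; i++) {
            if (i != significant && cu[i] != 0) {
                return false;
            }
        }
        int c = cu[significant] & 0xFF;
        return c == 0x20 || c == 0x09 || c == 0x0A || c == 0x0D;
    }

    /** Convenience for in-memory data; wraps the checked IOException. */
    public static byte[] strip(byte[] data) {
        try (InputStream in = skipLeadingWhitespace(new ByteArrayInputStream(data))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a custom resource, one would presumably wrap the incoming stream with skipLeadingWhitespace inside an override of doLoad and hand the result to the superclass implementation. That wiring is not shown here because it depends on the custom resource class in use.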