Eclipse Community Forums: EMF » Ignoring whitespace before XML header


Ignoring whitespace before XML header [message #1800487]

Sun, 30 December 2018 10:48

Andreas Graf

Messages: 211
Registered: July 2009

Senior Member

Hi all,

We have been creating an EMF model for existing XML files from a commercial tool.
Everything works fine, except that the commercial tool creates whitespace in front of the XML header, which seems to violate the XML spec, and we see the following exception:

Exception in thread "main" org.eclipse.emf.ecore.resource.impl.ResourceSetImpl$1DiagnosticWrappedException: org.xml.sax.SAXParseExceptionpublicId: file:/C:/Z03_P-I/arwebinar/runtime-EclipseApplication/adtf/src/small.ddl.description; systemId: file:/C:/Z03_P-I/arwebinar/runtime-EclipseApplication/adtf/src/small.ddl.description; lineNumber: 2; columnNumber: 6; The processing instruction target matching "[xX][mM][lL]" is not allowed.
at org.eclipse.emf.ecore.resource.impl.ResourceSetImpl.handleDemandLoadException(ResourceSetImpl.java:319)

If we remove the whitespace so that the file immediately starts with <?xml version="1.0" encoding="iso-8859-1" standalone="no"?> it works fine.
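The behaviour described above can be reproduced without EMF, using only the XML parser that ships with the JDK. The following is a minimal sketch; the class name `LeadingWhitespaceDemo` and the method `parses` are hypothetical names chosen for illustration, and the sample documents are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of EMF: checks whether the JDK's default
// XML parser accepts a document given as a string.
class LeadingWhitespaceDemo {

    /** Returns true if the text parses as XML, false if the parser rejects it. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.ISO_8859_1)));
            return true;
        } catch (SAXException e) {
            // SAXParseException is a subclass of SAXException.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `parses` with a document that starts directly with `<?xml` should succeed, while the same document with a line break in front of the declaration should be rejected.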

However, since the number of XML files is quite large, it would be convenient for the users if we could avoid any (manual) preprocessing scripts.

Is anyone aware of a possible solution that we could integrate into our resource loading? (We do have custom resource and resource set implementations.)

Thanks,

Andreas


Re: Ignoring whitespace before XML header [message #1800488 is a reply to message #1800487]

Sun, 30 December 2018 11:04

Ed Merks

Messages: 33217
Registered: July 2009

Senior Member

I'm surprised that any tool would create invalid XML. No SAX processor will/can process such a file so I'm not sure how such a tool itself can read such a corrupt thing.

I can't imagine a clean way to deal with such a thing. While you might use an input stream that supports mark and reset to skip leading white space before forwarding the stream for regular loading, you don't know the encoding being used so you won't be able to properly decode the leading white space. That's why an XML file must start with < so that the XML processor can determine the encoding (e.g., UTF-8/16/32) while processing the <?xml processing instruction itself.
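The mark-and-reset idea mentioned above could look roughly like the following sketch. It deliberately sidesteps the encoding problem by assuming a single-byte, ASCII-compatible encoding (such as the iso-8859-1 declared in the original poster's file); it would not work for UTF-16 or UTF-32. The class and method names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: ASSUMES a single-byte, ASCII-compatible encoding.
class LeadingWhitespaceSkipper {

    /**
     * Wraps the stream and consumes leading space, tab, CR and LF bytes.
     * The returned stream is positioned at the first other byte (or at EOF).
     */
    static InputStream skipLeadingWhitespace(InputStream in) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in);
        while (true) {
            buffered.mark(1);            // remember the position before the next byte
            int b = buffered.read();
            if (b == -1) {
                return buffered;         // only whitespace (or nothing) in the stream
            }
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                buffered.reset();        // un-read the first non-whitespace byte
                return buffered;
            }
        }
    }
}
```

The returned stream could then be forwarded to the regular loading code in place of the original one.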

Ed Merks
Professional Support: https://www.macromodeling.com/


Re: Ignoring whitespace before XML header [message #1800491 is a reply to message #1800488]

Sun, 30 December 2018 12:19

Ed Willink

Messages: 7670
Registered: July 2009

Senior Member

Hi

Since you have a custom Resource and a known bad source, you can probably arrange for your custom resource load to start by reading the bytes, trimming the prefix and saving the result before any Eclipse tooling gets to see the corrupt file. You might even try registering a platform-level resource-change listener so that as soon as Eclipse refreshes the project, it runs the corrector. Or you might register a FixUpNature and FixUpBuilder that do similar corrections. Then you will see good files in text editors too.

But much simpler to not generate bad files in the first place. Can you apply pressure to the vendor?
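The read, trim and save step could be sketched as follows with plain `java.nio.file` calls. This is only an illustration under the assumption that the files use a single-byte, ASCII-compatible encoding; the class name `XmlFileFixer` is hypothetical, and hooking it into a resource-change listener or builder is not shown.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical corrector: ASSUMES a single-byte, ASCII-compatible encoding.
class XmlFileFixer {

    /**
     * Removes leading space, tab, CR and LF bytes from the file in place.
     * Returns true if the file was rewritten, false if it was already clean.
     */
    static boolean trimLeadingWhitespace(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        int start = 0;
        while (start < bytes.length && isAsciiWhitespace(bytes[start])) {
            start++;
        }
        if (start == 0) {
            return false;                // nothing to trim, leave the file untouched
        }
        Files.write(file, Arrays.copyOfRange(bytes, start, bytes.length));
        return true;
    }

    private static boolean isAsciiWhitespace(byte b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n';
    }
}
```

Because the file is only rewritten when a prefix was actually found, running the corrector repeatedly on an already clean file changes nothing.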

Regards

Ed Willink


Re: Ignoring whitespace before XML header [message #1800495 is a reply to message #1800491]

Sun, 30 December 2018 16:11

Ed Merks

Messages: 33217
Registered: July 2009

Senior Member

That's the problem though: how to read the bytes and interpret them as spaces (tabs, line feeds), if it's not actually a < as it should be, without knowing the encoding, which is specified later in the document. I assume that's possible in principle; after all, the entire processing instruction is processed by SAX without prior knowledge of the encoding in order to parse the encoding. That no doubt benefits from the fact that only ASCII characters can occur, so it should be possible to recognize whether the bytes are a one-byte space (UTF-8 and various ISO Latin encodings), a 2-byte space (UTF-16), or a 4-byte space (UTF-32). There might be a BOM as well; that's actually allowed and ignored. An override of ResourceImpl.doLoad could trim such things in principle.
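A rough sketch of that idea, working on a byte array: look for a BOM first, otherwise guess the code-unit width from where the zero bytes fall in the first few bytes (which works only because the first character must be ASCII, either whitespace or <), then drop whitespace code units of that width. The class `XmlPrologTrimmer` and its methods are hypothetical names, the guessing heuristic is an assumption rather than anything EMF or SAX provides, and other encodings such as EBCDIC are ignored.

```java
// Hypothetical helper: encoding-aware removal of leading whitespace.
class XmlPrologTrimmer {

    /** Returns the data without leading whitespace; a BOM, if present, is kept. */
    static byte[] trimLeadingWhitespace(byte[] data) {
        int bom = 0;               // length of the byte order mark, if any
        int width = 1;             // bytes per code unit
        boolean bigEndian = true;  // irrelevant when width == 1
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) { bom = 3; }
        else if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) { bom = 4; width = 4; }
        else if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) { bom = 4; width = 4; bigEndian = false; }
        else if (startsWith(data, 0xFE, 0xFF)) { bom = 2; width = 2; }
        else if (startsWith(data, 0xFF, 0xFE)) { bom = 2; width = 2; bigEndian = false; }
        // No BOM: the first character is ASCII, so the zero bytes reveal the width.
        else if (data.length >= 4 && data[0] == 0 && data[1] == 0) { width = 4; }
        else if (data.length >= 4 && data[1] == 0 && data[2] == 0 && data[3] == 0) { width = 4; bigEndian = false; }
        else if (data.length >= 2 && data[0] == 0) { width = 2; }
        else if (data.length >= 2 && data[1] == 0) { width = 2; bigEndian = false; }

        int pos = bom;
        while (pos + width <= data.length && isWhitespaceUnit(data, pos, width, bigEndian)) {
            pos += width;
        }
        if (pos == bom) {
            return data;           // nothing to trim
        }
        byte[] result = new byte[bom + data.length - pos];
        System.arraycopy(data, 0, result, 0, bom);                      // keep the BOM
        System.arraycopy(data, pos, result, bom, data.length - pos);    // keep the rest
        return result;
    }

    /** True if the code unit at pos encodes space, tab, LF or CR. */
    private static boolean isWhitespaceUnit(byte[] data, int pos, int width, boolean bigEndian) {
        int significant = bigEndian ? pos + width - 1 : pos;
        for (int i = pos; i < pos + width; i++) {
            if (i != significant && data[i] != 0) {
                return false;      // padding bytes of an ASCII character must be zero
            }
        }
        byte b = data[significant];
        return b == 0x20 || b == 0x09 || b == 0x0A || b == 0x0D;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A doLoad override could, under these assumptions, read the stream into a byte array, pass it through such a helper, and hand a ByteArrayInputStream over the result to the regular loading code.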

Ed Merks
Professional Support: https://www.macromodeling.com/

