UTF-8 encoding problem in output [message #1828140] Tue, 02 June 2020 13:19
Gunnar Arndt
The input model (stored in a UTF-8 encoded XMI resource) to my ATL transformation contains string attributes whose values may include characters from a 'special' font called Symbol:
<message key="en" value="Resistor ( &lt;font face='Symbol'>...&lt;/font> )"/>

The three dots are shown as an Omega (for the physical unit Ohm of a resistor) in the application from which the data originates; other, less obvious characters can occur as well. A hex editor shows that the character is stored as the bytes EF 81 97.
After transformation, the XML output file contains the bytes EF 3F 97 instead, which cannot be processed by any of the tested editors. The value string is just copied by the transformation, not modified.
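The byte pattern described above can be examined in isolation. The following is a minimal sketch, not a diagnosis of ATL itself: it decodes EF 81 97 as UTF-8 to see which code point is stored, and then tests one possible explanation for the damage, namely that the bytes are read and written back in a single-byte charset somewhere along the way. The class name EncodingProbe, its helper methods, and the choice of windows-1252 as the suspect charset are assumptions made for illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingProbe {
    /** Decodes the bytes as UTF-8 and returns the first code point. */
    static int firstCodePoint(final byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    /** Reads the bytes in the given charset and writes them back in the same charset. */
    static byte[] roundTrip(final byte[] bytes, final Charset charset) {
        return new String(bytes, charset).getBytes(charset);
    }

    /** Formats bytes as space-separated upper-case hex, e.g. "EF 81 97". */
    static String hex(final byte[] bytes) {
        final StringBuilder sb = new StringBuilder();
        for (final byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        final byte[] stored = {(byte) 0xEF, (byte) 0x81, (byte) 0x97};
        // Which code point do the stored bytes encode?
        System.out.printf("U+%04X%n", firstCodePoint(stored));
        // Assumption: a windows-1252 round trip somewhere in the tool chain.
        System.out.println(hex(roundTrip(stored, Charset.forName("windows-1252"))));
    }
}
```

On a standard JDK this should print U+F057 (a private-use code point) followed by EF 3F 97, because 0x81 has no mapping in windows-1252 and is written back as '?' (0x3F). That matches the corrupted output, which suggests, but does not prove, that some step handles the data in a platform default single-byte charset instead of UTF-8.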
The ATL transformation has been executed as an ATL Plug-in from Java code; UTF-8 is explicitly set as the output encoding:
extractor.extract(outModel, outModelPath, Collections.<String, Object> singletonMap(XMLResource.OPTION_ENCODING, "UTF-8"));

An XML (not XMI) resource is used for the output. I use ATL 4.2.0, EMFVM, atlcompiler atl2010.
What happens during the ATL transformation to destroy the character? How can I fix it?
Thank you for your help.
Re: UTF-8 encoding problem in output [message #1828368 is a reply to message #1828140] Mon, 08 June 2020 14:57
Gunnar Arndt
I managed to work around the issue: while creating the input model, I replace each affected Unicode character (from a customer-specific subset) with an HTML character reference built from its code point:

    // Imports needed at the top of the enclosing file:
    // import java.util.regex.Matcher;
    // import java.util.regex.Pattern;

    // Names of the regex groups: opening font tag, the single character, closing font tag.
    private static final String FONT_CHARACTER_REGEX_PREFIX_GROUP_NAME = "prefix";
    private static final String FONT_CHARACTER_REGEX_CHAR_GROUP_NAME = "char";
    private static final String FONT_CHARACTER_REGEX_SUFFIX_GROUP_NAME = "suffix";
    private static final String FONT_CHARACTER_REGEX =
        String.format("(?<%s><font face='[\\w ]+?'>)(?<%s>.)(?<%s></font>)", FONT_CHARACTER_REGEX_PREFIX_GROUP_NAME,
            FONT_CHARACTER_REGEX_CHAR_GROUP_NAME, FONT_CHARACTER_REGEX_SUFFIX_GROUP_NAME);
    private static final Pattern FONT_CHARACTER_PATTERN = Pattern.compile(FONT_CHARACTER_REGEX);

    /**
     * Certain FontCharacters may be replaced by the later ATL transformation. This is
     * avoided by referring to them by their Unicode code point instead.
     * 
     * @param input
     *        A string to check for FontCharacter substrings.
     * @return An almost identical string, in which every FontCharacter has been replaced by an HTML
     *         reference to its code point.
     */
    protected static String fontCharacterToCodePoint(final String input) {
        final StringBuffer output = new StringBuffer();
        final Matcher matcher = FONT_CHARACTER_PATTERN.matcher(input);
        while (matcher.find()) {
            final int fontCharacterStart = matcher.start(FONT_CHARACTER_REGEX_CHAR_GROUP_NAME);
            final int fontCharacterCodePoint = Character.codePointAt(input, fontCharacterStart);
            final String replacement = String.format("${%s}&#x%04X${%s}", FONT_CHARACTER_REGEX_PREFIX_GROUP_NAME,
                fontCharacterCodePoint, FONT_CHARACTER_REGEX_SUFFIX_GROUP_NAME);
            matcher.appendReplacement(output, replacement);
        }
        return matcher.appendTail(output).toString();
    }
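
To illustrate what the workaround does to the example from the first post, here is a self-contained sketch. It assumes the in-memory attribute value contains the literal markup <font face='Symbol'>...</font> (after the XML parser has resolved &lt;) and that the Symbol character is U+F057, which is what the bytes EF 81 97 decode to as UTF-8. The class name FontCharacterDemo is hypothetical, and the pattern is the same as above, only written out without String.format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FontCharacterDemo {
    // Same pattern as in the post above, with the group names written inline.
    private static final Pattern FONT_CHARACTER_PATTERN =
        Pattern.compile("(?<prefix><font face='[\\w ]+?'>)(?<char>.)(?<suffix></font>)");

    static String fontCharacterToCodePoint(final String input) {
        final StringBuffer output = new StringBuffer();
        final Matcher matcher = FONT_CHARACTER_PATTERN.matcher(input);
        while (matcher.find()) {
            // Code point of the single character between the font tags.
            final int codePoint = Character.codePointAt(input, matcher.start("char"));
            // ${prefix} and ${suffix} copy the matched tags back unchanged.
            matcher.appendReplacement(output,
                String.format("${prefix}&#x%04X${suffix}", codePoint));
        }
        return matcher.appendTail(output).toString();
    }

    public static void main(final String[] args) {
        final String value = "Resistor ( <font face='Symbol'>\uF057</font> )";
        // prints: Resistor ( <font face='Symbol'>&#xF057</font> )
        System.out.println(fontCharacterToCodePoint(value));
    }
}
```

The result is pure ASCII, so it passes through the transformation whatever charset is applied on the way. Note that the reference is emitted without a trailing ';', exactly as in the code above; a consumer that requires strictly well-formed character references may need one appended.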