Importing Word content [message #51108]
Wed, 18 June 2008 07:33
Eclipse User
Hi again,
I have been experimenting with ways to automate, or at least improve,
importing content from Word into EPF. One issue is how to clean up the
HTML that Word produces. I have built a small converter that runs HTML
Tidy (in batch mode) with a configuration file:
tidy -config configWordClean.txt -f errors.txt -m [filename].htm
---
// sample config file for HTML tidy
indent: auto
indent-spaces: 2
wrap: 72
word-2000: yes
clean: yes
markup: yes
output-xml: yes
input-xml: no
doctype: omit
show-warnings: yes
numeric-entities: yes
quote-marks: yes
quote-nbsp: yes
quote-ampersand: no
break-before-br: no
uppercase-tags: no
uppercase-attributes: no
char-encoding: latin1
---
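A minimal sketch of how the Tidy call above could be batched over a folder of Word exports. It is not part of the original converter: the function names and the per-file error log naming are made up for illustration, and a `tidy` executable is assumed to be on the PATH. Tidy exits with 0 on success, 1 on warnings and 2 on errors, so any status below 2 is treated as usable output here.

```python
import subprocess
from pathlib import Path


def build_tidy_command(html_file, config="configWordClean.txt", errors="errors.txt"):
    """Return the argument list for one in-place Tidy run (same flags as above)."""
    return ["tidy", "-config", str(config), "-f", str(errors), "-m", str(html_file)]


def tidy_folder(folder, config="configWordClean.txt"):
    """Run Tidy on every .htm file in `folder`; return the files that were cleaned."""
    cleaned = []
    for html_file in sorted(Path(folder).glob("*.htm")):
        # Hypothetical naming: one error log per input, so runs do not overwrite each other.
        errors = html_file.with_name(html_file.stem + ".errors.txt")
        result = subprocess.run(build_tidy_command(html_file, config, errors))
        # Exit status 0 = ok, 1 = warnings only, 2 = errors.
        if result.returncode < 2:
            cleaned.append(html_file)
    return cleaned
```

Calling `tidy_folder("export")` would then modify each `.htm` file in place, exactly as the single command does with `-m`.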
The result is only a starting point. In a second step, I use a
custom-made stylesheet, WordTidy.xslt, to filter what remains into markup
better suited to EPF.
---
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    exclude-result-prefixes="fn xsl xs">
  <xsl:output method="xhtml" encoding="ISO-8859-1" indent="yes"/>

  <xsl:template match="/">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Unwrap html, drop head, keep body -->
  <xsl:template match="html">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="head">
  </xsl:template>
  <xsl:template match="body">
    <body>
      <xsl:apply-templates/>
    </body>
  </xsl:template>

  <!-- Unwrap div, keeping its content -->
  <xsl:template match="div">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Tables: keep only the basic layout attributes -->
  <xsl:template match="table">
    <table width="{@width}" border="{@border}" cellspacing="{@cellspacing}" cellpadding="{@cellpadding}">
      <xsl:apply-templates/>
    </table>
  </xsl:template>
  <xsl:template match="tbody">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Drop elements that have a single-space text node, and spans of class c1 -->
  <xsl:template match="*[text() = ' ' ]"/>
  <xsl:template match="span[@class='c1']"/>

  <xsl:template match="tr">
    <tr>
      <xsl:apply-templates/>
    </tr>
  </xsl:template>
  <!-- Both td and th are written out as td -->
  <xsl:template match="td | th">
    <td width="{@width}" valign="{@valign}">
      <xsl:apply-templates/>
    </td>
  </xsl:template>

  <!-- Headings: h1 to h3 all become h3; h4 and h5 are kept -->
  <xsl:template match="h1 | h2 | h3">
    <h3>
      <xsl:apply-templates/>
    </h3>
  </xsl:template>
  <xsl:template match="h4">
    <h4>
      <xsl:apply-templates/>
    </h4>
  </xsl:template>
  <xsl:template match="h5">
    <h5>
      <xsl:apply-templates/>
    </h5>
  </xsl:template>

  <!-- Unwrap remaining spans; rebuild p without attributes -->
  <xsl:template match="span">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="p">
    <p>
      <xsl:apply-templates/>
    </p>
  </xsl:template>

  <!-- Images: point src at resources/ plus whatever follows the first '/' -->
  <xsl:template match="img">
    <img width="{@width}" height="{@height}">
      <xsl:attribute name="src" select="concat('resources/', substring-after(@src, '/'))"/>
      <xsl:apply-templates/>
    </img>
  </xsl:template>

  <!-- Drop line breaks -->
  <xsl:template match="br"/>

  <xsl:template match="ul">
    <ul>
      <xsl:apply-templates/>
    </ul>
  </xsl:template>
  <xsl:template match="li">
    <li>
      <xsl:apply-templates/>
    </li>
  </xsl:template>
</xsl:stylesheet>
---
Notice how I enforce a rule whereby all images are located in a
resources/ folder relative to the HTML (within the EPF project structure).
A further step is then to copy the images into that folder so the
references are valid!
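A minimal sketch of that image-copying step, under assumptions that are not in the original post: the `src` values are plain relative paths as Word wrote them (for example `doc_files/image001.gif`), and the function names and directory arguments are hypothetical. `rewritten_src` mirrors the stylesheet's `concat('resources/', substring-after(@src, '/'))`, including the fact that `substring-after` yields an empty string when `src` contains no `/`.

```python
import shutil
from pathlib import Path


def rewritten_src(src):
    """Mirror the stylesheet: concat('resources/', substring-after(@src, '/'))."""
    # str.partition leaves the tail empty when '/' is missing, like substring-after().
    _, _, tail = src.partition("/")
    return "resources/" + tail


def copy_images(srcs, word_export_dir, html_dir):
    """Copy each original image to the place its rewritten src will point to."""
    copied = []
    for src in srcs:
        new_src = rewritten_src(src)
        source = Path(word_export_dir) / src
        # Skip srcs that rewrite to a bare folder, and images that are not there.
        if new_src.endswith("/") or not source.is_file():
            continue
        target = Path(html_dir) / new_src
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        copied.append(new_src)
    return copied
```

Anything the function skips still needs manual attention, since the rewritten reference in the HTML would otherwise be broken.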
---
The above technique really helps a lot, but I would like to automate it
even more. Why should I have to manually insert the HTML into the RTE each
time? Why not use the EPF API to do this more directly?
Has any work been done towards this goal already?
Kristian