TEI stylesheet for converting Word docx files to TEI
This library is free software; you can redistribute it and/or modify it under
the terms of the GNU Lesser General Public License as published by the Free Software
Foundation; either version 2.1 of the License, or (at your option) any later version.
This library is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. You
should have received a copy of the GNU Lesser General Public License along with this
library; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite
330, Boston, MA 02111-1307 USA
The main template that starts the conversion from docx to TEI
IMPORTING STYLESHEETS AND OVERRIDING MATCHED TEMPLATES:
When a stylesheet is imported (xsl:import), all templates in the imported stylesheet
get a lower import precedence than those in the importing stylesheet. Suppose the importing
stylesheet wants to override a general template that matches all <w:p> elements for which
no more specialized rule applies. It cannot do so with a match template, because that
template would also override every w:p[somepredicate] template in the imported stylesheet.
For this reason the processing of the general template is moved into a named template,
and the match template in the imported stylesheet does nothing but call it. The importing
stylesheet can then override just the named template and leave the specialized rules intact.
See templates:
- w:p (mode: paragraph)
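The pattern described above can be sketched with two small stylesheets. This is an illustration only, not the code of this library: the file names base.xsl and custom.xsl, the style name 'Quote', the output elements, and the named template name paragraph-wp are all assumptions made for the example.

```xml
<!-- base.xsl: the imported stylesheet (hypothetical file name) -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

  <!-- General rule: does nothing itself, only delegates to a named template -->
  <xsl:template match="w:p" mode="paragraph">
    <xsl:call-template name="paragraph-wp"/>
  </xsl:template>

  <!-- Specialized rule; the style name 'Quote' is only an example -->
  <xsl:template match="w:p[w:pPr/w:pStyle/@w:val='Quote']" mode="paragraph">
    <quote><xsl:value-of select="."/></quote>
  </xsl:template>

  <!-- Default behaviour of the general rule (hypothetical template name) -->
  <xsl:template name="paragraph-wp">
    <p><xsl:value-of select="."/></p>
  </xsl:template>
</xsl:stylesheet>
```

```xml
<!-- custom.xsl: the importing stylesheet (hypothetical file name) -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:import href="base.xsl"/>

  <!-- Overrides only the named template. The w:p[...] rule in base.xsl
       still wins for quoted paragraphs, because no match template with
       a higher import precedence competes with it. -->
  <xsl:template name="paragraph-wp">
    <ab><xsl:value-of select="."/></ab>
  </xsl:template>
</xsl:stylesheet>
```

Had custom.xsl declared match="w:p" mode="paragraph" instead, its higher import precedence would have shadowed the 'Quote' rule as well, which is the problem the named template avoids.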
Modes:
part0:
a normalization process for styles. Can also detect illegal styles.
part2:
templates that are relevant in the second stage of the conversion are
defined in mode "part2"
inSectionGroup:
Defines a template that works on a group of consecutive elements (w:p or w:tbl
elements) that form a section (a normal section, not to be confused with w:sectPr).
paragraph:
Defines that the template works on an individual element (usually starting with a
w:p element).
<xsl:template match="/">
  <!-- Do an initial normalization and store everything in $part0 -->
  <xsl:variable name="part0">
    <xsl:apply-templates mode="part0"/>
  </xsl:variable>
  <!-- Do the main transformation and store everything in the variable part1 -->
  <xsl:variable name="part1">
    <xsl:for-each select="$part0">
      <xsl:apply-templates/>
    </xsl:for-each>
  </xsl:variable>
  <!-- Do the final parse and create valid TEI -->
  <xsl:apply-templates select="$part1" mode="part2"/>
  <xsl:call-template name="fromDocxFinalHook"/>
</xsl:template>
<xsl:template match="w:document">
  <TEI>
    <!-- create teiHeader -->
    <xsl:call-template name="create-tei-header"/>
    <!-- convert main and back matter -->
    <xsl:apply-templates select="w:body"/>
  </TEI>
</xsl:template>
<xsl:template match="w:body">
  <text>
    <!-- Create forme work -->
    <xsl:call-template name="extract-forme-work"/>
    <!-- create TEI body -->
    <body>
      <!-- group all paragraphs that form a first level section. -->
      <xsl:for-each-group select="w:sdt|w:p|w:tbl"
                          group-starting-with="w:p[teidocx:is-firstlevel-heading(.)]">
        <xsl:choose>
          <!-- We are dealing with a first level section, we now have
               to further divide the section into subsections that we can then
               finally work on -->
          <xsl:when test="teidocx:is-heading(.)">
            <xsl:call-template name="group-by-section"/>
          </xsl:when>
          <!-- We have found some loose paragraphs. These are most probably
               front matter paragraphs. We can simply convert them without further
               trying to split them up into sub sections. -->
          <xsl:otherwise>
            <xsl:apply-templates select="." mode="inSectionGroup"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
      <!-- I have no idea why I need this, but I apparently do.
           //TODO: find out what is going on -->
      <xsl:apply-templates select="w:sectPr" mode="paragraph"/>
    </body>
  </text>
</xsl:template>
There are certain elements, that we don't really care about, but that
force us to regroup everything from the next sibling on.
@see grouping in construction of headline outline.
Grouping consecutive elements that belong together
We are now working on a group of all elements inside some group bounded by
headings. These need to be split further into smaller groups for figures,
lists, etc., and into individual groups for simple paragraphs.
<xsl:template match="w:tbl|w:p" mode="inSectionGroup">
  <!--
       We are looking for:
       - Lists -> 1
       - Table of Contents -> 2
       Anything else is assigned a number of position()+100. This should be
       sufficient even if we find lots more things to group.
  -->
  <xsl:for-each-group select="current-group()"
                      group-adjacent="if (contains(w:pPr/w:pStyle/@w:val,'List')) then 1 else if (starts-with(w:pPr/w:pStyle/@w:val,'toc')) then 2 else position() + 100">
    <!-- For each defined grouping call a specific template. If there is no
         grouping defined, apply templates with mode paragraph -->
    <xsl:choose>
      <xsl:when test="current-grouping-key()=1">
        <xsl:call-template name="listSection"/>
      </xsl:when>
      <xsl:when test="current-grouping-key()=2">
        <xsl:call-template name="tocSection"/>
      </xsl:when>
      <!-- it is not a defined grouping .. apply templates -->
      <xsl:otherwise>
        <xsl:apply-templates select="." mode="paragraph"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each-group>
</xsl:template>
Looks through the document to find forme work related sections.
Creates a <fw> element for each forme work related section. These include
running headers and footers. The corresponding elements in OOXML are w:headerReference
and w:footerReference. These elements only hold a reference to a header or
footer definition file. The reference itself is resolved in the file word/_rels/document.xml.rels.
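As a rough illustration of the lookup described above, the following sketch resolves each reference through the relationships file and emits one <fw> per header or footer. It is not the library's extract-forme-work template: the template name, the word-directory-uri parameter, and the choice to copy only the plain text of the referenced part are assumptions made for the example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    xmlns:rel="http://schemas.openxmlformats.org/package/2006/relationships"
    xmlns="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="w r rel">

  <!-- Assumption: absolute URI of the unpacked word/ directory, ending in '/' -->
  <xsl:param name="word-directory-uri"/>

  <!-- Hypothetical template; call it with a node of word/document.xml as context -->
  <xsl:template name="extract-forme-work-sketch">
    <!-- The relationships file maps each r:id to a target file -->
    <xsl:variable name="rels"
        select="doc(concat($word-directory-uri, '_rels/document.xml.rels'))"/>
    <xsl:for-each select="//w:headerReference | //w:footerReference">
      <xsl:variable name="id" select="@r:id"/>
      <xsl:variable name="target"
          select="string($rels//rel:Relationship[@Id = $id]/@Target)"/>
      <!-- Skip references that cannot be resolved -->
      <xsl:if test="$target != ''">
        <fw type="{if (self::w:headerReference) then 'header' else 'footer'}">
          <!-- Targets are relative to word/; only the text content is copied -->
          <xsl:value-of select="doc(concat($word-directory-uri, $target))"/>
        </fw>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

A real conversion would normally process the paragraphs of the header or footer part rather than flatten them to text; the sketch only shows the two-step resolution from w:headerReference or w:footerReference to the definition file.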