Schema Harvesting for Conversion of Variant Text Corpora
(Brian L. Pytlik Zillig)
One of the aims of the MONK Project is to aggregate various humanities text collections into a common form to enable large-scale text analysis. A custom subset of TEI's P5, which we call TEI-Analytics (TEI-A), constitutes that common form. TEI-A contains up to 122 valid elements. Because encoding practices diverge widely between text collections (indeed, some are not in XML at all) and also vary within collections, converting files so that they may interoperate is a non-trivial venture. Refashioning hundreds or thousands of novels and plays into TEI-A requires a programmatic approach that can be performed as a batch operation.
Our basic approach to the problem involves schema harvesting. The TEI Consortium's Roma tool was used to create a base XML schema for TEI P5 documents, which we then extended using a custom ODD file. The schema contains the document logic for files that validate under TEI-A. Because this logic exists in an XML file, it can be queried and exposed via XSLT.
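The idea of querying a schema for its document logic can be sketched briefly. The following is a minimal illustration in Python rather than XSLT, and it assumes a RELAX NG serialization of the schema. The toy schema fragment, the `harvest` function and the `allowed` mapping are hypothetical names introduced here, not part of the MONK code.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"

# Toy RELAX NG fragment standing in for the real TEI-A schema (an assumption).
SCHEMA = """\
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <define name="p">
    <element name="p">
      <optional><attribute name="n"/></optional>
      <text/>
    </element>
  </define>
  <define name="div">
    <element name="div">
      <attribute name="type"/>
      <zeroOrMore><ref name="p"/></zeroOrMore>
    </element>
  </define>
</grammar>
"""


def harvest(schema_xml):
    """Map each declared element name to the attribute names declared inside it.

    Simplification: <ref> patterns are not resolved, so attributes defined
    in shared attribute classes would be missed by this sketch.
    """
    root = ET.fromstring(schema_xml)
    allowed = {}
    for element in root.iter(RNG + "element"):
        attrs = {a.get("name") for a in element.iter(RNG + "attribute")}
        allowed.setdefault(element.get("name"), set()).update(attrs)
    return allowed


allowed = harvest(SCHEMA)
```

Here `allowed` maps `p` to the attribute `n` and `div` to the attribute `type`. A lookup table of this kind is the sort of information a conversion step can consult.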
With the TEI-A schema in place, an XSLT "meta-stylesheet" (MonkMetaStylesheet.xsl) consults the schema to determine the form into which the source files may be converted. This initial XSLT stylesheet is a meta-stylesheet in the sense that it programmatically authors another XSLT stylesheet. The first stylesheet, at roughly 500 lines of XSLT, is fairly short. The second is very long, around 7,000 lines. This second stylesheet (XMLtoMonkXML.xsl) contains the conversion instructions to get from P4, or other markup, to the TEI-A custom P5 implementation.
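To make the notion of one program authoring a stylesheet concrete, here is a small sketch. It is not MonkMetaStylesheet.xsl: the generator is written in Python instead of XSLT, the `build_stylesheet` function is hypothetical, and the generated XSLT 1.0 ignores namespaces and does not check attributes against a schema.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def build_stylesheet(allowed_elements):
    """Return XSLT 1.0 source that copies allowed elements and unwraps the rest."""
    if not allowed_elements:
        raise ValueError("at least one allowed element is needed")
    match = " | ".join(sorted(allowed_elements))
    return f"""<xsl:stylesheet version="1.0" xmlns:xsl="{XSL_NS}">
  <!-- Elements harvested from the schema: keep the tag, process the content. -->
  <xsl:template match="{match}">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Any other element: drop the tag but keep descending, so text survives. -->
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>
  <!-- Attributes are copied unchanged in this sketch (no schema check). -->
  <xsl:template match="@*">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
"""


stylesheet = build_stylesheet({"p", "div"})
# The generated text should at least be well-formed XML.
templates = ET.fromstring(stylesheet).findall(f"{{{XSL_NS}}}template")
```

The named-element template takes precedence over the `*` template under XSLT's default priority rules, so only elements outside the harvested list lose their tags. A real generator would emit one rule per element and attribute, which may explain why a short meta-stylesheet can produce a very long output stylesheet.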
The basic principles of conversion are fairly uncomplicated. Some elements in the TEI header are discarded to keep things simple. In the text, elements that are permitted in TEI-A are passed through. Elements that are not permitted have their tags stripped, and their text nodes are passed through. For each element that passes through the transformation, each attribute is reviewed against the TEI-A schema. Attributes that are not permitted are removed, and those that are required are added. A key point is that, apart from the header information, no text nodes are removed.
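The rules in this paragraph can be illustrated with a short sketch, again in Python rather than the project's XSLT. The `RULES` table, the default value `unspecified` and the function names are hypothetical. The sketch also assumes that the root element of the input is itself permitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical harvested rules: element -> (permitted attributes, required defaults).
RULES = {
    "text": (set(), {}),
    "div": ({"type"}, {"type": "unspecified"}),
    "p": ({"n"}, {}),
}


def _append_text(parent, text):
    """Attach text at the current end of parent's content."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def _convert_into(source, parent):
    """Copy source into parent, keeping its tag only if RULES permits it."""
    if source.tag in RULES:
        permitted, required = RULES[source.tag]
        attrs = {k: v for k, v in source.attrib.items() if k in permitted}
        for name, default in required.items():
            attrs.setdefault(name, default)  # add required attributes if missing
        target = ET.SubElement(parent, source.tag, attrs)
    else:
        target = parent  # tag stripped; content flows into the kept ancestor
    _append_text(target, source.text)
    for child in source:
        _convert_into(child, target)
        _append_text(target, child.tail)


def convert(xml_source):
    """Convert a document whose root element is permitted by RULES."""
    root = ET.fromstring(xml_source)
    if root.tag not in RULES:
        raise ValueError("this sketch assumes a permitted root element")
    holder = ET.Element("holder")
    _convert_into(root, holder)
    return holder[0]


SOURCE = '<text><div rend="x"><p n="1" rend="i">Call me <hi>Ishmael</hi>.</p><pb/></div></text>'
result = convert(SOURCE)
converted = ET.tostring(result, encoding="unicode")
```

In this example the `hi` and `pb` tags and the `rend` attributes are dropped, a `type` attribute is added to `div`, and the sentence text is preserved in full.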
Elements that are not needed for analysis are removed or re-named according to the requirements of MONK (for example, numbered <div>s are replaced with un-numbered <div>s). Any special mappings, or additions/subtractions, that are desired are added to the MonkMetaStylesheet.xsl file, and take this form:
Converted files are parsed to determine whether or not they are valid under the TEI-A schema. Use of this technique has resulted in the successful conversion into TEI-A of Text Creation Partnership files, as well as files from the Wright American Fiction, Eighteenth Century Fiction and Nineteenth Century Fiction collections.