Automagical conversion notes
These notes describe one way of converting Word documents into a well marked-up TEI XML corpus.
You will need the following tools:
- A copy of Open Office, a well-known open source replacement for Microsoft Office
- The TEI Open Office filters
- Access to a suitable online tagger: we used the CLAWS Trial Tagger from the University of Lancaster
- The XML tools used on this course
The TEI Open Office Filters are installed as follows:
- Download the file teioop5.jar from http://www.tei-c.org/Software/teioo/
- Open Open Office Writer and select XML Filter Settings from the Tools menu.
- Click the Open Package button, navigate to the file teioop5.jar, and select it.
- TEIP5 now appears as one of the available filter options.
Now proceed as follows:
- Open any Word document using the Open command on the File menu of Open Office.
- Select Save As from the File menu.
- Scroll down the list of available File types to TEI P5 (.xml) and press Save.
- Your document will now be saved as an XML file.
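Once the file is saved, a quick survey of its markup can help you decide what needs tidying. The following is only a minimal sketch in Python, assuming the filter writes a root TEI element in the standard TEI P5 namespace (http://www.tei-c.org/ns/1.0). The function name `survey` and the tiny `SAMPLE` document are made up for illustration; a real converted file would be read from disk instead.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"  # namespace used by TEI P5 documents


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def survey(xml_text):
    """Return (root_is_TEI, Counter of element names) for an XML string.

    ET.fromstring raises ParseError if the text is not well-formed XML.
    """
    root = ET.fromstring(xml_text)
    is_tei = root.tag == "{%s}TEI" % TEI_NS
    counts = Counter(local_name(el.tag) for el in root.iter())
    return is_tei, counts


# Hypothetical stand-in for a converted document (not real filter output).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Demo</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>First paragraph.</p><p>Second <hi rend="bold">one</hi>.</p></body></text>
</TEI>"""

is_tei, counts = survey(SAMPLE)
print("Root is TEI:", is_tei)
for name, n in counts.most_common():
    print(name, n)
```

An unexpectedly high count for one element, or elements you did not expect at all, is a hint about where the automatic conversion may need manual correction.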
If you look at the document in your XML editor, you will probably spot some tagging you'd like to improve and some data you weren't expecting to see in the TEI header. But it's a start! The document can also be automatically tagged by CLAWS...
- Select the body of your XML text with the mouse, copy it to the clipboard, and paste it into the form at http://www.comp.lancs.ac.uk/ucrel/claws/trial.html and make sure you select the Pseudo-XML output style.
- Press the Tag text Now button. Your text reappears with detailed markup, in which each word has been given a number and a POS code.
- Select the CLAWS output, copy it to the clipboard, and paste it back into your XML editor window, replacing the untagged version.
- Finally, you may like to make explicit the sentence divisions which CLAWS has introduced. To do this, you will need to transform the output using a simple XSLT stylesheet, in the same way as before.
- Now download this stylesheet and configure a new transformation scenario for Oxygen to use it.
- You should also be able to load the transformed file into Xaira using the corpus wizard.
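To give an idea of what making the sentence divisions explicit involves, here is a rough sketch in Python rather than XSLT. It is not the course stylesheet. The exact shape of the CLAWS Pseudo-XML output is not shown in these notes, so the input below is invented for illustration: it assumes each word is a `<w>` element whose `id` has the hypothetical form "sentence.word". The function name `add_sentences`, the `<s>` wrapper and its `n` attribute are likewise assumptions.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def sentence_number(word):
    """Return the part of a hypothetical 'sentence.word' id before the dot."""
    return word.get("id").split(".")[0]


def add_sentences(tagged_xml):
    """Wrap runs of <w> children that share a sentence number in <s> elements."""
    root = ET.fromstring(tagged_xml)
    words = list(root.findall("w"))  # direct <w> children, in document order
    for w in words:
        root.remove(w)
    # groupby collects adjacent words with the same sentence number
    for sent_no, group in groupby(words, key=sentence_number):
        s = ET.SubElement(root, "s", {"n": sent_no})
        s.extend(list(group))
    return root


# Invented input: the id format is an assumption, not real CLAWS output.
SAMPLE = """<body>
<w id="1.1" pos="AT0">The</w>
<w id="1.2" pos="NN1">cat</w>
<w id="1.3" pos="VVD">sat</w>
<w id="2.1" pos="PNP">It</w>
<w id="2.2" pos="VVD">purred</w>
</body>"""

result = add_sentences(SAMPLE)
for s in result.findall("s"):
    print(s.get("n"), [w.text for w in s])
```

If your actual CLAWS output marks words differently, only `sentence_number` should need changing; the grouping step stays the same.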
Good luck!