Automagical conversion notes


Automagical Conversion

These notes describe one way of converting Word documents into a well marked-up TEI XML corpus.

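You will need the following tools: Open Office Writer with the TEI Open Office Filters, an XML editor such as Oxygen, and a web browser for the CLAWS trial service.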
The TEI Open Office Filters are installed as follows:
  • Download the file teioop5.jar from http://www.tei-c.org/Software/teioo/
  • Open Open Office Writer and select XML Filter Settings from the Tools menu.
  • Click the Open Package button, navigate to the file teioop5.jar, and select it.
  • TEIP5 now appears as one of the available filter options.
Now proceed as follows:
  • Open any Word document using the Open command on the File menu of Open Office.
  • Select Save As from the File menu.
  • Scroll down the list of available file types to TEI P5 (.xml) and press Save.
  • Your document will now be saved as an XML file.
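
Once the file has been saved, a quick element count can show which tags the filter produced before you open the file in an editor. The following is only a minimal sketch in Python: it assumes the saved file uses the standard TEI P5 namespace, and the SAMPLE string is a hand-made fragment for illustration, not real filter output. The function name summarise is likewise just an illustrative choice.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"  # TEI P5 namespace

# ASSUMPTION: a hand-made fragment standing in for a file saved by the filter.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Untitled</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>First paragraph.</p><p>Second <hi rend="bold">paragraph</hi>.</p></body></text>
</TEI>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def summarise(xml_text):
    """Count the elements found inside the teiHeader and inside the text."""
    root = ET.fromstring(xml_text)
    counts = {}
    for part in ("teiHeader", "text"):
        section = root.find(f"{{{TEI_NS}}}{part}")
        if section is None:
            counts[part] = Counter()
        else:
            counts[part] = Counter(local_name(el.tag) for el in section.iter())
    return counts


if __name__ == "__main__":
    for part, tally in summarise(SAMPLE).items():
        print(part, dict(tally))
```

To try it on a real file, read the file's contents into a string and pass that to summarise in place of SAMPLE.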

If you look at the document in your XML editor, you will probably spot some tagging you'd like to improve and some data you weren't expecting to see in the TEI header. But it's a start! The document can also be automatically tagged by CLAWS...

  • Select the body of your XML text with the mouse, copy it to the clipboard, and paste it into the form at http://www.comp.lancs.ac.uk/ucrel/claws/trial.html and make sure you select the Pseudo-XML output style.
  • Press the Tag text Now button. Your text reappears with detailed markup, in which each word has been given a number and a POS code
  • Select the CLAWS output, copy it to the clipboard, and paste it back into your XML editor window, replacing the untagged version.
  • Finally, you may like to make explicit the sentence divisions which CLAWS has introduced. To do this, you will need to transform the output using a simple XSLT stylesheet, in the same way as before.
  • Now download this stylesheet and configure a new transformation scenario for Oxygen to use it.
  • You should also be able to load the transformed file into Xaira using the corpus wizard.
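
The idea behind making the sentence divisions explicit can be sketched in a few lines of Python. This is not the stylesheet mentioned above, only an illustration of the same grouping step. It assumes, without having been checked against real CLAWS output, that each word arrives as something like <w id="1.2" pos="NN1">cat</w>, with the sentence number before the dot in the id. The function name add_sentences, the n attribute, and the SAMPLE text are all hypothetical.

```python
import re
from itertools import groupby

# ASSUMPTION: every word looks like <w id="S.W" pos="CODE">form</w>, where S is
# the sentence number and W the word number. Real CLAWS output may differ.
WORD = re.compile(r'<w id="(\d+)\.(\d+)" pos="([^"]+)">(.*?)</w>', re.S)

# Hand-made example input, not real CLAWS output.
SAMPLE = (
    '<w id="1.1" pos="AT0">The</w> <w id="1.2" pos="NN1">cat</w> '
    '<w id="1.3" pos="VVD">sat</w> <w id="1.4" pos="PUN">.</w> '
    '<w id="2.1" pos="PNP">It</w> <w id="2.2" pos="VVD">purred</w> '
    '<w id="2.3" pos="PUN">.</w>'
)


def add_sentences(tagged):
    """Wrap each run of words sharing a sentence number in an <s> element."""
    words = WORD.findall(tagged)
    lines = []
    for sent_no, group in groupby(words, key=lambda w: w[0]):
        lines.append(f'<s n="{sent_no}">')
        for s, w_no, pos, form in group:
            lines.append(f'  <w id="{s}.{w_no}" pos="{pos}">{form}</w>')
        lines.append("</s>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(add_sentences(SAMPLE))
```

A regular expression is used here only because the assumed input is a flat run of word elements. For anything more deeply nested, an XSLT stylesheet run from Oxygen, as described in the steps above, is the more robust route.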

Good luck!