Experiments in Automated Linking of TEI Transcripts to Manuscript Images
(Hugh Cayless)
At UNC Chapel Hill, we have been conducting a number of experiments to determine the feasibility of automatically linking transcriptions of documents in the First Century of the First State University collection to images of the manuscripts themselves. Final results of these experiments are not yet available, but the techniques we have been employing show interesting results. Using free, open source tools and languages, we have been able to convert manuscript images into Scalable Vector Graphics (SVG), which can then be linked to lines in our TEI files.
The method we are attempting to refine involves using potrace, which can convert bitmapped images to vector formats, to turn manuscript images (first converted to bitmap using ImageMagick) into SVG. Letters and letter parts become SVG <path> elements. Once this is done, it is possible to use statistical methods to detect lines in the image and clusters of letters in a line. These clusters will subsequently be matched to words in the TEI file, which has been automatically modified to mark individual words. We hope the outcome of this experimentation will help us to learn what portions of this process may be accomplished automatically and what must involve human intervention.
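As an illustration of the line-detection step, the following is a minimal sketch in Python, not the code used in the experiments. It assumes that each SVG <path> begins with an absolute moveto command, that coordinate transforms on enclosing elements can be ignored, and that a text line shows up as a dense band of path starting points along the vertical axis. The function names and the bin_height and min_count parameters are hypothetical.

```python
import re
from collections import Counter

# Matches the first moveto command in a path's "d" attribute, e.g. "M100 250 c...".
MOVETO = re.compile(r"[Mm]\s*(-?\d+(?:\.\d+)?)[\s,]+(-?\d+(?:\.\d+)?)")


def path_start_y(d):
    """Return the y coordinate of a path's first moveto, or None if absent.

    Assumption: transforms on enclosing elements are ignored here.
    """
    match = MOVETO.match(d.strip())
    return float(match.group(2)) if match else None


def detect_lines(y_values, bin_height=5, min_count=3):
    """Group vertical path positions into bands that may be text lines.

    Builds a histogram of y values and returns (top, bottom) pairs for
    each run of adjacent bins holding at least min_count paths.
    """
    if not y_values:
        return []
    counts = Counter(int(y // bin_height) for y in y_values)
    bands = []
    start = None
    # Scan one bin past the last occupied bin so an open band is closed.
    for b in range(min(counts), max(counts) + 2):
        dense = counts.get(b, 0) >= min_count
        if dense and start is None:
            start = b
        elif not dense and start is not None:
            bands.append((start * bin_height, b * bin_height))
            start = None
    return bands
```

A histogram is only one of several ways to find lines; it is used here because it needs nothing beyond the path coordinates already present in the SVG. Skewed or curved lines of handwriting would likely need a more tolerant approach.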
A second motivation for this set of experiments is the nascent effort to explore the mass digitization of manuscripts in the Southern Historical Collection at UNC. Such an effort will produce a mass of organized page images without transcriptions. We hope to find out whether the methods we are testing can provide a framework wherein user-supplied transcripts can be easily linked to these images. Since SVG is an XML language, the linking of words and lines to features in the vector image can be accomplished using standard XML linking mechanisms. Moreover, it will be possible to construct a web-based view, in which the SVG is overlaid on the original image, and selection of the paths and links to the transcript can be achieved using JavaScript functions.
The tests described here are ongoing, but we have been able to generate legible SVG images from our source images and to detect lines within those images by analyzing the path information using Python. Attempts to detect words using statistical clustering methods (such as k-means) have not yet borne fruit, but it is possible that they will if the algorithm is modified so that placement of the initial centroids is not random, but instead follows the lines already detected in the image.
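The proposed modification to k-means can be sketched as follows. This is a hypothetical illustration rather than a tested solution: it assumes each path is reduced to a single (x, y) point, that a detected line is given as a (top, bottom) band, and that the number of clusters k per line is chosen by the caller (for example, from the number of words in the corresponding line of the transcript). The function names are invented for this example.

```python
def seeds_along_line(band, x_min, x_max, k):
    """Place k initial centroids evenly along the middle of a detected line."""
    top, bottom = band
    y_mid = (top + bottom) / 2
    step = (x_max - x_min) / k
    return [(x_min + step * (i + 0.5), y_mid) for i in range(k)]


def seeded_kmeans(points, centroids, iterations=20):
    """Run standard k-means updates starting from the given centroids.

    Returns the final centroids and, for each point, the index of the
    centroid it was assigned to.
    """
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared distance).
        assignment = [
            min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            for p in points
        ]
        # Move each centroid to the mean of its members; keep empty ones fixed.
        updated = []
        for i, c in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                updated.append(
                    (
                        sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members),
                    )
                )
            else:
                updated.append(c)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assignment
```

Seeding along the line makes the result deterministic and keeps every centroid inside a region known to contain writing. Whether the resulting clusters correspond to words remains the open question of the experiments.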
This paper will present a discussion of the experiments, methods and their outcomes, and showcase some possible web-based methods for viewing and manipulating the results.