Electronic Textual Editing: Effective Methods of Producing Machine-Readable Text from Manuscript and Print Sources [Eileen Gifford Fenton (JSTOR) and Hoyt N. Duggan (University of Virginia)]
The choice of a method for generating digital text from manuscript or printed sources requires careful deliberation. No single method is suitable for every project, and the very range of options can be misleading. Therefore, thoughtful consideration of the original source and analysis of project goals are essential. This essay considers two different types of source materials—manuscript and print. The discussion of manuscript sources is provided by Hoyt Duggan, based on experience in the Piers Plowman project, and the discussion of print sources is provided by Eileen Fenton, based on experience with JSTOR.
Manuscript Sources
Creating digital text from handwritten documents requires all of the traditional editorial and bibliographic disciplines necessary for publication in print plus some mastery of SGML/XML markup. The observations in the first part of this article are the result of the author's experiences as director of the Piers Plowman Electronic Archive (PPEA), a project designed to represent the entire textual tradition of a poem that survives in three authorial versions (conventionally known as A, B, and C) plus another dozen or so scribal versions. [1] Ten manuscripts attest the A version, eleven manuscripts and three sixteenth-century printed editions the B version, and twenty manuscripts the C version. Four spliced versions of ABC texts exist as well as seven AC splices and one AB splice. Of these sixty-plus witnesses no two are identical. The complete corpus consists of well over three million word forms and 10,000 manuscript pages. The oldest manuscripts date from the last decade of the fourteenth century; most are from the first half of the fifteenth, and a handful are sixteenth-century. Most are written in professional book hands—usually varied forms of Anglicana scripts. A few of the later ones are written in a highly cursive secretary script.
Available Methods
Optical character recognition (OCR) is not a viable option for early hand-written documents with their variable letter forms and unstandardized spelling systems. Nor is OCR satisfactory for duplicating early printed texts like Robert Crowley's texts of 1550. The crude early fonts and the frequent instances of broken or malformed characters present even more problems than the professionally executed fifteenth-century manuscripts. Keyboard entry is the only reliable means for converting handwritten documents into machine-readable form.
Text entry from handwritten sources requires a highly motivated and well-trained staff able to comprehend a late Middle English text and to interpret a number of scripts and hands. [2] We were fortunate to have generous funding initially from the Institute for Advanced Technology in the Humanities (IATH) at the University of Virginia and, later, from three two-year grants from the National Endowment for the Humanities. These grants have provided monies to employ and train a staff of highly competent young medievalists from the graduate programs in English, French, and classics at the University of Virginia. Working together, the team has transcribed forty-four manuscripts—all of the B witnesses, all but one of the A manuscripts, all of the AC splices—and has begun transcription of the C manuscripts.
What should we represent in a transcription?
Naively, I had expected that editing with the computer would be a good bit faster than editing with pen and paper, but the opposite has proved to be the case. Transcribing a manuscript electronically involves all of the work that both modern print editors and medieval scribes have traditionally done. Text entry is still done one character at a time, letter by letter, space by space. Proofreading remains the same demanding activity it has always been, requiring serial re-readings to compensate for eye-skip, arrhythmia, dittography, homoeoteleuton, and for the manifold other failures of concentration that have marred scribal efforts since literacy began. We have found ourselves making the same kinds of error at the keyboard that medieval scribes made when inscribing animal skins with quills. Indeed, we have introduced other forms of error facilitated by word processing software! Moreover, if the base transcription of the manuscript readings could be miraculously rendered error free, transcribing would still remain labor intensive since the electronic text with its potential for markup is capable of conveying far more information than printed editions ever attempted to convey. Such capabilities present endless possibilities for expanding the editorial project; once detailed markup has begun, it is difficult to determine where to stop.
Reading Piers Plowman as a printed text can be hugely misleading. So often students take away with them the notion of Langland as a dissident writer, operating at the margins of society, an idea encouraged by Langland himself, particularly in the C-Version, where he portrays himself as a west-country exile, perching precariously on London society, supported by a coterie of friends. For a writer of this sort, texts will surely have circulated as samizdat, clandestine writings hastily scribbled by enthusiasts, passed from hand to hand at gatherings of the disaffected? Of course nothing could be further from the truth, and no manuscript gives the lie to it more convincingly than Cambridge, Trinity College, MS B.15.17.
Such historical and bibliographic insight is enabled in electronic editions because SGML/XML markup, in addition to representing the underlying structures of the poetic text, also makes it possible to represent its formal bibliographical and codicological features in machine-readable and manipulable form.
Early in the process, the editor must determine how fine-grained the transcription is to be, because markup permits the specification in minute detail of the paleographic features of the document. The scribe of Trinity College MS B.15.17, for instance, made use of two forms of the letter <a> and three of capital <A>; two forms of <r>; and three of lower-case <s>: a long <s> appearing medially, a sigma-shaped <s> at the beginning of words, and an 8-shaped <s> at the end. Using entity references, we might well have distinguished each of these and devised fonts and style sheets to render the allographs. However, because our focus in creating the Archive is to represent the text, we have not attempted to represent every possible paleographic or other physical feature of the documents. We have chosen to transcribe graphemically rather than graphetically, ignoring differences insignificant for establishing the text. A student of letter forms, however, might well choose to tag such features in order to represent distinctions we have elided, and that would be entirely feasible. Since we leave our basic ASCII transcriptions available and transparent to users, scholars with a focus on other aspects of these texts may use our transcriptions and the attached, hypertextually linked color digital images to pursue their different interests in the documents. The options for markup at the time of transcription are many. We might have chosen to attach lexical and/or grammatical identifier tags to every word or to identify, syllable by syllable, the placement of ictus and arsis. Proper nouns might have been distinguished from common nouns, count nouns from mass nouns. Virtually any linguistic or prosodic feature of interest can be represented; consequently, the editor must make fundamental choices as to which features to mark.
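To make the graphemic/graphetic distinction concrete, the following Python sketch shows how a hypothetical graphetic transcription, using invented entity references for the Trinity scribe's allographs of <s> and <r>, could be collapsed into the graphemic text the Archive records. The entity names and the function are illustrative assumptions, not PPEA conventions.

    # A minimal sketch (not PPEA code) of collapsing a hypothetical graphetic
    # transcription into a graphemic one. The entity names (&slongs;, &ssigma;,
    # &s8;, &r2;) are invented for illustration; a real project would define
    # its own entity set and style sheets for display.
    ALLOGRAPH_TO_GRAPHEME = {
        "&slongs;": "s",   # long <s>, medial position
        "&ssigma;": "s",   # sigma-shaped <s>, word-initial
        "&s8;": "s",       # 8-shaped <s>, word-final
        "&r2;": "r",       # second form of <r>
    }

    def to_graphemic(graphetic_line: str) -> str:
        """Replace allograph entity references with their graphemes."""
        for entity, grapheme in ALLOGRAPH_TO_GRAPHEME.items():
            graphetic_line = graphetic_line.replace(entity, grapheme)
        return graphetic_line

    print(to_graphemic("In a &ssigma;omer &ssigma;e&slongs;on"))
    # -> "In a somer seson"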
Establishing Protocols
Early in the process of creating the PPEA, we established a set of Transcriptional Protocols to serve as a guide for our transcribers. They are posted on the Archive web site and supplied as hard copy to each of the project's editors. These appear at URL: http://jefferson.village.virginia.edu/seenet/piers/protocoltran.html. They have necessarily gone through several revisions since 1994, reflecting both the state of our knowledge and the stage the texts of the Archive have reached.
Sources for transcriptions
Ideally, the transcriber would work directly from the primary document. Failing direct access to the original, the transcriber should work from high-quality color digital images of each page of each manuscript. Ideal as either would be, on the Archive we have had access to the original or to color images for just four of the forty-four manuscripts we have transcribed. Of necessity, we have worked from black and white microfilm. Initially we made xerox flow copies for our transcribers. However, these duplications of the microfilm, itself already a copy, became a fertile source of transcriptional error, and we now work directly from the microfilm. Microfilm, of course, does not convey elements of color; consequently additional markup must be inserted later when editors have access to color images or to the manuscript itself. When libraries become both accustomed to and equipped for producing high-quality digital images, scholars who can find the funding to purchase such images will have a better base for scholarly editions than exists at present.
Text-Entry Software
Since 1993, we have used a variety of word-processing and text-entry software for transcribing the manuscripts. Initially we attempted to use the SGML-authoring program Author/Editor. Though it had the advantage of permitting us to parse the document as it was created, it was also slow and excessively complicated. At about the same time we experimented with various word-processing programs such as Microsoft Word and WordPerfect on both Windows and Macintosh platforms and Emacs on UNIX. Eventually we came to use other plain-text editors. For the past five years we have used almost exclusively the shareware program NoteTab Pro. Using its clipboard macro feature, we have linked it to the nsgmls parser. [3] Unlike Word and WordPerfect, which created problems by introducing unintended line breaks when we converted their files to plain ASCII, the ASCII editors have proved fast, cheap, and reliable.
We have found it essential to create macros for the SGML tags and to persuade transcribers to use them consistently both at the time of initial file creation and later when more information becomes available or when corrections are entered after proofreading. Individually typed tagging proliferates error.
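For projects without a clipboard macro of this kind, a small batch wrapper can serve the same purpose of keeping transcriptions parseable as they grow. The Python sketch below is a hypothetical example, not the Archive's own tooling; it assumes that nsgmls is installed on the system path, that the transcriptions sit in a directory of .sgm files, and that each file carries its own DOCTYPE declaration.

    # A hypothetical batch wrapper that runs the nsgmls validating parser over
    # every transcription file and reports any markup errors it finds.
    import subprocess
    from pathlib import Path

    def validate(transcription_dir: str) -> None:
        for sgml_file in sorted(Path(transcription_dir).glob("*.sgm")):
            # -s suppresses normal output, so only error messages are printed
            result = subprocess.run(
                ["nsgmls", "-s", str(sgml_file)],
                capture_output=True, text=True,
            )
            status = "OK" if result.returncode == 0 else "ERRORS"
            print(f"{sgml_file.name}: {status}")
            if result.stderr:
                print(result.stderr)

    validate("transcriptions")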
Version and Quality Control
Both version and quality control have been steady concerns for the Archive. With different people entering, proofreading, and correcting files, file management must be systematic. In one instance, when two different files were used for entering corrections to the same transcription, repairing the resulting confusion required several weeks of tedious labor comparing files and an additional team proofreading (two weeks' work for two people). We have since established a single copy of record, kept on a single server, and a standardized set of procedures for ensuring that all work is done on that copy.
Quality control for the transcriptions is assured by multiple layers of proofreading. Initially, various members of the editorial team transcribed from microfilm. However, we have now chosen to make a single transcriber responsible for completing the remaining C manuscripts. That transcriber proofreads each page as it is entered. Next a unique archival line number is assigned to each line using a locally produced Perl script and a set of reference tags with the line numbers of the Athlone edition of the three versions. These will eventually be replaced with a distinctive PPEA reference number, but the present system, which permits users of our editions to compare our texts easily with the Athlone texts, also makes machine collation simpler than it would otherwise have been. After the line references have been inserted, a team of two proofreads the text. One sits at the computer with color digital images of each page (or at the microfilm reader when we lack digital images) and reads the text from the page, letter by letter, space by space, while the other team member checks a printout to verify the accuracy of the interpreted SGML text. To help maintain attention, these readers exchange places and roles at the page break. Each of our published texts has undergone at least three such team proofreadings. After the final team proofing, any necessary corrections are entered by two people, one responsible for text entry, the other assuring that new errors are not introduced in the process of correction.
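The Archive's line-numbering tool is a locally written Perl script; the Python sketch below is only a rough illustration of the general idea, with invented element and attribute names: each verse line receives a sequential archival number and a placeholder reference keyed to the version being edited.

    # A hypothetical sketch of sequential line numbering for a transcription.
    # The <l> element and the attribute names are invented for illustration
    # and do not reproduce the Archive's actual markup or its Perl script.
    import re

    def number_lines(sgml_text: str, version: str = "B") -> str:
        counter = 0

        def add_ids(match: re.Match) -> str:
            nonlocal counter
            counter += 1
            reference = f"{version}.{counter}"      # placeholder reference
            return f'<l n="{counter}" ref="{reference}">'

        return re.sub(r"<l>", add_ids, sgml_text)

    sample = ("<l>In a somer seson whan softe was the sonne</l>\n"
              "<l>I shoop me into shroudes as I a sheep were</l>")
    print(number_lines(sample))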
Print Sources
Producing machine-readable text from print resources offers challenges and options somewhat different from those raised by manuscripts. Consequently an approach that works well for manuscripts may be less successful when applied to mass-produced print sources. Similarly, an approach that works well for one type of printed matter may be less successful when applied to a different type of printed matter. This reality highlights the importance of the point made at the beginning of this essay: there is no right or wrong approach. Instead it is key to find the approach that best matches the aims and mission of the project.
Once the mission to be served is clearly understood, there are a number of other factors which may be useful to consider when evaluating methods for producing machine-readable text. The suggestions offered below for printed materials are derived from JSTOR's experience with one particular type of print resource, the scholarly journal, in pursuit of JSTOR's particular mission. While the factors discussed here are likely to be applicable to a range of other projects and other print sources, this list is surely not exhaustive.
Background on JSTOR
JSTOR (www.jstor.org) is a not-for-profit organization with a mission to help the scholarly community take advantage of advances in information technologies. We pursue this mission through the creation of a trusted archive of core scholarly journals, emphasizing the conversion of entire journal back files from volume 1, issue 1, and the preservation of future e-versions of these titles. To date JSTOR has converted over 10 million journal pages from over 240 journals representing more than 170 publishers. The JSTOR archive is available to students, scholars, and researchers at more than 1,450 libraries and institutions. Each journal page digitized by JSTOR is processed by an optical character recognition application (discussed in more detail below) in order to produce a corresponding text file. The resulting text files are used to support the full text searching offered to JSTOR users, accomplished through a search engine that "reads" or indexes each word of each page and records its location and proximity to other words. JSTOR does not display these full text files to the user; their role is strictly (and literally) "behind the scenes".
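The indexing idea can be pictured with a minimal sketch (this is illustrative only, not JSTOR's software): for every word on every page, record the page and the word's position, so that a search engine can evaluate both presence and proximity. The page identifiers and sample text below are invented.

    # A minimal positional inverted index: word -> list of (page, offset).
    from collections import defaultdict

    def build_index(pages):
        index = defaultdict(list)
        for page_id, text in pages.items():
            for position, word in enumerate(text.lower().split()):
                index[word].append((page_id, position))
        return index

    pages = {
        "vol1-iss1-p17": "the economics of slavery in the antebellum south",
        "vol1-iss1-p18": "slavery and southern economic development",
    }
    index = build_index(pages)
    print(index["slavery"])   # [('vol1-iss1-p17', 3), ('vol1-iss1-p18', 0)]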
Fundamentals of the OCR Process
JSTOR's use of optical character recognition, or OCR, is certainly not unique. Reliance on this application has become increasingly commonplace as this technology has significantly improved over the last decade. Simply put, OCR is the process that converts the text of a printed page into a digital file. This process has two distinct steps: first, a digital image of the printed page is made, and second, the characters of the digital page image are "read" by the OCR program and saved in a text file. Sometimes these two steps are combined in one "package," but technically they can be separate and distinct activities creating different types of files that could have different uses. In JSTOR, we use the digital page images in order to present journal content to users of the JSTOR archive; we also feed the digital page images to OCR software in order to create full text versions of the journal pages. However, if a digital image has not been separately produced, as may be the case with other projects, the OCR process begins by taking a digital snapshot of the page.
Next, the OCR software examines the digital image and tries to analyze the layout of the contents of the page. It does so by dividing the page into sections referred to as "zones." Zones generally correspond to paragraphs or columns of text or to graphical elements such as grayscale or color illustrations. While this zoning task would be simple for a human reader able to interpret semantic clues, it can present a variety of challenges for a machine. For example, text will sometimes be displayed in columns. If the columns are very close together, OCR software may continue to read from left to right across the page, interleaving the columns. Or some magazines and journals will use boxes inserted into a column to explain a side issue or to carry an advertisement, and these may be difficult for the machine to recognize properly. Once the page analysis is complete, the order of the zones is determined, with the software attempting to replicate "reading order," the path a typical reader would follow to take in the full text of the page. The person running the software can, if he or she chooses, review this zone order and make manual adjustments to correct it. With the appropriate zone order established, the analysis of the characters can begin.
Most OCR applications work by looking at character groups, i.e., words, and comparing these to a dictionary included with the application. Just as our eye takes in the shapes of letters and white space on the page, the software interprets black dots and white spaces and compares the patterns with letters or even words it is trained to recognize. When a match is found, the software writes the appropriate word to the text file; when a match cannot be confidently made, the software makes a reasonable assumption and flags the word as low-confidence output. Where a word or character cannot be read at all, the default character for illegible text is inserted as a placeholder. The focus of this process is limited to the basic text of the page. Formatting such as bolding, italics, or underlining is not captured, nor are photos or other non-text illustrations, which cannot be read by the OCR application. If the OCR software relies upon an English dictionary, non-English text, including accents and diacritics, will not be recognized. Similarly, symbols such as those used in mathematical, chemical, or scientific notation are not recognized.
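As a rough illustration of this recognize-and-flag step, the sketch below uses the open-source Tesseract engine through the pytesseract wrapper. The engines used in JSTOR production are commercial products, and the file name and confidence threshold here are arbitrary assumptions made for the example.

    # Run OCR on a page image and flag low-confidence words for review.
    from PIL import Image
    import pytesseract
    from pytesseract import Output

    def ocr_with_flags(image_path: str, threshold: int = 60):
        data = pytesseract.image_to_data(Image.open(image_path),
                                         output_type=Output.DICT)
        results = []
        for word, conf in zip(data["text"], data["conf"]):
            if not word.strip():
                continue                      # skip empty/structural entries
            confidence = int(float(conf))     # -1 marks non-text blocks
            results.append((word, confidence, confidence < threshold))
        return results

    for word, conf, low in ocr_with_flags("journal_page_001.tif"):
        marker = "LOW CONFIDENCE" if low else ""
        print(f"{word}\t{conf}\t{marker}")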
The software that completes the OCR process is sometimes referred to as an OCR "engine," and OCR applications vary in the number of engines included in a particular commercial product. OmniPage Pro, for instance, contains a single OCR engine, and this application generally follows the process described above. Prime Recognition, by contrast, is a multi-engine application in which up to six engines perform OCR and their output is coordinated in a way that aims to maximize text accuracy. The multi-engine products follow the same process as the single-engine software but incorporate some additional steps into the final OCR stages. After determining a best word match, each engine reports a confidence rating indicating its probability of error for that word. Then, through a voting mechanism, the engines determine which word match has the highest probability of being accurate, and this selection is written to the text file. The presence of multiple engines generally improves the overall accuracy of the final text file by minimizing the effects of the errors of any single engine.
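The voting step might be pictured with a small sketch like the following. It is purely illustrative and is not Prime Recognition's algorithm: each engine proposes a word together with a confidence value, and the candidate with the greatest combined confidence is written to the text file.

    # Confidence-weighted voting across several hypothetical OCR engines.
    from collections import defaultdict

    def vote(candidates):
        """candidates: list of (word, confidence in 0..1), one per engine."""
        scores = defaultdict(float)
        for word, confidence in candidates:
            scores[word] += confidence
        return max(scores, key=scores.get)

    # Three hypothetical engines disagree about one word on the page:
    print(vote([("labour", 0.91), ("labonr", 0.42), ("labour", 0.77)]))
    # -> "labour"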
Print Source Characteristics and OCR Quality
Regardless of the number of engines an OCR application includes, a variety of factors can contribute to the overall quality of the OCR output. OCR applications perform best when the source material contains text arranged in even columns with regular margins, printed in a modern font with even dark ink that contrasts sharply with a light paper background free of background marks or bleed through. Pages stained by age, water damage, or debris often lead to poor quality OCR. Pages which contain uneven or narrow column margins, text printed over graphical elements, text that is 6 points or less in size, or very light or very dark printing will also generally produce poor OCR results. Printing flaws such as skewed text or "echo" printing, which creates an appearance similar to double vision, will also dramatically reduce OCR quality.
Key Points When Considering OCR
As noted above, there is no single correct method for producing machine-readable text. In considering which approach may be right for your project, there are at least eight factors you may want to consider.
- Select an approach for creating machine-readable text that will promote the mission of your project.
It will be important to evaluate the level of machine-readability required for your project. If, for instance, you need text files that will be read by a search engine to support full text searching, then text files produced by an OCR application may be sufficient. You may find, however, depending upon your source material, that even the best OCR output will require some manual monitoring or correction in order to produce satisfactory search results. Making this determination will require careful analysis of the level of search accuracy you require and the practical implications for achieving various levels of accuracy. Is it necessary, for example, for a scholar to find every occurrence of the search term? Are some sections of the text more critical for searching than others? The answer to these and comparable questions will help to determine the best approach. Similarly, you may determine that you wish to support features beyond full text searching and these may require a higher level of machine readability. In order to broaden search options, for instance, you may wish to apply SGML, XML or other tagging to support the focused searching of key elements or sections of the text. Finally, you will want to decide if you will display the OCR text files to users. Should your project include display, you will want to consider what this might mean for the level of accuracy required for OCR output. You will want to determine what level of tolerance the user may have—or may not have—for errors in the OCR text files. An understanding of the quality expectations of the users and how the displayed files may be used will be helpful in this analysis.
- Understand the characteristics of your source material.
As noted above, the quality of the paper and the characteristics of the printed source material, including font, layout, and graphical elements, can affect the overall quality of the OCR output. Understanding the full range of features present in the source will allow you to determine the most efficient way—or even whether—to employ OCR. If, for instance, your source material contains a large number of full-page graphics or includes a large number of special characters or scientific or mathematical symbols, you may wish to develop an automated means for filtering these pages out for special OCR or other treatment. Similarly, you may wish to filter those pages for which a non-English dictionary will be most appropriate or for which more or fewer engines are required to achieve the desired accuracy level.
- Develop appropriate quality control measures.
With a clear understanding of the source material characteristics, it will be possible to establish text accuracy targets that will successfully support the mission of the project. But in order to give these targets meaning, you will need to establish a quality control program to ensure that established targets are met. Quality control measures may vary from a complete review of all phases of text production to a sampling approach, in which only a subset of pages is selected for review. If a complete review is cost-prohibitive, as it likely will be for many projects, it may be important to retain a statistical consultant to ensure that the sample drawn for quality assessment is a valid representation of the overall data population. It will also be important to have established procedures in place regarding any data errors that are found. Will these errors be corrected by in-house staff? Or, if the work was produced by a third party, will the re-work be done by the vendor? How will quality standards be communicated to the staff involved? Whatever the approach to a quality control program, it will be important to consider carefully the staffing, budget, and space implications of the work. (A simple sampling sketch illustrating one such approach appears after this list.)
- Understand the impact of scale.
It is very important to realize that the solution that is appropriate for a short-term project of 10,000 pages may not be appropriate for a longer-term project of 10 million pages. Even for short-term projects, it may be that the approach that works smoothly in the start-up phase begins to break down as the project ramps up in scale. It is important to be sensitive to the impact of scale and to take this into account in project schedules and budgets, as changes in scale can have a dramatic impact on both.
- Carefully assess the impact of location.
Production of OCR can be, and frequently is, contracted to a third-party vendor; however, outsourcing may not be appropriate for all projects. In considering whether to outsource or to retain this process in-house, it is important to look at a variety of questions. You will want to ask which approach makes the most effective use of hardware, software, and staff. Which approach provides the greatest value—as distinct from offering the lowest cost? How important is it to retain local control over all aspects of the process? What impact will each approach have on key production schedules? What skills can an outside vendor offer, and what benefit is lost to the project by not having these skill sets locally? These and similar questions are important to consider when choosing whether or not to outsource this process. Again, there is no single right or wrong approach; the best solution will vary from project to project.
However, addressing these and similar questions will help to develop an understanding of the comparative advantages that each approach may offer. It may be important, for instance, to produce the highest quality output at the lowest possible costs. This may suggest that outsourcing the work is the best approach. But this requirement for high quality at low cost may have to be balanced, for example, with a need to build skills locally that might be essential for other reasons. For instance, it may be important to be able to leverage knowledge across multiple projects or to help to seed other initiatives at the home institution. This need may—or may not—outweigh the importance of maintaining the lowest possible cost. An example from JSTOR's approach to imaging may help to illustrate the point. At JSTOR we have made a decision to outsource the scanning of our journals because scanning is a very manual process with very clear, measurable specifications. We can and do have various objective quality measures that we routinely apply to the vendor's work. However, we have chosen to retain in-house the pre-scanning preparation work and the post-scanning quality control because through these processes we build an important knowledge base about the journal content and because quality assessment is so very key to the mission of the archive.
- Time: it will require more than you expect.
As a general rule and despite your best planning efforts, it will take more time than expected to produce machine-readable text. This may hold true for both the pre-OCR steps and the actual running of the software. The time required to run the software can vary greatly depending on the characteristics of the source material, the specific software, the number of engines, the hardware running the process, and the quality control measures developed for the project. Even within a single printed work, significant deviation can occur as the content varies. Running a representative test sample of the source material can be helpful, but know that it will not fully reveal the complexities of ongoing production processes. Working with the journals in the JSTOR archive, whose content varies over time and discipline, we see significant variation in software processing time. A single "dataset" of approximately 5,000 pages may require from 10 to 30 hours of processing time on our particular software/hardware configuration. It is helpful, especially in the initial days of the project, if contingency time can be included in production schedules. (A rough planning sketch based on these figures appears after this list.)
- Consider the implications of project duration.
Given the speed of technical innovation for hardware and software, it is important to take into account the expected duration of the project. It may be that the hardware and software applications that were the best choice at the beginning of a project may become dated and require reassessment as the project moves forward. And even if the original hardware and software applications continue to function, the project may gain important cost benefits from upgrading. Consequently, a careful, periodically recurring evaluation of new developments and technologies is recommended for projects of all but the shortest duration.
- Costs will be higher than anticipated.
While it may be relatively easy to project costs based on an initial run of sample data, it is difficult to anticipate how actual production costs may differ from estimates made from these pilot data. Typically, production costs are higher than expected, especially in the early days of a project. Machines may not perform at the speed advertised or as tested; drives may fail; software upgrades may be needed. As the scale of production increases, additional purchases may be needed to reach the desired production level and to maintain server response time within an acceptable range. Similarly, as project scale increases, more staff time is required to tend to systems administration and to develop a technical infrastructure suited to the operation. It is helpful to anticipate the unexpected in these areas and to build into project budgets, if possible, some ability to respond to these unforeseen needs as they arise. The actual variance will, of course, differ from project to project. A helpful—though quite rough—rule of thumb may be that if your expectation is 1, you should allocate 1.5.
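Returning to the quality-control item above, the sketch below illustrates one simple sampling design: draw a random sample of pages, count those that reviewers judge to fall short of the accuracy target, and estimate the overall failure rate with a normal-approximation confidence interval. The sample size, the project size, and the failure count are placeholder values, and, as noted above, a statistician should confirm that any such design suits the actual population.

    # Sampling-based quality check: estimate the page failure rate from a
    # random sample of reviewed pages. All figures here are placeholders.
    import math
    import random

    def draw_sample(all_page_ids, n=385):
        """Select pages for manual review."""
        return random.sample(all_page_ids, n)

    def failure_rate_interval(failures, n, z=1.96):
        """Point estimate and normal-approximation 95% interval."""
        p = failures / n
        margin = z * math.sqrt(p * (1 - p) / n)
        return p, max(0.0, p - margin), min(1.0, p + margin)

    all_pages = [f"page-{i:06d}" for i in range(250_000)]   # hypothetical project
    review_set = draw_sample(all_pages)
    # Suppose reviewers judge 9 of the 385 sampled pages to miss the target:
    p, low, high = failure_rate_interval(failures=9, n=len(review_set))
    print(f"Estimated failure rate {p:.1%} (95% CI roughly {low:.1%} to {high:.1%})")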
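Similarly, the timing figures mentioned above can be turned into a rough planning aid. The sketch below brackets the machine time for a larger project from the observation that a 5,000-page dataset may take 10 to 30 hours, and adds a contingency factor; the factor itself is an assumption for illustration, not a JSTOR figure.

    # Rough bracketing of OCR machine time for a project, with contingency.
    def processing_hours(total_pages, contingency=1.25):
        """Scale the observed 5,000-page range (10-30 hours) to the project."""
        fastest = total_pages / 5000 * 10    # hours at the fastest observed rate
        slowest = total_pages / 5000 * 30    # hours at the slowest observed rate
        return fastest * contingency, slowest * contingency

    low, high = processing_hours(1_000_000)
    print(f"Roughly {low:,.0f} to {high:,.0f} machine hours, contingency included")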
Considering each of these factors may help in identifying the best approach to producing machine-readable text for a particular project. However, each of these elements must be examined in the context of the overall mission of the project. Only in that context can one identify the method that offers the fullest support for the goals of the project within the constraints of time and budget.