Electronic Textual Editing: Document Management and File Naming [Greg Crane]
Contents
- Introduction
- Preparing a TEI text for the PDLS.
- Adding a new file to an existing collection
- Processing a document in the PDLS.
- Information Extraction: Places and Dates.
- Conclusions.
Introduction
This chapter describes some of the things that happen to a TEI document when it enters into one particular environment, the Perseus Digital Library System (PDLS). The PDLS represents a middle ground between powerful and domain-specific systems and simpler more general digital library systems. While the PDLS project is appropriate for many types of data—including plain text, PDF, HTML, and RTF—Elli Mylonas ensured that the Perseus Project saw structured markup as a fundamental technology when work was first planned in 1986. Structured markup allows the systems that mediate between documents and humans (and, indeed, between different documents) to make more intelligent use of those documents. The code that implements the PDLS and the actual PDLS system are, in our view, secondary phenomena. The PDLS is significant in that it shows concretely what functions one evolving group of humanists felt were valuable and feasible. The PDLS reflects a cost-benefit analysis that went beyond theory. The PDLS is not an edition but a tangible interpretation of how editions can, given the limited technology and available labor power, interact with other electronic resources to serve a wide range of audiences.
The following sections quickly describe the metadata that the PDLS automatically extracts from TEI documents. This discussion alludes to, but does not describe, the various services that operate upon the metadata we produce. We will set that discussion aside both for reasons of space and because the metadata level is useful in and of itself. While the current formats and content of the metadata files are still rapidly evolving, we expect that digital library systems will harvest both the TEI documents and accompanying metadata, creating their own front ends, visualizations and metadata.
Preparing a TEI text for the PDLS.
Scalability is crucial to any digital library, and many of us have supported the TEI from the start because we saw it as a mechanism that would help future generations of librarians preserve and maintain electronic documents. The TEI should allow librarian effort to grow far less quickly than the number of documents: e.g., increasing the number of documents by a factor of 100 might require twice as much labor; by a factor of 1,000, three times as much; by a factor of 1,000,000, six times as much, etc.
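The figures in this example are consistent with labor growing as the base-10 logarithm of the growth in documents. That reading is an assumption drawn from the three numbers given, not a formula stated here, and the function name `relative_labor` is purely illustrative. A minimal sketch:

```python
import math

def relative_labor(growth_factor: float) -> float:
    """Labor multiplier under the assumed rule: log10 of the growth factor."""
    return math.log10(growth_factor)

for factor in (100, 1_000, 1_000_000):
    # round() guards against tiny floating-point error in the logarithm
    print(f"{factor:>9,} times the documents -> "
          f"{round(relative_labor(factor))} times the labor")
```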
In practice, the TEI has made it quite easy for us to ingest substantial collections: we have been able to add large bodies of third-party TEI texts to the PDLS with minimal effort. American Memory collections from the Library of Congress initially required a few hours of preparation, and more recent setup time has been measurable in minutes.
The American Memory collections represent a best-case scenario, since they consist of standard document types (books) and follow fairly consistent editorial practices. While we currently manage a variety of document types (dictionaries, commentaries, grammars, encyclopedias, catalogues, pamphlets), complex documents with very precise presentation schemes can require substantially more work to represent on-line.
The New Variorum Shakespeare (NVS) Series, for example, has a very precise look and feel that goes back almost to the American Civil War. Creating densely tagged TEI versions of NVS editions in a prototype for the NVS editorial board was relatively straightforward: the TEI had adequate expressive power to represent the semantic structures that NVS editions record. We found the process of converting the TEI files into an HTML format modeled on the NVS stylesheet to be far more complicated and frustrating, with messy tables within tables and other formatting hacks. On the other hand, if we had simply wanted to develop an electronic representation of the information in the TEI files and had not been compelled to follow the elaborate typographic and page layout conventions of the print NVS, the task would have been much different and, we suspect, simpler.
Adding reference metadata to a TEI file
Many files are quite large and users will often prefer not to see them in their entirety. The reader looking up a word in a dictionary probably does not want to download 40 megabytes of data to read a single dictionary entry; the reader of Shakespeare may at different times wish to view a line, a scene, or an act as well as an entire play or an arbitrary extract.
Document management systems need to know how they can divide a document. Some tagging schemes (e.g., page breaks) can provide reasonably natural units, but not all documents divide into such logical units. Even if we choose to impose hierarchically numbered divisions (<div1>, <div> etc.), it is not obvious which level of the hierarchy should serve as the default unit: where the highest level division may be too large in some cases, the smallest unit may be too small.
The PDLS would break the document up into three chunks, one for each unit with the attribute value chapter, despite the fact that two of these chunks represent a <div> and one represents a <div1> .
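A minimal sketch of this kind of chunking, assuming a hypothetical fragment (not the chapter's own example, and not checked against a TEI DTD) in which two <div> elements and one <div1> element all carry type="chapter". The `SAMPLE` text and the `chunks_by_type` helper are illustrative only; the PDLS's actual code is not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two <div> units and one <div1> unit typed "chapter".
SAMPLE = """<body>
  <div1 type="part" n="1">
    <div type="chapter" n="1"><p>First chapter.</p></div>
    <div type="chapter" n="2"><p>Second chapter.</p></div>
  </div1>
  <div1 type="chapter" n="3"><p>Third chapter.</p></div1>
</body>"""

def chunks_by_type(xml_text, unit="chapter"):
    """Return every element whose type attribute equals `unit`, in document order."""
    root = ET.fromstring(xml_text)
    return [el for el in root.iter() if el.get("type") == unit]

chunks = chunks_by_type(SAMPLE)
for el in chunks:
    print(el.tag, el.get("n"))
```

Selecting on the attribute value rather than the element name is what lets units at different depths of the hierarchy count as the same kind of chunk. A real implementation would also have to decide what to do when one such unit is nested inside another.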
Displaying the contents of the TEI file.
The PDLS has a default program that displays many texts in a reasonable fashion. This program is written in the COST XML conversion language. Editors can write their own specifications that build on the default specification in COST or they can create their own style sheets in XSLT, DOM or some other specification language.
Adding a new file to an existing collection
The <rdf:Description> element provides an identification scheme for this file. Notice that the <rdf:Description> tag itself contains a unique identifier for the document which it describes. We therefore can, if we choose, keep this particular entry fairly succinct. A separate file can link this document to a full MARC record prepared by a professional cataloguer. The individual adding a document to the collection can thus provide minimal data needed by the system in order to add a document.
The identification scheme describes the source of the document (Perseus), its basic type (e.g., text) and a unique identifier. In our case, the identifier consists of a two-number collection identifier (thus, the Civil War collection was collection 5 for year 2001) and a serial number within the collection for each document. The format of the identification number can, however, be arbitrary: the main point is that the number must uniquely describe a particular document.
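As a minimal sketch, an identifier such as 2001.05.0007 can be split into the parts just described. The class name, field names and zero-padding widths are assumptions inferred from that one identifier, not a published specification.

```python
from typing import NamedTuple

class DocumentId(NamedTuple):
    year: int        # first number of the collection identifier
    collection: int  # second number of the collection identifier
    serial: int      # serial number of the document within the collection

    @property
    def collection_id(self) -> str:
        return f"{self.year:04d}.{self.collection:02d}"

    def __str__(self) -> str:
        return f"{self.collection_id}.{self.serial:04d}"

def parse_document_id(text: str) -> DocumentId:
    """Parse a 'year.collection.serial' string; raise ValueError otherwise."""
    parts = text.strip().split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a year.collection.serial identifier: {text!r}")
    year, collection, serial = (int(p) for p in parts)
    return DocumentId(year, collection, serial)

doc = parse_document_id("2001.05.0007")
print(doc.collection_id, doc.serial)
```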
The <dcterms:isPartOf> element assigns this document to a particular collection. The <rdf:resource> attribute describes the relative location of the source file. Many different machines contain versions of the PDLS and the address is relative to a root directory within the Perseus Digital Library System.
The <figures> element describes the location of the full resolution scans of illustrations for this particular book. Since illustrations take up a great deal of disk space, the source images tend to live on a central server and we thus provide an absolute pathname. If we wished to include this image source data in the core DL data distribution, we could provide a relative path.
Note that we have a separate <pages> element. This describes the location of page images for a book.
Processing a document in the PDLS.
Basic display and browsing.
- Call up an HTML version of the document by its unique identifier: e.g., asking for 2001.05.0007 would, according to the above example, call up the first volume of Battles and Leaders of the Civil War.
- Provide a table of contents for the document.
- Support the ability to page through a document by its default chunk (e.g., chapter, page etc.)
- Allow the system to override the default and chunk on some other unit (e.g., view the document as individual pages rather than as sections or chapters).
- Support both interactive browsing and explicit URLs. Individuals should be able to move through a document interactively, while third parties should be able to generate links to particular sections of the document. At present, URLs can address any defined chunk (e.g., a URL can produce page- or chapter-sized chunks).
- Submit a URL and return an unformatted well-formed fragment, allowing a third-party system to format and/or analyze the XML source. We consider this feature to be critical, since it makes it possible for multiple systems to apply a wide range of analytical and visualization techniques to the data that we manage.
This core level of functionality does not at this point provide the ability to select an arbitrary fragment of a text: e.g., ‘Shakespeare, Julius Caesar, 'O, pardon me, thou bleeding piece of earth ... groaning for burial'’ should resolve to Antony's full speech over Caesar's corpse in the play. The system should be clever enough to return this chunk in editions with different citation schemes (e.g., the Globe vs. the Riverside Shakespeare) and with substantive editorial differences (e.g., original vs. modern spelling).
Subsequent functions of the PDLS automatically perform citation mapping, but this function needs to be developed more fully.
We do not consider text retrieval at this stage, although this is clearly a core function for any document or document collection.
Processing datafiles.
Convert SGML to XML.
While we support SGML and XML, the PDLS works internally with XML. All documents are converted to XML before subsequent processing. Internally, we maintain separate directories for each collection and separate XML files for each document. Following the example of the BLCW, this step scans war/blcw01.sgml and creates texts/2001.05/2001.05.0007.xml and this file becomes the basis for all subsequent processing.
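The directory layout in this example can be sketched as a function from a document identifier to the path of its XML file. The function name `xml_path_for` and the default root `texts` are assumptions based on the single path quoted above.

```python
from pathlib import PurePosixPath

def xml_path_for(doc_id: str, root: str = "texts") -> PurePosixPath:
    """Map e.g. '2001.05.0007' to texts/2001.05/2001.05.0007.xml."""
    year, collection, _serial = doc_id.split(".")
    # One directory per collection, one XML file per document.
    return PurePosixPath(root) / f"{year}.{collection}" / f"{doc_id}.xml"

print(xml_path_for("2001.05.0007"))
```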
Extract core metadata from the XML file.
This step converts a variety of data into tab-delimited fields. These include Dublin Core fields (e.g., Creator, Title, Date, Type) from the TEI header but also other categories of data from the body of the text. BLCW vol. 1, for example, contains hundreds of illustrations. We extract the unique identifiers and captions for these figures, generating records which indicate that document 2001.05.0007 contains (for which we use the Dublin Core relation HasPart) a particular object (e.g., 2001.05.0007.fig00017) with particular textual data associated with it (e.g., ‘Charles P. Stone, Brigadier-General, (From a photograph.)’).
Data generated at this stage is stored in 2001.05.0007.met in the same directory as the 2001.05.0007.xml file.
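A minimal sketch of what such tab-delimited records might look like. The four-column layout (subject, field, value, associated text) is an assumption made for illustration; the actual column order of a .met file is not documented here. The identifiers and caption are those quoted above.

```python
import csv
import io

def metadata_rows(doc_id, header_fields, figures):
    """Yield (subject, field, value, text) rows for one document."""
    for field, value in header_fields.items():
        yield (doc_id, field, value, "")
    for fig_id, caption in figures:
        # HasPart links the document to each figure it contains.
        yield (doc_id, "HasPart", fig_id, caption)

def to_tab_delimited(rows):
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

rows = list(metadata_rows(
    "2001.05.0007",
    {"Title": "Battles and Leaders of the Civil War", "Type": "text"},
    [("2001.05.0007.fig00017",
      "Charles P. Stone, Brigadier-General, (From a photograph.)")],
))
met_text = to_tab_delimited(rows)
print(met_text, end="")
```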
Aggregate the metadata for the PDL.
Metadata for individual documents can live not only in the collection description file, the TEI document header and the entire TEI document but also in other locations: full MARC records may, for example, be harvested from an OPAC. All the relevant metadata is collected and stored in the ptext database. We generate a tab delimited text file (ptext.db) that we read into whatever RDBMS we happen to be using on a given Perseus system (at present, we alternate between Postgres and MySQL).
Generate the lookup table: a list of valid citation strings for the XML file.
This contains a table of entry points into the document. Note that this is more involved than it may initially seem. The table allows us to divide the document into a variety of different chunks at varying levels of granularity. The table must not only be able to support random access but also identify various methods to divide the document.
Suppose we access a document by page number. We may wish to display the page, or we may wish to display the entire chapter of which that page is a part—or we may wish to determine how to chunk the document at runtime.
Chunking schemes do not necessarily follow a neat hierarchy. Speeches in the Greek historian Thucydides, for example, are very useful units of study but they often begin and end in the middle of the conventional book/chapter/section citation scheme. We can use the lookup tables to support overlapping hierarchies, addressing a well-known drawback of BNF style grammars such as SGML/XML.
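A minimal sketch of how a lookup table can hold overlapping schemes, assuming a toy text in which a speech starts in the middle of section 1.1 and ends in the middle of section 1.3. The table layout (scheme and label mapped to character offsets) and all names are hypothetical, not the PDLS's actual format.

```python
# Each segment belongs to one section; segments 1-3 also belong to a speech.
SEGMENTS = [
    ("1.1", "The envoys arrived and said: "),
    ("1.1", "We ask for aid. "),            # speech starts mid-section
    ("1.2", "We have long been allies. "),
    ("1.3", "Help us now. "),               # speech ends mid-section
    ("1.3", "The assembly then voted."),
]
SPEECH_SEGMENTS = {1, 2, 3}
TEXT = "".join(text for _, text in SEGMENTS)

def build_lookup(segments, speech_segments):
    """Map (scheme, label) to a (start, end) character range in the full text."""
    table = {}
    offset = 0
    for i, (section, text) in enumerate(segments):
        end = offset + len(text)
        keys = [("section", section)]
        if i in speech_segments:
            keys.append(("speech", "envoys"))
        for key in keys:
            start = table.get(key, (offset, end))[0]  # keep the earliest start
            table[key] = (start, end)
        offset = end
    return table

def fetch(text, table, scheme, label):
    start, end = table[(scheme, label)]
    return text[start:end]

TABLE = build_lookup(SEGMENTS, SPEECH_SEGMENTS)
print(repr(fetch(TEXT, TABLE, "section", "1.1")))
print(repr(fetch(TEXT, TABLE, "speech", "envoys")))
```

Because the ranges live outside the markup, the speech can cross section boundaries without either scheme having to nest inside the other.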
From citations to bidirectional links.
Web links are monodirectional: a link goes from document A to document B, but not from document B back to document A. The monodirectional nature of Web links makes the Web a directed graph and has profound implications for the topology of the Web. In digital libraries, however, we have greater control over the content and we can track links between documents. More importantly, long before computers were invented many formal publications developed canonical citation schemes that gave print citations persistent value: there are various ways to abbreviate ‘Homer’ and ‘Odyssey’, but Hom. Od. 4.132 described the same basic chunk of text in 1880 and 1980. Not all disciplines have respected persistence of reference (Shakespearean editors, for example, regularly renumber the lines of new editions, thus making it difficult to determine the precise reference of an act/scene/line reference unless one knows the precise edition being used), but those disciplines which developed consistent citation schemes have a major advantage as they seek to convert print publications into electronic databases.
Persistent citation schemes cover multiple editions of the same work. Thus, when the editors of classical texts decide that the lines of a poem have been scrambled during textual tradition, they may reshuffle them into what they think to be the logical order. In a classical edition, the lines may change places but their line numbers remain the same. We find instances where line 40 precedes line 39. Such shuffled passages produce odd citations and complicate the systems that manage such citations, since we cannot assume that line numbers always increase. Nevertheless, this consistency of naming means that line 40 always points to the same basic unit of text, wherever the editor of a particular edition may choose to locate it.
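One consequence for software is that line numbers have to be treated as labels rather than as positions. A minimal sketch, assuming a hypothetical edition in which the editor has printed line 40 before line 39:

```python
# Hypothetical printed order of one edition: line 40 precedes line 39.
PRINTED_ORDER = [37, 38, 40, 39, 41]

# Look positions up by label instead of assuming that numbers always increase.
POSITION = {line: index for index, line in enumerate(PRINTED_ORDER)}

def lines_between(first, last):
    """Return the line labels printed from `first` through `last`, in page order."""
    i, j = sorted((POSITION[first], POSITION[last]))
    return PRINTED_ORDER[i:j + 1]

print(lines_between(38, 39))
```

A binary search over the line numbers would fail on such a passage, which is why the sketch uses a dictionary keyed on the label.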
Note, however, that, if consistent reference schemes cut across editions, the text chunks pointed to by these citations will vary— sometimes in considerable degree—across editions. Persistent citation schemes are fuzzy and this fuzziness gives them flexibility.
The PDLS uses the concept of an ‘abstract bibliographic object’ (ABO) to capture the fact that a single work may appear in multiple editions. Thus, we can declare two documents to be versions of the same text. The versions can be variant editions of a source text (e.g., Denys Page's edition of Aeschylus' Agamemnon vs. that of Martin West) or a source text and its translation. In some cases, ABOs may reflect a loose affinity: citations to an original spelling edition of Hamlet based on the First Folio and using the through-line-number citation scheme (a single line count running throughout the play) are very different from those to a modern spelling edition with act/scene/line references. In other cases, text alignment may be approximate since the word and even clause order of translations will often differ. Nevertheless, ABOs allow us to provide a powerful organizational tool.
At present we use ABOs to perform two kinds of organizations. First, ABOs allow the PDLS to aggregate versions of an overall text in a reasonably scalable fashion. Once we link a given document to a particular ABO, the digital library system can then automatically make this resource available. In practice, when the reader calls up an electronic version of the Odyssey, the new resource can show up in the list of options. We could use ABOs to link partial versions of a text. If a translator publishes a version of a particular Pindaric Ode or of the Funeral Oration of Pericles, readers of that ode or of the funeral oration would see the extra resource. The ability to mesh overlapping chunks of texts raises interface issues (e.g., how do we keep from confusing readers when they find translations by X for some poems but not others?).
ABOs are arguably most exciting when they allow us to convert individual citations into bi-directional many-to-many links. Consider a comment on a particular passage of Vergil. A commentator attaches to ‘arma virumque cano’, the opening words of the Aeneid, the annotation ‘This is an imitation of the opening of the Odyssey, ἄνδρα μοι ἔννεπε.’ The system looks for Vergil Aeneid line 1 and then searches for the phrase ‘arma virumque cano’. It can then create a link from the source text back to the commentary. The reader who calls up the opening of the Aeneid sees the link back to the commentary. The link can be privileged (e.g., we are looking at a particular commentary on the commentator's own personal edition) or general (e.g., we link any comment on ‘arma virumque cano’ in Aen. 1.1 to any edition of Vergil). Clearly, this service raises interesting problems of filtering and customization as annotations encrust heavily studied canonical texts, but we view such problems as necessary challenges and the clusters of annotations on existing texts as opportunities to study the problems of managing annotations.
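A minimal sketch of the general case, in which one note is linked to every edition that shares the same ABO. The ABO identifier, the edition names and the data layout are hypothetical; only the lemma and the opening line of the Aeneid come from the text itself.

```python
from collections import defaultdict

ABO = "abo:vergil-aeneid"  # hypothetical identifier for the abstract work

# Two hypothetical editions of the same ABO, keyed by (ABO, citation).
EDITIONS = {
    "edition-a": {(ABO, "1.1"): "Arma virumque cano, Troiae qui primus ab oris"},
    "edition-b": {(ABO, "1.1"): "arma virumque cano, Troiae qui primus ab oris"},
}
NOTES = [
    {"id": "note-1", "abo": ABO, "cite": "1.1", "lemma": "arma virumque cano"},
]

def build_links(editions, notes):
    """Return (note -> passages, passage -> notes) for notes whose lemma is found."""
    note_to_passages = defaultdict(list)
    passage_to_notes = defaultdict(list)
    for note in notes:
        key = (note["abo"], note["cite"])
        for edition_id, passages in editions.items():
            passage = passages.get(key)
            # First resolve the citation, then confirm that the lemma occurs there.
            if passage is not None and note["lemma"].lower() in passage.lower():
                target = (edition_id, note["cite"])
                note_to_passages[note["id"]].append(target)
                passage_to_notes[target].append(note["id"])
    return note_to_passages, passage_to_notes

forward, backward = build_links(EDITIONS, NOTES)
print(forward["note-1"])
print(backward[("edition-b", "1.1")])
```

Storing both dictionaries is what makes the link bi-directional: the reader of either edition can reach the note, and the reader of the note can reach both editions.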
The consequences of such linking are potentially dramatic. The Liddell Scott Jones Greek Lexicon, 9th edition, comments on c.200,000 particular passages in 3,000,000 words of classical Greek: individual comments directly address roughly one word in fifteen. Some readers will discover these comments on particular passages—though the numerical majority of those reading Greek at any given time are probably intermediate students who are hard put to find a single citation buried in larger articles. In this environment, however, the lexicon can become a commentary: i.e., the readers of a text can see the words that LSJ comments upon.
The long term consequences of converting citations into bi-directional links are intriguing: not only can an online lexicon become a continually updated database, but the individual entries (and indeed all publications about words) increase in value when their visibility increases. Nevertheless, the opportunities raise the same challenges of information overload and need for filtering as with individual commentary notes.
Indexing textual links within Perseus.
From a practical perspective, citation linking looks for two sources. First, it scans for documents which contain explicit commentary notes on particular texts. This data lives in the lookup table generated previously (*.lut) and contains the Dublin Core relation IsCommentaryOn followed by an ABO. The second source consists of citations scattered throughout individual documents as <bibl> or <cit> references. Note that individual documents contain both categories of links, since explicit commentary notes often contain citations to other passages as well.
Information Extraction: Places and Dates.
Some literary works occupy spaces that are designedly amorphous and have no precise moorings within time and space: Greeks took pleasure in locating the adventures of Odysseus in various historical locations but the poem surely assumes a never-never land of gods and monsters. Many historical documents, however, locate themselves in very precise times and places: Dickens' novels are fictional and lack precise times but they are located in a London with real settings. Historical documents often point to very rich information about the time and places to which they allude. The places and dates which documents cite often offer important clues as to their content. Timelines and maps that plot the dates and places in a set of documents can provide not only browsing aids but information about the structure and nature of the documents. Users often explore topics that are structured by time (e.g., locate information about Worcester County in Massachusetts during the 1840s). We want to be able to construct geospatial queries, perhaps by selecting sections of a timeline and a map.
While we can manually tag all places and dates in a document, such manual tagging is not feasible for very large collections. A fifty year run of a 19th century newspaper is far larger than the corpus of all surviving Elizabethan and Jacobean drama, but the labor needed to edit such a corpus probably does not equal that necessary to edit a single major Shakespearean play. Information extraction seeks to automate the process of identifying people, places, things and the relations between them. Such automatic processes are never perfect—they seek to maximize recall (trying to find everything) and precision (trying not to collect false positives). Nevertheless automatic processes can provide data that, for all its imperfections, provides a true image of a document's content and offers a starting point for those editors who wish to provide more accurate results.
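Recall and precision can be made concrete with a small calculation. This is a minimal sketch using the standard definitions; the place names in the sample sets are made up for illustration and are not results from any Perseus collection.

```python
def precision_recall(found, expected):
    """Return (precision, recall) for a set of extracted items."""
    true_positives = len(found & expected)
    # Precision: what share of the items found are correct?
    precision = true_positives / len(found) if found else 0.0
    # Recall: what share of the items that should be found were found?
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall

expected = {"Vicksburg", "Gettysburg", "Richmond", "Shiloh"}
found = {"Vicksburg", "Gettysburg", "Washington"}  # "Washington" is a false positive

precision, recall = precision_recall(found, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```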
The citation extraction described in the previous section represents a simpler information extraction function, but information extraction can include many tasks. Automatic syntactic analysis, for example, is a very complex function in natural language processing that has great potential for scholarship that focuses on language. A syntactic analyzer can create a database of parse trees which identify, among other things, the subjects and objects of verbs. The output of automatic parsing is imperfect and will vary from corpus to corpus, but imperfect scalable analysis of large bodies of data can reveal significant patterns. Without syntactic analysis we could, for example, determine that dog, bite, and man are related but we could not determine whether man or dog were more commonly the subject.
Information extraction tends to be hierarchical, with more tractable tasks serving as the foundation for further operations. Thus, a morphological analyzer (which can recognize the form and dictionary entry of inflected words) is often a necessary component for a syntactic analyser. By recognizing people and places we can identify relations between the two: e.g., ‘General Grant at Vicksburg’ contains not only references to a person and a place but to a relation between the two.
Information extraction also tends to be domain specific. Identifying chemical compounds raises problems that differ from identifying military units (e.g., recognizing ‘1st Mass.’ and ‘First Regiment Massachusetts Volunteers’ as references to the same thing). Place names are common elements of human language, but place names in the Greco-Roman world are much easier to identify than those in the United States. Greek and Roman place names do not overlap as often with the names of people and things (e.g., ‘Christmas, Arizona,’ ‘John, Louisiana’): Greco-Roman place names are semantically less ambiguous. Also, Greek and Roman place names have a much better chance of being unique: there are relatively few names such as Salamis (which can describe either an island near Athens or a place in Cyprus) and none so ambiguous as Springfield or Lebanon, which are each the names for dozens of places in the United States.
Members of Perseus have created programs to identify specific entities for particular collections (e.g., monetary quantities in dollars and pounds, Anglo-British personal names, London street names, US Civil War Military Units). At present, these tags are added to documents before they enter the PDLS. Ideally, the PDLS (or a similar system) would allow collections to share information extraction modules more smoothly. The General Architecture for Text Engineering (GATE), developed at Sheffield, provides one model of how to integrate complementary information extraction modules and may point the way for digital library systems that incorporate these functions as a matter of course. The generic PDLS contains routines to identify references to places and dates. We have chosen these two categories because generic programs can produce reasonable (if varying) results for both categories. We have not yet added a module to identify personal names (though internally we do look for clues such as ‘Mr.’ to distinguish ‘Mr. Washington’ from places such as Washington D. C. and Washington State).
Extracting Places
We scan all XML files for possible place names. The scan asks three major questions. First, has it found a word or phrase that is a proper name? Second, if it is a proper name, is it a place or something else (e.g., Washington = George Washington?). Third, if it is a place, which place is it (e.g., is it Washington, D. C. or Washington State)? Each of these questions introduces its own class of errors, but results range from well over 90% accuracy for Greco-Roman place names to roughly 80% accuracy for US place names. Even the relatively noisy geographic data generated from texts describing the US allow us to identify the geographic terms for most texts.
The geographic data is stored in files with the extension .ref. These files associate particular instances of a placename in a given text with various authority lists such as the Getty Thesaurus of Geographic Names (TGN). Thus, we associate references to Gettysburg with tgn,7014060, which describes the town of Gettysburg in Pennsylvania. The TGN number allows us to look up longitude-latitude data with which we can plot Gettysburg on a map. We currently combine geospatial data that Perseus collected for Greco-Roman sites with TGN data. Other data sources can be added easily to this scheme.
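The three questions can be sketched as a small pipeline over a toy gazetteer. Everything here is an assumption made for illustration: the gazetteer contents, the `example,…` identifiers (placeholders, not real TGN numbers), the list of personal titles and the shape of the output records. Only tgn,7014060 for Gettysburg comes from the text.

```python
# Hypothetical miniature gazetteer: name -> candidate (authority id, description).
GAZETTEER = {
    "Gettysburg": [("tgn,7014060", "town in Pennsylvania")],
    "Washington": [("example,1", "Washington, D. C."),
                   ("example,2", "Washington State")],
}
PERSON_TITLES = ("Mr.", "Mrs.", "General", "Gen.")

def find_places(text):
    """Return (token index, name, authority id, is_ambiguous) for each place found."""
    references = []
    tokens = text.split()
    for i, token in enumerate(tokens):
        word = token.strip(".,;:()")
        # 1. Is it a proper name that we know about?
        if not word[:1].isupper() or word not in GAZETTEER:
            continue
        # 2. Is it a place or something else? A preceding title suggests a person.
        if i > 0 and tokens[i - 1] in PERSON_TITLES:
            continue
        # 3. Which place is it? Naively take the first candidate and flag ambiguity.
        candidates = GAZETTEER[word]
        references.append((i, word, candidates[0][0], len(candidates) > 1))
    return references

refs = find_places("Mr. Washington left Washington for Gettysburg.")
for ref in refs:
    print(ref)
```

The third step is where most of the real work lies. A practical system would rank candidates using context, such as other places mentioned nearby, rather than taking the first entry.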
Extracting Dates
We scan all XML files for dates. In practice, dates have proven much easier to identify than place names. In part, this is because in most documents the majority of four-digit numbers without commas (e.g., 1862 rather than 1,862) are dates. Furthermore, many dates have easily recognized patterns (e.g., MONTH, ONE- OR TWO-DIGIT NUMBER, FOUR-DIGIT NUMBER).
Problems do occur. Some texts have many isolated numbers such as 1875 that are not dates: in documents that use the Through Line Number scheme to reference Shakespeare, four digit numbers such as 1875 often refer to lines in the play. Thus, general strategies must be more conservative than those aimed at a particular collection (e.g., collections where it is more effective to assume any number between 1600 and 2000 is a date vs. those where it is not).
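The trade-off between a liberal, collection-specific rule and a conservative, general one can be sketched with two regular expressions. The function names and the sample sentence are illustrative assumptions; the 1600 to 2000 range is the one mentioned above.

```python
import re

MONTH_NAMES = ("January February March April May June July "
               "August September October November December").split()

FOUR_DIGITS = re.compile(r"\b\d{4}\b")
# MONTH, one- or two-digit day, four-digit year, e.g. "June 5, 1861".
FULL_DATE = re.compile(r"\b(?:%s) \d{1,2}, (\d{4})\b" % "|".join(MONTH_NAMES))

def years_liberal(text, low=1600, high=2000):
    """Treat any four-digit number in [low, high] as a year."""
    numbers = (int(m.group()) for m in FOUR_DIGITS.finditer(text))
    return [n for n in numbers if low <= n <= high]

def years_conservative(text):
    """Accept a year only when it closes a full month-day-year pattern."""
    return [int(m.group(1)) for m in FULL_DATE.finditer(text)]

sample = "On June 5, 1861 they marched; compare line 1875 of the play and 1,862 men."
print(years_liberal(sample))
print(years_conservative(sample))
```

On this sample the liberal rule wrongly accepts the line number 1875, while the conservative rule keeps only 1861. Neither treats 1,862 as a year, because the comma breaks the run of four digits.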
Nevertheless, we found that we were quickly able to mine full text for useful date information. We were then able to generate automatic timelines that strikingly captured the chronological coverage and the nature of individual documents: linear histories yield distinctive timelines, in which the dates slope downwards. Likewise, we can see temporal emphasis in catalogues and more hypertextual documents that do not follow a consistent narrative (e.g., a city guide that tells the story of various buildings, constantly jumping back and forth in time): some London guidebooks show spikes of interest in the 1660s, reflecting the fact that many of the buildings described were damaged or destroyed in the fire.
Because the Perseus Digital Library contains many documents about the Greco-Roman world, the PDLS provides reasonable support for AD, BC, CE, and BCE style dates.
We have used the <date> element. All dates are converted into a standard form using the value attribute. Thus, ‘June 5, 1861’, ‘the fifth of June, 1861’ and ‘6/5/1861’ are all stored as value=1861-06-05. We use the <dateRange> element to capture date ranges as well. For each XML file, we create a .dat file in which to store extracted date information in an exportable tab delimited form.
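A minimal sketch of the normalization step for the three forms just listed. It assumes US month/day/year order for the numeric form and handles only the ordinals first through tenth. It also relies on Python's `datetime.date`, which cannot represent BC dates, so it illustrates the idea rather than the PDLS's own routine.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    "January February March April May June July August September "
    "October November December".split(), start=1)}
ORDINALS = {name: number for number, name in enumerate(
    "first second third fourth fifth sixth seventh eighth ninth tenth".split(),
    start=1)}

MONTH_ALT = "|".join(MONTHS)
ORDINAL_ALT = "|".join(ORDINALS)

def normalize(text):
    """Return an ISO yyyy-mm-dd string for a recognized date, else None."""
    m = re.fullmatch(rf"({MONTH_ALT}) (\d{{1,2}}), (\d{{4}})", text)
    if m:  # "June 5, 1861"
        return date(int(m[3]), MONTHS[m[1]], int(m[2])).isoformat()
    m = re.fullmatch(rf"the ({ORDINAL_ALT}) of ({MONTH_ALT}), (\d{{4}})", text)
    if m:  # "the fifth of June, 1861"
        return date(int(m[3]), MONTHS[m[2]], ORDINALS[m[1]]).isoformat()
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d{4})", text)
    if m:  # "6/5/1861", read as month/day/year
        return date(int(m[3]), int(m[1]), int(m[2])).isoformat()
    return None

for form in ("June 5, 1861", "the fifth of June, 1861", "6/5/1861"):
    print(form, "->", normalize(form))
```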
Using Places and Dates to identify events
Once lists of places and dates are available, it is possible to look for associations between the two to identify significant events. David Smith (Smith 2002) reports the results of preliminary research on event identification in a heterogeneous digital library. He found that he could identify many significant events, with major battles being particularly easy to find. We could (although we do not yet) generate additional metadata files listing significant place/date collocations as a part of the standard PDLS. We mention such potential metadata to illustrate how we can use named entity identification to elicit relationships between entities (e.g., the notion that something significant occurred at a given place on a given date). Furthermore, date and place collocation detection provides an example of the kind of service that probably belongs in a general digital library system. Domain experts then have the option to refine or reconfigure the date/place collocations: e.g., one might develop heuristics to identify particular event classes such as experiments or speeches.
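A minimal sketch of place/date collocation, assuming that each chunk of text has already been reduced to the places and normalized dates it mentions. The raw co-occurrence count used here is a deliberately crude stand-in and is not the method reported in Smith 2002; the sample chunks are invented for illustration.

```python
from collections import Counter

def collocations(chunks, min_count=2):
    """Count (place, date) pairs that share a chunk; keep pairs seen min_count times."""
    counts = Counter()
    for places, dates in chunks:
        for place in set(places):
            for day in set(dates):
                counts[(place, day)] += 1
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]

# Invented sample: each tuple is (places mentioned, dates mentioned) in one chunk.
chunks = [
    (["Gettysburg"], ["1863-07-01"]),
    (["Gettysburg", "Washington"], ["1863-07-01", "1861-06-05"]),
    (["Washington"], ["1862-01-01"]),
]
significant = collocations(chunks)
print(significant)
```

A more careful approach would weigh each pair against how often the place and the date occur on their own, so that very common places do not dominate the results.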
Conclusions.
- First, the separation of content from presentation facilitates and encourages multiple front-ends to the same content. Since electronic editions can have useful lifespans that extend over decades, if not longer, editors need to assume that librarians or collection editors will exercise substantial control over the presentation of their work.
- Second, individual editions will benefit if they can be treated not only as distinct units but as parts of larger collections: e.g., an edition of Macbeth should interact with other plays by Shakespeare, all Elizabethan and Jacobean drama, and all on-line dramatic texts in any language. The more powerfully their editions interact with documents that will accumulate over the coming years and decades, the more useful the individual editions and digital libraries will be as a whole.
- Third, while the sacred hand of the editor may determine every byte in the original source file, digital library systems will probably generate far more tags in associated standoff markup than are present in the original source text. This phenomenon is already visible in most texts within the Perseus digital library, where information extraction routines add <persName> , <placeName> , <date> , and other elements.
- Fourth, the digital library systems that mediate between edition and audience can generate new services and new audiences which the original editor did not anticipate. Informal evidence suggests that the Perseus digital library has made many Greek and Latin materials serve a diverse, geographically distributed audience. Other collections report similar jumps in the intensity and breadth of usage following open electronic publication.
The practical implications of these conclusions are immense but frustrating. We have not yet established practical conventions for electronic editions—nor are such conventions likely to assume a stable form in the near future. The structures that we add to our documents reflect elaborate (if often unconscious) cost/benefit decisions not only about the interests of our audience but about how future systems will shape and enable those interests. Consider the simple example of place names. We can already identify and disambiguate 80% of the placenames in highly problematic documents about the United States. These results are good enough for the purposes of generating rough maps that document the geographic coverage of a document. Nevertheless, editors can fix erroneous tags and generate a clean text that will provide much more precise geospatial data. Given the rise of Geographical Information Systems (GIS), electronic editions where place names are not tagged and aligned to a major gazetteer may stand at a substantial disadvantage. Similar issues arise surrounding syntactic and semantic information, where editors may find themselves including parse trees or semantic categorizations that integrate their editions into much larger frameworks.