Fourth Generation Collections: TEI, FRBR, and Canonical Text Services

(Gregory Crane)

Classicists still depend upon resources such as the Thesaurus Linguae Graecae digital library of classical Greek and the Packard Humanities Institute Latin Text CD, both of which predate the TEI. These collections contain minimal semantic markup and encode the basic page layout. The Perseus Digital Library began work before the TEI was developed, but it enthusiastically supported that development and added substantial semantic markup to its texts, providing a second generation of collection strategy. In the 1990s, projects such as the Making of America and JSTOR introduced a third generation of collection that combined industrial image scanning, library metadata and light TEI encoding of OCR-generated text.

We are now beginning to develop fourth-generation collections, which integrate carefully curated TEI transcriptions, often with elaborate markup, with OCR-generated text from much larger collections. In a fourth-generation collection, a single TEI transcription can serve as a model against which many other editions can be checked, corrected and collated, and from which they can receive initial, automatically generated tagging. The Cybereditions Project, funded by the Mellon Foundation, is beginning work on a fourth-generation collection for classics. This new collection is using the Open Content Alliance infrastructure to digitize editions for every major Greek and Latin author. For the first time, classicists will have access not only to transcribed texts but also to the full page images and apparatus of scholarly editions for every major classical author.

Within this collection, we will use existing TEI-encoded editions of Greek and Latin for automated error correction and collation as well as for markup projection. Where we do not have access to TEI-encoded transcriptions, we will correct multiple editions against each other. We will also test methods with which to project markup from tagged editions to texts generated by OCR.
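Correcting multiple editions against each other can be illustrated with a minimal sketch. It assumes that each edition has already been tokenized into a list of words, aligns every witness to a base text with Python's standard `difflib`, and replaces a base token only when a strict majority of all texts agree on a different reading. The function names `align_to_base` and `majority_correct` are hypothetical, and majority voting is only one plausible strategy, not necessarily the method the project uses.

```python
from collections import Counter
from difflib import SequenceMatcher


def align_to_base(base, witness):
    """For each base token position, return the witness token aligned to it, or None."""
    aligned = [None] * len(base)
    matcher = SequenceMatcher(a=base, b=witness, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # Keep one-to-one correspondences only: identical runs, or
        # substitutions that cover the same number of tokens.
        if tag == "equal" or (tag == "replace" and same_length):
            for offset in range(i2 - i1):
                aligned[i1 + offset] = witness[j1 + offset]
    return aligned


def majority_correct(base, witnesses):
    """Replace a base token when a strict majority of all texts share another reading."""
    columns = [align_to_base(base, w) for w in witnesses]
    total = len(witnesses) + 1  # the base text also casts a vote
    corrected = []
    for i, token in enumerate(base):
        votes = Counter([token] + [col[i] for col in columns if col[i] is not None])
        reading, count = votes.most_common(1)[0]
        corrected.append(reading if count * 2 > total else token)
    return corrected


# Hypothetical OCR output with one misread word, plus two cleaner witnesses.
ocr_text = ["arma", "uirurnque", "cano"]
edition_a = ["arma", "uirumque", "cano"]
edition_b = ["arma", "uirumque", "cano"]
corrected_text = majority_correct(ocr_text, [edition_a, edition_b])
```

In practice a genuine textual variant and an OCR error look the same to this vote, so disagreements would more likely be flagged for an editor than silently overwritten.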

Fourth-generation collections immediately demand that we augment the TEI in two ways.

First, we need to be able to manage not only multiple editions but also versions (e.g., English translations of Greek) and derivatives (commentaries, specialized lexica, indices) of canonical works. We originally developed the concept of Abstract Bibliographic Objects to represent relations of this type but, when the Functional Requirements for Bibliographic Records (FRBR) appeared, we reorganized our bibliographic data to be FRBR-compliant. The FRBR records allow us to create truly comprehensive bibliographic databases and to move beyond the single-edition checklists on which the TLG Canon of Greek authors and the PHI Latin list of authors and works were based. We now have an infrastructure in which we can associate dozens and even hundreds of versions that have appeared not only in print but also in manuscript form.
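The relations described here can be sketched as a small data model. The sketch below follows the FRBR distinction between a work, its expressions (editions and translations) and their manifestations (particular published or manuscript embodiments), and omits the item level for brevity. The class layout, the `derivatives` field and all labels are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Manifestation:
    """One physical embodiment of an expression, e.g. a printing or a manuscript."""
    label: str
    medium: str  # assumed vocabulary: "print" or "manuscript"


@dataclass
class Expression:
    """One realization of a work: an edited text or a translation."""
    label: str
    language: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The abstract canonical work, with its expressions and derivative works."""
    title: str
    author: str
    expressions: List[Expression] = field(default_factory=list)
    derivatives: List["Work"] = field(default_factory=list)  # commentaries, lexica, indices

    def expressions_in(self, language: str) -> List[Expression]:
        """All expressions of this work in the given language."""
        return [e for e in self.expressions if e.language == language]


# Placeholder records; the edition labels are hypothetical, not real citations.
iliad = Work(title="Iliad", author="Homer")
iliad.expressions.append(
    Expression("hypothetical Greek edition", "grc",
               [Manifestation("hypothetical printing", "print"),
                Manifestation("hypothetical codex", "manuscript")])
)
iliad.expressions.append(Expression("hypothetical English translation", "eng"))
iliad.derivatives.append(Work(title="hypothetical commentary on the Iliad", author="unknown"))
```

Because editions and translations hang off a single work record, a catalogue built this way is not tied to one checklist edition per work.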

Advanced work in classics, however, demands that we go beyond the level of the book and also encode the canonical citations (e.g., book/chapter/verse) on which classical scholarship depends. For this we can use TEI encoding within the text, but the emerging Canonical Text Services protocol allows us to make collections interoperable even when they use different TEI-based markup to encode the same citation schemes.
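As an illustration of citation-level interoperability, the sketch below parses a canonical reference written in the colon-separated CTS URN form (`urn:cts:namespace:textgroup.work.version:passage`) and compares two references while ignoring which edition each points to. That URN layout is assumed here rather than taken from this abstract, and the helper names and the edition identifiers in the example are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CtsUrn:
    namespace: str
    work: Tuple[str, ...]   # textgroup, then optionally work and version
    passage: Optional[str]  # e.g. "1.1" for book 1, line 1


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS-style URN into namespace, work hierarchy and passage."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS-style URN: {urn!r}")
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(parts[2], tuple(parts[3].split(".")), passage)


def same_canonical_passage(a: CtsUrn, b: CtsUrn) -> bool:
    """True when both URNs cite the same passage of the same work,
    regardless of the edition or translation each one names."""
    return (a.namespace == b.namespace
            and a.work[:2] == b.work[:2]  # compare textgroup and work only
            and a.passage == b.passage)


# Two illustrative references to the same line in different versions.
greek = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.edition-a:1.1")
english = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.translation-b:1.1")
```

The comparison never inspects the underlying markup, which is the point: two collections may encode book and line divisions differently in TEI and still answer requests for the same citation.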

TEI Members Meeting 2008