A TEI-Based Tag Set for Manuscript Transcription

Abstract

This document describes ds3.dtd, a set of XML tags for the transcription of medieval manuscripts, for use either with excerpts (e.g. transcriptions of individual pages) or with a full transcription of an entire manuscript. A guide of this scope cannot anticipate every transcriptional need for manuscripts; however, it does define the XML elements suggested for use, and attempts to illustrate them. All of the documentation files are included in the ds3.exe package and are linked by a table of contents.

This tagset conforms to the Text Encoding Initiative (TEI) P4 Guidelines; the presentation and selection of elements have been influenced as well by discussions with representatives of the Nordic Network, and by David Mackenzie's Manual of Manuscript Transcription for the Dictionary of the Old Spanish Language (5th ed. rev. and exp. Ray Harris-Northall).

For more information on TEI, and to find an online version of the TEI Guidelines, please see http://www.tei-c.org.

The SGML "beta release" ds.dtd was created in 1998 by Michael Sperberg-McQueen for Digital Scriptorium. In 2001, David Seaman created the ds2 package, including a sample style sheet, batch files that call James Clark's sp parser, and a NoteTab tag library. The XML ds2.dtd consisted of the core TEI.2 tagsets plus elements from the additional tagsets "linking", "figures", "analysis", and "transcription". Based on this work, ds3 (2002) adds a few elements from the TEI Medieval Manuscripts Working Group, as well as a wider variety of sample texts and revised documentation.

Overview

Identification (TEI Header)

Characters and Abbreviations

The Work

The Written Text

A Note on Attributes

Alphabetical List of Element Types

Overview

Levels of Transcription

Several levels of transcription may be distinguished, which vary in the level of detail they preserve from the manuscript, and which are adequate for different purposes.

Literary: a transcription is adequate for literary purposes when it presents the wording of the text preserved in a manuscript with enough fidelity to allow a user of the transcription to know what language the manuscript is in and read the work(s) it contains.
Illustrative: a transcription is adequate for illustrative purposes when it represents the manuscript in a form suitable for display with an image of the manuscript, e.g. so that readers can consult the transcription for help deciphering difficult passages or unfamiliar abbreviations.
Philological: a transcription is adequate for philological purposes when it records the words of the text in enough detail to support text-critical, orthographic, or lexical investigations.
Paleographic or codicological: a transcription is adequate for paleographic or codicological purposes when it records paleographic or codicological detail completely enough to allow paleographers or codicologists working with the manuscript to search in the transcription to find phenomena of interest to them.

These levels of transcription form a series from literary to paleographic or codicological: each level typically includes all the information present at lower levels, and adds further information. This is a slight oversimplification: lower levels include less information about the manuscript, but may include other information not included in higher levels. A transcription for literary purposes, for example, may provide normalized forms of some words and expansions of most abbreviations, without recording the details of the manuscript form; transcriptions for philological and paleographical purposes will document the forms actually in the manuscript, but will not necessarily provide normalized equivalents. Sometimes, that is, different levels of transcription provide the same kind of information in more or less detail, and sometimes they provide slightly different kinds of information. For convenience, we will refer to literary transcriptions, illustrative transcriptions, philological transcriptions, and paleographic (or codicological) transcriptions (or, for even greater brevity, L-transcriptions, I-transcriptions, P-transcriptions, and PC-transcriptions).

A literary transcription must reproduce the lexical items of each text, in an appropriate sequence. Variations in hand, ink, orthography, abbreviation practice, and the like need not be recorded. Abbreviations may be expanded silently. Scribal or editorial alterations to the text (insertions, deletions, alterations) may similarly be passed over in silence. Complications in the sequence of words (e.g. transpositions, or insertions with an uncertain point of insertion) may be resolved without leaving any trace in the transcription. The logical structure of the literary work, on the other hand, must be preserved as presented in the manuscript: logical divisions must be marked, headings must be transcribed as headings.

An illustrative transcription should include all the information present in a transcription for literary purposes; in addition, it should record the foliation and column boundaries of the manuscript, for ease in moving between manuscript image and manuscript transcription.

A transcription for philological purposes should record linguistically relevant details of scribal practice: the orthography and graphemic inventory of the manuscript should be followed closely; abbreviations, insertions, changes, deletions, erasures, and disturbances in sequence of material should all be registered. Variations in the forms of letters or abbreviations will not typically be captured with complete detail, but when two letter forms are sufficiently distinct that they might give rise to distinctive misreadings, those letter forms may usefully be distinguished. The registration of abbreviations should make clear what letters are given in the manuscript and what letters are supplied editorially.

A transcription for paleographic and codicological purposes cannot, in the nature of things, replace a facsimile reproduction of the manuscript, any more than a facsimile can fully replace the original. But a paleographically and codicologically exact transcription can allow the systematic search of the manuscript for phenomena of interest. Such a transcription should record details of ink, hand, illumination, and gatherings so that they can be searched for later.

The main focus of this document is the definition of a simple set of rules for illustrative transcriptions and philological transcriptions. The Digital Scriptorium Project, which commissioned this work, intends its transcriptions primarily to accompany images of sample manuscript pages; they will serve to make manuscript materials accessible for further study. The rules for I-transcriptions described here are intended to make the transcription work relatively simple, and to ensure that the transcriptions provide a sound basis for further work and elaboration of the transcription, in cases where the manuscript and scholarly interests warrant the extra labor of P- or PC-transcriptions.

The Organization of This Document

This document assumes a basic knowledge of SGML or XML, and uses the tags of the Text Encoding Initiative (http://www.tei-c.org). It lists the elements which should be used for illustrative and philological transcriptions, and discusses some questions that arise in transcription of manuscripts.

First, the header is discussed, in which the manuscript and the transcription should be identified and described. Then the element types recommended for use in I- and P-transcriptions are described, "from the bottom up": beginning with the transcription of individual characters, progressing through the features associated with the work in question (and needed in L-transcriptions), the features associated with the manuscript page, the scribe, and so on.

Identification (TEI Header)

Virtually all the elements of the TEI header may be used in manuscript transcriptions; at a minimum, those described here should always be supplied.

Manuscript

Within the <sourceDesc>, the <bibl> and <msIdentifier> elements together should give a complete citation of the manuscript, with:

author(s)
title(s) of work(s)
city (<settlement>), library (<repository>), and collection
shelfmark (<idno>)

For I- and P-transcriptions it is recommended (and for PC-transcriptions it is required) that the hands in the manuscript be identified in the <handList> element within the <profileDesc> portion of the header. For further information, see below, section Scribal Hands and Hand Shifts.

Transcription

The transcription itself should be identified with at least the following information:

a title for the transcription (this may take the form A Transcription of [MS-name], if no other title imposes itself); this goes in the <title> element of the <titleStmt>, within the <fileDesc> section of the header
a list of those responsible for the transcription, and the nature of their responsibility (using <respStmt> to describe what each did)
date of publication (<date> in the <publicationStmt>)
publisher (<publisher>), distributor (<distributor>), or distribution authority (<authority>) who published / released the transcription
rights and permissions information and terms of access (<availability>)

It is good practice to supply fairly full information about the encoding of the manuscript in the <encodingDesc>, and it may be desirable to provide keyword access or a classification of the manuscript or the texts it contains, for use in search interfaces (in the <profileDesc> section).
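
As a rough sketch of how the minimum header information described above might fit together, the fragment below fills each slot with a bracketed placeholder. The placement of <msIdentifier> beside <bibl>, the hand identifier m1, and all bracketed content are assumptions made for illustration only; they should be checked against ds3.dtd and the examples in the Appendix before use.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>A Transcription of [MS-name]</title>
      <respStmt>
        <resp>transcribed by</resp>
        <name>[transcriber]</name>
      </respStmt>
    </titleStmt>
    <publicationStmt>
      <publisher>[publisher]</publisher>
      <date>[date of publication]</date>
      <availability><p>[rights, permissions, and terms of access]</p></availability>
    </publicationStmt>
    <sourceDesc>
      <bibl><author>[author]</author> <title>[title of work]</title></bibl>
      <!-- Assumption: msIdentifier is allowed here as a sibling of bibl -->
      <msIdentifier>
        <settlement>[city]</settlement>
        <repository>[library]</repository>
        <collection>[collection]</collection>
        <idno>[shelfmark]</idno>
      </msIdentifier>
    </sourceDesc>
  </fileDesc>
  <encodingDesc>
    <editorialDecl><p>[transcription level and conventions followed]</p></editorialDecl>
  </encodingDesc>
  <profileDesc>
    <handList>
      <!-- Hypothetical identifier for the main scribe -->
      <hand id="m1"/>
    </handList>
  </profileDesc>
</teiHeader>
```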

See Appendix for <teiHeader> examples from Digital Scriptorium texts.

Characters and Abbreviations

Letters and Symbols

In principle, transcriptions of manuscripts should contain all of the characters of the manuscript, in sequence. Putting this principle into practice requires determining what constitutes a character and what sequence to use when several are possible. The answers to these questions may vary with the transcription type.

All transcription types must distinguish at least the graphemes of the writing system, i.e. those graphic forms which are significantly distinct from each other in the sense that changing a symbol from one to another may change which word is written. In this sense, in modern English a "b" is graphemically distinct from a "c", because by substituting one for the other we can move from "bat" to "cat". In Old English, the characters we call eth and thorn are not graphemically distinct: they are allographs which vary freely.

Some pairs of allographs do not vary freely: in many manuscript and printed books in most Western European languages, long and short "s" are allographs in complementary distribution (short "s" in final position, long "s" elsewhere). In some cases, two characters may be graphemically distinct in some contexts, and allographs in others (in Middle English MSS, "c" and "t" are notoriously hard to distinguish, particularly in word endings corresponding to modern "-tion", where both spellings were common, and the difference in spelling carried no distinction in meaning).

For literary purposes, a graphemic transcription will suffice (and in fact, most literary editions intentionally level allographic distinctions in order to avoid distracting the reader).

For illustrative transcriptions, graphemic transcription also suffices: any reader wishing to know which of several forms was used in the manuscript can consult the accompanying image.

For philological purposes, it is normally useful to retain some, though not all, information about the actual forms used in a MS, as the forms may shed light on historical changes in orthography or on possible scribal misreadings (long and short "s" may each be misread, but they are likely to give rise to very different errors). The allographs of interest will vary with period and language, and no really plausible general rules can be set up for all medieval Western MSS; in general, though, if two common allographs of a letter have a dramatically different ductus or shape, they may be worth distinguishing for text-critical purposes, unless they are in wholly regular complementary distribution. Rare variants may be ignored, or lumped together as deviant forms.

For paleographic purposes, it may be desirable to make much finer distinctions among allographs, in order to study the usage of a scribe. No recommendations are made here as to the level of distinction appropriate for paleographic purposes.

Whatever level of transcription is being performed, the following procedure is strongly recommended:

keep a list of the graphic forms distinguished in the transcription (the current ASCII character set may be a useful beginning)
when some allograph pairs are to be distinguished, specify their distinctive features in the list of graphic forms; if upon consideration it is decided not to distinguish a pair of allographs, that fact should also be noted
for each allograph, indicate what character it is an allograph of
for each letter form (whether being distinguished or explicitly not being distinguished), it is useful to record sample shapes, either as graphic images or as paper sketches
if no ASCII character is suitable for transcribing one of the graphic forms being distinguished, then an XML entity should be defined for it; standard entity names may be found in many XML books, but there is no reason to insist upon standard entity names

Examples:
&longs;
<!ENTITY longs "s" >
&yogh;
<!ENTITY yogh "3" >
&dloop;
<!ENTITY dloop "d" >
&rround;
<!ENTITY rround "r" >
&wanglicana;
<!ENTITY wanglicana "w" >
&xanglicana;
<!ENTITY xanglicana "x" >
&eslig;
<!ENTITY eslig "es" >
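
One possible way to put such declarations to work, sketched here as a fragment rather than a complete document, is to place them in the internal subset of the document type declaration and then use the entity references in the transcription. The root element name TEI.2 and the system identifier ds3.dtd are assumed from the description of the package above; the sample words are invented for illustration and come from no manuscript.

```xml
<!DOCTYPE TEI.2 SYSTEM "ds3.dtd" [
  <!-- Fallback values: each distinguished form is rendered as a plain ASCII stand-in -->
  <!ENTITY longs "s">
  <!ENTITY yogh  "3">
]>
<!-- ... later, in the body of the transcription (invented sample words) ... -->
<p>
<lb/>&longs;o kni&yogh;t
</p>
```

Because the replacement text lives in one place, the fallback values can later be changed (for example to the corresponding Unicode characters) without touching the transcription itself.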

Abbreviations

In literary transcriptions, abbreviations may be expanded without comment; it is customary, in philological and paleographic transcriptions, to mark the expansions of abbreviations as such, with brackets or font shifts. Some projects with paleographic interests record in great detail the specific form of abbreviation used in the MS; for philological purposes the main point is simply to make clear the difference in attestation between the letters present in the MS and those supplied as an expansion of the abbreviation.

Illustrative transcriptions might plausibly take either approach: expand abbreviations silently, or record the expansion explicitly. It seems best to record any expansions explicitly, by marking supplied letters using the <expan> element type. (Note that the examples in the TEI Guidelines sometimes show the entire word in an <expan> element; it is preferable to use <expan> only for those letters which would be printed in italics or enclosed in brackets in a conventional philological transcription.) If it is desired to record the fact of an abbreviation without providing an expansion, the <abbr> element may be used, but in fact there is no requirement that abbreviations be marked as such if not expanded.

For an example of how one can use these tags, see section Verse.
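
The three options discussed above can be set side by side, using the common abbreviation for "que" that also appears in the Verse section. This is only a schematic comparison, not a prescription for any particular manuscript.

```xml
<!-- Preferred here: only the editorially supplied letters are tagged -->
q<expan>ue</expan>

<!-- Style sometimes shown in the TEI Guidelines: the whole word is tagged -->
<expan>que</expan>

<!-- Abbreviation recorded as such, with no expansion supplied -->
<abbr>q</abbr>
```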

The Work

The logical structure of the work should be recorded, using

<body>
<div1>, <div2>, <div3>, etc., for formal divisions of the text
<p> for prose paragraphs
<l> for lines of verse
<head> for headings (often rubricated in MSS)

Examples that use these tags can be found in sections Prose and Verse.
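
In outline, and with bracketed placeholders standing in for real text, the elements listed above might nest as follows. The type values and the depth of division are hypothetical; a given work may need more or fewer levels.

```xml
<body>
  <div1 type="section">
    <head>[rubricated heading]</head>
    <p>[prose paragraph]</p>
    <div2 type="subsection">
      <head>[heading of a subdivision]</head>
      <l>[a line of verse]</l>
      <l>[another line of verse]</l>
    </div2>
  </div1>
</body>
```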

The Manuscript Page

In illustrative and philological transcriptions, the details of the manuscript page can be presented in a fairly schematic form.

Foliation

Mark page boundaries using the <pb/> element; use the n attribute to give the page or foliation number: <pb n="185r"/>

Page Headings

Transcribe running heads using the forme work (<fw>) element, thus:

<pb n="185r"/>
<fw type="runhead">L<expan>IBRO</expan> DEL RELOGIO DEL ARGENT VIVO</fw>

Lines

Mark physical line breaks in a manuscript with the <lb/> element. To help track line breaks visually, we recommend placing the element consistently at the beginning of a line in your transcription, including the first line of a page or <p>. See below, section Prose.

For verse lines, determine how the lines should be grouped, then transcribe them using <lg> (line group) and <l>. To record both physical and metrical lines, use <lb/> within <l>. See below, section Verse.
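
When a metrical line runs over more than one physical line of the manuscript, the two kinds of line can be recorded together. A minimal sketch, with placeholders in place of real text:

```xml
<lg type="verse">
<l>
<lb/>[first part of the metrical line, written on one manuscript line]
<lb/>[remainder of the same metrical line, written on the next manuscript line]</l>
<l>
<lb/>[a metrical line that fits on a single manuscript line]</l>
</lg>
```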

Columns

Column breaks can be marked using the <cb/> element; the n attribute should be used to indicate the column number and the number of columns:

<pb n="185r"/>
<fw type="runhead">L<expan>IBRO</expan> DEL RELOGIO DEL ARGENT VIVO</fw>
<cb n="1/2"/>
<div1 type="section">
<head>Aqui se compieça el prologo del libro
<lb/>del relogio dell argent uiuo.</head>

<lb/><hi rend="init"> D</hi>El relogio dell agua ...

<cb n="2/2"/>
 

If the number of columns changes in the middle of a page, a column break should be noted at the point of change; the n value will enable the reader to see what is happening.
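
For instance, a page that begins with text written across its full width and then divides into two columns might be marked as follows. The folio number is hypothetical, and the use of n="1/1" for the single-column portion is an assumption that simply extends the "column number/number of columns" convention described above.

```xml
<pb n="12r"/>
<cb n="1/1"/>
<lb/>[text written across the full width of the page]
<cb n="1/2"/>
<lb/>[text of the first of two columns]
<cb n="2/2"/>
<lb/>[text of the second column]
```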

Catchwords and Gatherings

Catchwords may be (should be, in all but purely literary transcription) transcribed using the <fw> element, with suggested type="cw":

<pb n="xxix"/>
 ... uerga el toro enla natura dela uacca. Et priega las manos del toro enlas espaldas dela ua <expan resp="ed">ca</expan>
<fw type="cw">ca. &amp; priega</fw>
<pb n="xxx"/>

<fw> can also be used to record signatures (suggested type="sig") and, if the historical foliation is of interest, foliation (suggested type="fol").
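
A schematic illustration of both uses, with placeholders rather than real readings, might look like this:

```xml
<fw type="sig">[signature as written in the manuscript]</fw>
<fw type="fol">[historical foliation as written in the manuscript]</fw>
<pb n="..."/>
```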

The Written Text

Rubrication

When rubrics mark the formal divisions of the work, transcribe them as <head> elements. If they have some other function (e.g. something like the marginalia in The Rime of the Ancient Mariner), mark them up appropriately (e.g. as <note> elements).

Scribal Alterations

Record scribal additions and deletions with <add> and <del>. Editorial modifications should use <sic> and <corr>.
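
The difference between scribal and editorial intervention can be sketched as follows. The bracketed content is placeholder text, and the place value shown is only one of several that might be appropriate.

```xml
<!-- Scribal: a reading struck out and its replacement written above the line -->
<del>[cancelled reading]</del><add place="supralinear">[scribe's replacement]</add>

<!-- Editorial: manuscript reading kept in the text, correction given in the attribute -->
<sic corr="[corrected reading]">[manuscript reading]</sic>

<!-- Editorial: correction placed in the text, manuscript reading given in the attribute -->
<corr sic="[manuscript reading]">[corrected reading]</corr>
```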

Word Division

The <w> element is used to mark a unit which is to be treated as a word but which might not be recognized as such by software relying solely on white space. Because of restrictions on which elements it can contain, <w> can be difficult to use consistently. For example, it cannot enclose <expan>, which means that it cannot be used to tag abbreviated words that feature unusual spacing.

Scribal Hands and Hand Shifts

Use <handShift/> if hand shifts are of interest.

Use the hand attribute on additions and deletions to note whether the change was performed by the original scribe or a later one. It is recommended either that all hands other than that of the main scribe of the page be grouped together under the code ma (manu altera) or else that all hands be distinguished and given keys like m2, m3, etc. In either case, all the hands should be declared in the TEI header.
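
Under the first of the two schemes (all secondary hands grouped as ma), the declarations and their use might look roughly like this. The identifier m1 for the main scribe is a hypothetical choice, and the bracketed content is placeholder text.

```xml
<!-- In the header, within profileDesc -->
<handList>
  <hand id="m1"/>  <!-- main scribe (hypothetical identifier) -->
  <hand id="ma"/>  <!-- manu altera: all other hands grouped together -->
</handList>

<!-- In the text: an alteration by a later hand, and one by the main scribe -->
<add hand="ma" place="supralinear">[later addition]</add>
<del hand="m1">[deletion by the main scribe]</del>

<!-- In the text: the point at which another hand takes over the writing -->
<handShift new="ma"/>
```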

Graphic Elements

The <figure> element is used for all graphic elements: historiated initials, illuminations, and diagrams.
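
A minimal sketch, assuming that a short prose description of the image is wanted; the description itself is a placeholder:

```xml
<figure>
<figDesc>[Historiated initial: brief description of the scene depicted]</figDesc>
</figure>
```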

Examples

Excerpts below illustrate how one might tag prose and verse.

Prose

<pb n="4v"/>
<div1 type="section">

<p>
<lb/><hi rend="marg">cxxxiij.</hi>
<hi rend="3init">P</hi>rimerament dela batalla de Etio patri
<lb/>cio contra Atilla et blenda Reyes delos
<lb/>hucnos.</p>

<p>
<lb/><hi rend="marg">cxxxv.</hi>
<hi rend="1init">C</hi>omo atilla apres que fue vencido passo en Tu
<lb/>rugia que agora es dicha liege et delas cosas que
<lb/>apres se siguieron....</p>
</div1>

Verse

<pb n="68v"/>
<cb n="1/2"/>
<div1 type="section">
<lg type="verse">
<l>...</l>
<l>bueno a<sic corr=" "></sic>dios /. &amp; bueno al mu<expan>n</expan>do</l>
<l>esto yo /. Lo Jurare/.</l>
</lg></div1>
<div1 type="section">
<head>
<lb/>Este dezir fizo &amp; ordeno mjçer
<lb/>fra<expan>n</expan>çisco ynperial natural de jeno
<lb/>ua estante &amp; morador q<expan>ue</expan> fue enla
<lb/>...
<lb/>&amp; sotil Jnvençion E de limadas
<lb/>Diçiones
</head>
<cb n="2/2"/>
<lg type="verse">
<l>En dos seteçie<expan>n</expan>tos /. &amp; mas doss &amp; tres</l>
<l>...</l>
<l>valed me señora /. espera<corr><expan>n</expan></corr> ça mja</l>
</lg>
<lg type="verse">
<l>En<sic corr=" "></sic>bozes mas baxas /. le oy dezjr ...</l>
<l><foreign lang="eng">modhed god hep</foreign> /. alu<expan>n</expan>brad m <sic corr=""></sic>agor<hi rend="superscript">a</hi></l>
<l>&amp; a guissa de dueña /. q<expan>ue</expan> deuota ora</l>
<l><foreign lang="lat">quam bonus deus</foreign> /. le oy Rezar</l>
<l>&amp; oyle a<sic corr=" "></sic>manera /. De apiaDar</l>
<l><foreign lang="arb">çayha bical habin /. al cabila mora</foreign></l>
</lg>
</div1>

A Note on Attributes

An XML tag consists minimally of a tag or element name enclosed in angle brackets, such as <p>. The DTD's definition of that tag determines whether it contains other material (<l>Ki vult oir e vult saveir</l>, e.g.) or is empty (<pb/>). The DTD can also define one or more valid attributes for a particular element. An attribute's value may be constrained by a short list defined in the DTD, or the DTD may allow one to enter nearly any value one deems appropriate.

In addition to element-specific attributes, the TEI Guidelines define four global attributes that can occur in any TEI element: id (unique identifier), n (non-unique number or other label), lang (language), and rend (rendering). One can specify, in other words, that the line quoted above is in Old French (<l lang="fro">), that it occupies the first line of its text (<l n="1">), or both (<l n="1" lang="fro">).
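
The four global attributes can be shown together on the line quoted above. The sketch assumes that, as in TEI P4 generally, a lang value refers to a <language> element declared in the header; the id and rend values are hypothetical.

```xml
<!-- In the header: declare the language code that lang will refer to -->
<profileDesc>
  <langUsage>
    <language id="fro">Old French</language>
  </langUsage>
</profileDesc>

<!-- In the text: id, n, lang, and rend on one line of verse -->
<l id="v1" n="1" lang="fro" rend="init">Ki vult oir e vult saveir</l>
```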

For more information, see TEI Guidelines 3.5 Global Attributes, 35 Elements, and 3.7 Element Classes.
