4 Basic Text Encoding

A TEI-conformant electronic text consists of the text itself (transcribed from some source, or created in electronic form), preceded by a TEI header, which identifies the electronic text and can also document the encoding practices used in creating it. The entire thing is enclosed within a tei.2 element, and preceded by an SGML declaration identifying the document type to be used in validating the document.

The SGML declaration won't be described here. Further below, I'll discuss the TEI header, and the specialized tags for front matter and back matter of the main text. In work with electronic text, however, the vast majority of one's time is spent within the body of the text itself, and so I begin with a description of tags for basic text encoding: paragraphs and other paragraph-like things, character- or phrase-level elements which occur within paragraphs, and so on.

4.1 Paragraphs

Mark paragraphs with the tag p. Paragraphs do not nest, and neither may p elements. For example:

 
<p>I call specific attention to
the authority given by the 21st Amendment
to the Constitution to prohibit transportation
or importation of intoxicating liquors into
any State in violation of the laws of such
State.</p>
<p>I ask the wholehearted cooperation of all our
citizens to the end that this return of individual
freedom shall not be accompanied by the repugnant
conditions that obtained prior to the adoption of
the 18th Amendment and those that have existed
since its adoption.  Failure to do this honestly
and courageously will be a living reproach to us
all.</p>
<p>I ask especially that no State shall by law
or otherwise authorize the return of the saloon
either in its old form or in some modern guise.
</p>

[This example, like most of the others not otherwise identified, is from Franklin D. Roosevelt's proclamation upon the repeal of Prohibition, in The Public Papers and Addresses of Franklin D. Roosevelt, vol. II (New York: Random House, 1938), pp. 510-514.]

4.2 Highlighted Phrases

Phrases which are highlighted in the source (or should be highlighted in the output), whether by italics, boldface, small caps, or other special treatment, should be tagged with the hi element. The rend attribute may optionally say how the phrase was highlighted. In the example below, the word whereas and the phrase therefore, I, Franklin D. Roosevelt are printed in small caps in the source:

 
<p><hi rend='sc'>Whereas</hi> the
Congress of the United States ... </p>
<p><hi rend='sc'>Whereas</hi> Section 217(a) of
the Act of Congress entitled "An Act ..." ...</p>
<p><hi rend='sc'>Whereas</hi> it appears ...
</p>
<p>Now, <hi rend='sc'>therefore, I, Franklin
D. Roosevelt</hi>, President of the United
States of America ... do hereby proclaim that
the Eighteenth Amendment to the Constitution of
the United States was repealed on the fifth
day of December, 1933.</p>

The rend attribute may be omitted if the rendering is of no interest, or if all highlighted phrases are rendered the same way. Its values may be chosen freely by the encoder; the chosen values can then be used to direct processing software to display or process the element appropriately.
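As a small sketch of both points, the fragment below tags the same word once with no rend attribute and once with a different label for small caps. The value 'smallcaps' is an arbitrary name invented for this illustration; what matters is only that the encoder uses the chosen value consistently.

```sgml
<!-- rendering not recorded -->
<p><hi>Whereas</hi> the
Congress of the United States ... </p>

<!-- same phrase, with an encoder-chosen value (hypothetical name) -->
<p><hi rend='smallcaps'>Whereas</hi> the
Congress of the United States ... </p>
```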

[It is normally preferable to mark phrases with element types indicating why they are highlighted, rather than simply indicating that they are highlighted. The full TEI encoding scheme defines elements which allow typographic highlighting to be identified as marking linguistic emphasis (emph), words in foreign languages (foreign), words in non-standard or specialized languages (distinct), technical terms (term), glosses on terms (gloss), and words mentioned rather than used (mentioned). The generic hi element is normally used only when it is economically or intellectually infeasible to supply one of the more informative alternatives.]

4.3 Quotations

Mark quotations from other works, or dialog spoken by characters in a narrative, as q (quotation) elements:

 
<p><hi rend='sc'>Whereas</hi>
Section 217(a) of the Act of Congress
entitled "An Act ..." approved June 16,
1933, provides as follows:
<q>Section 217(a) The President shall
proclaim the ... </q></p>

Block quotations and inline quotations are distinguished only by the value of their rend attribute; for the former, use the value "block" or "display", for the latter, use "inline".
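A minimal sketch of the distinction follows. Both fragments reuse wording from other examples in this section; the sentence framing the first quotation is supplied here for illustration only.

```sgml
<!-- inline quotation: runs on within the sentence -->
<p>The Chicago Tribune has
<q rend='inline'>our poor power.</q></p>

<!-- block quotation: set off from the surrounding text -->
<p><hi rend='sc'>Whereas</hi> Section 217(a) ...
provides as follows:
<q rend='block'>Section 217(a) The President shall
proclaim the ... </q></p>
```

Processing software could then, for instance, indent the block quotation and supply quotation marks around the inline one.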

[The full TEI scheme also provides a quote element which is restricted to real quotations from external sources, and unlike q may not be used for direct discourse and fictive quotations. Also provided there but missing here are cit, for quotations with attached bibliographic references to their sources, and soCalled, for material printed with "scare quotes" to indicate that the author disclaims full responsibility for it.]

4.4 Cross References

References to other documents, or to other locations in the current document, should be tagged with the ref tag:

 
WHEREAS <ref>Section 217(a) of
the Act of Congress ... approved June
16, 1933</ref>, provides as follows: ...

[The full scheme defines an empty element called ptr for use when the actual phrase referring to the other document or section can be generated automatically by software, as is usually done in document production systems.]

For cross references within the same SGML document, the target attribute may be used to indicate which section is being referred to; its value is the id value assigned to some element in the document. For example, the following cross reference:

 
I there expressed the hope,
and asked for united cooperation, that
this return of individual freedom would
not be accompanied by anti-social
conditions, such as the saloon and the
other evils of the pre-prohibition era.
(See also <ref target='pc1933-10-11'>Press
Conference of October 11, 1933, Item 137,
this volume</ref>.)

assumes the existence of some element elsewhere in the volume with the identifier given:

 
<div id='pc1933-10-11'>
<head>Press Conference, 11 October 1933</head>
<!-- ... -->
</div>

[This example is from the note in the Public Papers which follows the proclamation of the repeal of Prohibition.]

The div and head elements used in the example just given are described below.
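For comparison, the same cross reference might be sketched with the ptr element of the full scheme mentioned above. This assumes that ptr carries a target attribute in the same way as ref, and that processing software generates the wording of the reference, for example from the head of the target div.

```sgml
<!-- ptr is empty: it has no content and no end-tag -->
(See also <ptr target='pc1933-10-11'>.)
```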

4.5 Page Breaks

If the page breaks of the source are of interest, as they generally are for material transcribed from existing printed editions, record them using the pb element. This element is empty: that is, it has neither content nor an end-tag. It does not mark a passage or portion of the text, just a location within the text. The attribute n, defined for all TEI elements, should be used to indicate the page number; if page numbers from more than one edition are transcribed, the attribute ed should be used to distinguish the different paginations:

 
<p>I ask the wholehearted cooperation
of all our citizens to the end that
this return of individual freedom shall
not be accompanied by the repugnant
conditions that obtained prior to the
<pb n='512' ed='1938'>
adoption of
the 18th Amendment and those that
have existed  since its adoption....</p>

[In addition to page breaks, column and line breaks may be of interest; the full TEI scheme defines cb and lb elements for these, as well as a generic milestone element for boundaries and breaks of unforeseen type. Specialized tags in the TEI header can describe how these milestone elements are used in standard reference schemes for the work.]

4.6 Verse

Individual verse lines should be tagged with l (that's an "L"); stanzas or other verse structures above the level of the line should be tagged lg (line group). The type attribute of lg may optionally be used to identify the formal structure in question, for retrieval or other purposes:

 
<lg type='quatrain'>
<l>Awake! for Morning in the Bowl of Night</l>
<l>Has flung the Stone that puts the Stars to Flight:</l>
<l>And Lo! the Hunter of the East
has caught</l>
<l>The Sultan's Turret in a Noose of Light.</l>
</lg>

[Example is from Rubáiyát of Omar Khayyám, tr. Edward Fitzgerald (New York: Collier; London: Collier-Macmillan, 1962), first quatrain of the first edition.]

When the indentation of the lines is significant, it can be recorded using the global rend attribute, with some suitable value:

 
<l rend='indent'>And Lo! the Hunter
of the East has caught</l>
<l>The Sultan's Turret in a Noose of Light.</l>

Of course, if the verse is quoted from another text, the l elements should be enclosed in a q element.
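A minimal sketch of such a quotation, reusing two lines of the quatrain above. The introductory sentence is invented for illustration, and the rend value follows the convention for block quotations given in section 4.3.

```sgml
<p>The first edition opens with the lines
<q rend='block'>
<l>Awake! for Morning in the Bowl of Night</l>
<l>Has flung the Stone that puts the Stars to Flight:</l>
</q></p>
```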

4.7 Drama

Drama should be encoded with the elements sp (speech) and stage (stage direction). Stage directions can occur either within speeches or between them. As may be seen in the example below, the speaker may be indicated with the who attribute on the sp element:

 
<sp who='Casca'>
<l>Speak, hands, for me!</l></sp>
<stage>They stab Caesar.</stage>
<sp who='Julius Caesar'>
<l>Et tu, Brute? -- then fall, Caesar!</l>
<stage>Dies.</stage></sp>

[Example is from a modern student reprint of Julius Caesar, III.i: William Shakespeare, The Tragedy of Julius Caesar (New York: Airmont, 1965).]

When the precise form of the speaker attribution in the source is important, the speaker may be identified by a separate speaker element at the beginning of the sp element:

 
<sp><speaker>Cas.</speaker>
<l>Speak, hands, for me!</l></sp>
<stage>They stab Caesar.</stage>
<sp><speaker>Caes.</speaker>
<l>Et tu, Brute? -- then fall, Caesar!</l>
<stage>Dies.</stage></sp>

These tags may also be used for material not written as drama, but presented using dramatic conventions (e.g. transcriptions of speeches, or of press conferences):

 
The brave men living and dead
who struggled here have consecrated it
far above our power to add or detract.
<stage>[Applause.]</stage>
<!-- ... -->
and that Governments of the people,
by the people, and for the people,
shall not perish from the earth.
<stage>[Long-continued applause.]
</stage>

[Newspaper version of Abraham Lincoln, "Address Delivered at the Dedication of the Cemetery at Gettysburg," in The Collected Works of Abraham Lincoln, ed. Roy P. Basler, vol. VII (New Brunswick: Rutgers University Press, 1953), pp. 20-21. Since in this text such stage directions are always printed in brackets, the encoder might choose to omit the square brackets from the transcription, noting in the header that stage elements are always bracketed.]

As with verse, if the drama is quoted from another text, it should be enclosed in a q element.

4.8 Bibliographic References

Bibliographic references should normally be enclosed in bibl elements; within such elements, or outside them, title may be used to mark titles of articles, books, journals, etc. Its level attribute takes the values A, M, J, S, or U to show whether the title is an analytic (article) title, a monographic (book) title, the title of a journal, that of a series, or that of unpublished material such as a thesis. For example, a reference to "Inaugural Address, March 4, 1933," in The Public Papers and Addresses of Franklin D. Roosevelt, vol. II (New York: Random House, 1938), pp. 11-16, would be encoded thus:

 
<bibl>
<title level='A'>Inaugural Address,
March 4, 1933</title>, in
<title level='M'>The Public Papers and
Addresses of Franklin D. Roosevelt
</title>, vol. II
(New York:  Random House, 1938), pp. 11-16.
</bibl>

[Omitted from this bare-bones tag set are tags for other bibliographic elements, such as author, editor, publisher, and so on. Also omitted are the elements biblStruct and biblFull, which require consistently structured bibliographic entries and are useful when all the items in a bibliography must be structured correctly (e.g., for machine processing).]

4.9 Omissions

If material has been omitted from an electronic text (e.g. because it is illegible or not of interest to the expected users), the omission should normally be indicated using a gap element at the point of omission. The attributes desc, reason, and extent may optionally be used to describe what was omitted, to explain why, and to give an approximate size for it. For example:

 
<p>
Suppose I see two individuals approaching
whose rank I wish to ascertain.  They are,
we will suppose, a Merchant and a Physician,
or in other words, an Equilateral Triangle
and a Pentagon:  how am I to distinguish
them?</p>
<p><gap desc='geometric figure'
reason='editorial policy'
extent='ca. 14 lines'></p>
<p>It will be obvious ... </p>

[Example is from Edwin A. Abbott, Flatland: A Romance of Many Dimensions (1884; rpt. New York: Dover, 1992), p. 19, extract from chapter 6, "Recognition by Sight."]

[The bare-bones tag set omits the elements defined by the standard TEI tag set for marking other kinds of editorial interventions or authorial alterations to a text, such as cancellations, insertions, corrections or failure to correct errors, normalized spelling, illegible writing or inaudible speech, and the expansion of abbreviations.]

4.10 Notes

Notes in the text, whether footnotes, endnotes, or inline block notes, should be tagged with the note element. The location may be given, if desired, in the place attribute. Authorial notes may be distinguished from editorial notes by means of the resp attribute, which indicates who is responsible for the note. For example:

 
<p>IN WITNESS WHEREOF,
I have hereunto set my hand and caused
the seal of the United States to be
affixed.</p>
<note resp='ed' place='inline'><p>The 72d
Congress, which
convened following the 1932 election,
passed the Twenty-first Amendment to the
Constitution to repeal the Eighteenth
Amendment.</p>
<p> ... </p>
</note>

Footnotes and endnotes should normally be transcribed at their point of attachment. Their number may optionally be given in the n attribute:

 
... have consecrated it
far above our power<note place='foot' n='21'>
Philadelphia <title level='J'>Inquirer</title> has
<q>our poor attempts</q> and Chicago
<title level='J'>Tribune</title> has
<q>our poor power.</q></note>
to add or detract.

4.11 Lists

Lists should be tagged using the list and item elements; a heading or title for the list should be tagged as a head. Lists may be distinguished as ordered (numbered), unordered (bulleted), etc., by means of the type attribute. For example:

 
The President shall proclaim
the date of
<list type='ordered'>
<item n='(1)'>the close of the first fiscal
year ending June 30 of any year after the
year 1933, in which ..., or</item>
<item n='(2)'>the repeal of the
eighteenth amendment to the Constitution,
</item>
</list>
whichever is the earlier.
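The list above has no heading. A sketch of a list with a head follows; the heading text is invented for illustration, and the type value 'bulleted' is an assumption, since the text above does not fix the exact values of type.

```sgml
<list type='bulleted'>
<head>Amendments named in the proclamation</head>
<item>the Eighteenth Amendment</item>
<item>the Twenty-first Amendment</item>
</list>
```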

The full TEI scheme also defines a label element for use as an alternative to using the n attribute to give item numbers or labels.

4.12 What Is Missing?

Notes in the preceding sections have mentioned some of the elements defined in the full TEI scheme's core tag set but omitted from this bare-bones version. In addition to those already mentioned, tags omitted here include those for proper nouns and other references to people and places, addresses, numbers, units of measure and measured quantities, dates, and times of day.

The full scheme also defines optional tag sets for hypertext linking, analysis or interpretation (including both literary and linguistic analysis) of the text, manuscript transcription, text-critical apparatus, tables, figures, and other specialized interests.