Prefatory note
TEI Lite was the name adopted for what the TEI editors originally conceived of as a simple demonstration of how the TEI encoding scheme might be adopted to meet 90% of the needs of 90% of the TEI user community. In retrospect, it was predictable that many people should imagine TEI Lite to be all there is to TEI, or find TEI Lite to be far too heavy for their needs (to meet the latter criticism, Michael also prepared a special barebones version of TEI Lite).
TEI Lite was based largely on our observations of existing and previous practice in the encoding of texts, particularly as manifest in the collections of the Oxford Text Archive and in our own experience. It is therefore unsurprising that it seems to have become, if not a de facto standard, at least a common point of departure for electronic text centres and encoding projects worldwide. Maybe the fact that we actually produced this shortish, readable manual for it also helped.
That manual was, of course, authored and is maintained in the DTD it describes, originally as an SGML document and now as an XML one. This makes it easy to produce a number of differently formatted versions in HTML, PDF, etc., some of which can be found in The TEI Vault.
Early adopters of TEI Lite included a number of ‘Electronic Text Centers’, many of whom produced their own documentation and tutorial materials (some examples are listed in the TEI Tutorials pages).
With the publication of TEI P4, the XML version of the TEI Guidelines, which uses the generation of TEI Lite as an example of the Modification mechanism built into the TEI Guidelines, the opportunity has been taken to produce a lightly revised version of the present document. This revision documents the XML version of the TEI Lite DTD.
Lou Burnard, May 2002
Contents
- Introduction
- A Short Example
- The Structure of a TEI Text
- Encoding the Body
- Page and Line Numbers
- Marking Highlighted Phrases
- Notes
- Cross References and Links
- Editorial Interventions
- Omissions, Deletions, and Additions
- Names, Dates, Numbers and Abbreviations
- Lists
- Bibliographic Citations
- Tables
- Figures and Graphics
- Interpretation and Analysis
- Technical Documentation
- Character Sets, Diacritics, etc.
- Front and Back Matter
- The Electronic Title Page
- Appendix A: List of Elements Described
- Appendix A.1: Global Attributes
- Appendix A.2: Elements in TEI Lite
This document provides an introduction to the recommendations of the Text Encoding Initiative (TEI), by describing a manageable subset of the full TEI encoding scheme. The scheme documented here can be used to encode a wide variety of commonly encountered textual features, in such a way as to maximize the usability of electronic transcriptions and to facilitate their interchange among scholars using different computer systems. It is also fully compatible with the full TEI scheme, as defined by TEI document P4, Guidelines for Electronic Text Encoding and Interchange, published in May 2002, and available from the TEI Consortium website at http://www.tei-c.org/Guidelines/P4/html/index.html.
Introduction
The Text Encoding Initiative (TEI) Guidelines are addressed to anyone who wants to interchange information stored in an electronic form. They emphasize the interchange of textual information, but other forms of information such as images and sound are also addressed. The Guidelines are equally applicable in the creation of new resources and in the interchange of existing ones.
The Guidelines provide a means of making explicit certain features of a text in such a way as to aid the processing of that text by computer programs running on different machines. This process of making explicit we call markup or encoding. Any textual representation on a computer uses some form of markup; the TEI came into being partly because of the enormous variety of mutually incomprehensible encoding schemes currently besetting scholarship, and partly because of the expanding range of scholarly uses now being identified for texts in electronic form.
The TEI Guidelines describe an encoding scheme which can be expressed using a number of different formal languages. The first editions of the Guidelines used the Standard Generalized Markup Language (SGML); the most recent edition (TEI P4, 2002) can also be expressed in the Extensible Markup Language (XML); future versions may also be expressible in other schema languages. Such languages have in common the definition of text in terms of elements and attributes, and rules governing their appearance within a text. The TEI's use of XML is ambitious in its complexity and generality, but it is fundamentally no different from that of any other XML markup scheme, and so any general-purpose XML-aware software is able to process TEI-conformant texts.
The TEI was sponsored by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing, and is now maintained and developed by an independent membership consortium, hosted by four major Universities. Funding has been provided in part from the U.S. National Endowment for the Humanities, Directorate General XIII of the Commission of the European Communities, the Andrew W. Mellon Foundation, and the Social Science and Humanities Research Council of Canada. The Guidelines were first published in May 1994, after six years of development involving many hundreds of scholars from different academic disciplines worldwide. During the years that followed, the Guidelines were increasingly influential in the development of the digital library, in the language industries, and even in the development of the World Wide Web itself. The TEI consortium was set up in January 2001, and a year later produced the current fully revised edition of the Guidelines, which has been entirely revised for XML compatibility.
According to their design goals, the Guidelines should:
- suffice to represent the textual features needed for research;
- be simple, clear, and concrete;
- be easy for researchers to use without special-purpose software;
- allow the rigorous definition and efficient processing of texts;
- provide for user-defined extensions;
- conform to existing and emergent standards.
It was also considered important that:
- the common core of textual features be easily shared;
- additional specialist features be easy to add to (or remove from) a text;
- multiple parallel encodings of the same feature should be possible;
- the richness of markup should be user-defined, with a very small minimal requirement;
- adequate documentation of the text and its encoding should be provided.
The present document describes a manageable selection from the extensive set of elements and recommendations resulting from those design goals, which is called TEI Lite.
In selecting from the several hundred elements defined by the full TEI scheme, we have tried to identify a useful ‘starter set’, comprising the elements which almost every user should know about. Experience working with TEI Lite will be invaluable in understanding the full TEI DTD and in knowing which optional parts of the full DTD are necessary for work with particular types of text.
Our goals in defining this subset were as follows:
- it should include most of the TEI ‘core’ tag set, since this contains elements relevant to virtually all text types and all kinds of text-processing work;
- it should be able to handle adequately a reasonably wide variety of texts, at the level of detail found in existing practice (as demonstrated in, for example, the holdings of the Oxford Text Archive);
- it should be useful for the production of new documents as well as encoding of existing ones;
- it should be usable with a wide range of existing XML software;
- it should be derivable from the full TEI DTD using the extension mechanisms described in the TEI Guidelines;
- it should be as small and simple as is consistent with the other goals.
The reader may judge our success in meeting these goals for him or herself. At the time of writing (1995), our confidence that we have at least partially done so is borne out by its use in practice for the encoding of real texts. The Oxford Text Archive uses TEI Lite when it translates texts from its holdings from their original markup schemes into SGML; the Electronic Text Centers at the University of Virginia and the University of Michigan have used TEI Lite to encode their holdings. And the Text Encoding Initiative itself uses TEI Lite, in its current technical documentation — including this document.
Although we have tried to make this document self-contained, as suits a tutorial text, the reader should be aware that it does not cover every detail of the TEI encoding scheme. All of the elements described here are fully documented in the TEI Guidelines themselves, which should be consulted for authoritative reference information on these, and on the many others which are not described here. Some basic knowledge of XML is assumed.
A Short Example
We begin with a short example, intended to show what happens when a passage of prose is typed into a computer by someone with little sense of the purpose of mark-up, or the potential of electronic texts. In an ideal world, such output might be generated by a very accurate optical scanner. It attempts to be faithful to the appearance of the printed text, by retaining the original line breaks, by introducing blanks to represent the layout of the original headings and page breaks, and so forth. Where characters not available on the keyboard are needed (such as the accented letter a in faàl or the long dash), it attempts to mimic their appearance.
A transcription of this kind suffers from a number of shortcomings:
- the page numbers and running titles are intermingled with the text in a way which makes it difficult for software to disentangle them;
- no distinction is made between single quotation marks and apostrophe, so it is difficult to know exactly which passages are in direct speech;
- the preservation of the copy text's hyphenation means that simple-minded search programs will not find the broken words;
- the accented letter in faàl and the long dash have been rendered by ad hoc keying conventions which follow no standard pattern and will be processed correctly only if the transcriber remembers to mention them in the documentation;
- paragraph divisions are marked only by the use of white space, and hard carriage returns have been introduced at the end of each line. Consequently, if the size of type used to print the text changes, reformatting will be problematic.
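When the same passage is encoded following the TEI Guidelines, the result differs in several respects:
- Paragraph divisions are now marked explicitly.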
- Apostrophes are distinguished from quotation marks.
- Entity references are used for the accented letter and the long dash.
- Page divisions have been marked with an empty pb element alone.
- To simplify searching and processing, the lineation of the original has not been retained and words broken by typographic accident at the end of a line have been re-assembled without comment. If the original lineation were of interest, as it might be for an important printing, it could easily be recorded, though it has not been here.
- For convenience of proof reading, a new line has been introduced at the start of each paragraph, but the indentation is removed.
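As a hedged illustration of the points just listed, such a passage might be encoded roughly as follows. The passage is taken from chapter 38 of Jane Eyre (the source of the word faàl mentioned above). The page number, the div1 wrapper, and the use of the q element for direct speech are assumptions made for this sketch. The entity references assume that mdash and agrave are declared in the DTD in use.

```xml
<pb n="474"/>
<div1 type="chapter" n="38">
<p>Reader, I married him. A quiet wedding we had: he and I, the
parson and clerk, were alone present. When we got back from church, I
went into the kitchen of the manor-house, where Mary was cooking the
dinner and John cleaning the knives, and I said &mdash;</p>
<p><q>Mary, I have been married to Mr Rochester this morning.</q></p>
<!-- intervening text omitted from this sketch -->
<p><q>If she ben't one o' th' handsomest, she's noan fa&agrave;l and
varry good-natured; and i' his een she's fair beautiful, onybody may
see that.</q></p>
</div1>
```

Note that the apostrophes remain ordinary characters, while the quotation marks have been replaced by tags.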
Such an encoding could be extended in many further ways, for example:
- a regularized form of the passages in dialect could be provided;
- footnotes glossing or commenting on any passage could be added;
- pointers linking parts of this text to others could be added;
- proper names of various kinds could be distinguished from the surrounding text;
- detailed bibliographic information about the text's provenance and context could be prefixed to it;
- a linguistic analysis of the passage into sentences, clauses, words, etc., could be provided, each unit being associated with appropriate category codes;
- the text could be segmented into narrative or discourse units;
- systematic analysis or interpretation of the text could be included in the encoding, with potentially complex alignment or linkage between the text and the analysis, or between the text and one or more translations of it;
- passages in the text could be linked to images or sound held on other media.
The Structure of a TEI Text
All TEI-conformant texts contain (a) a TEI header (marked up as a teiHeader element) and (b) the transcription of the text proper (marked up as a text element).
The TEI header provides information analogous to that provided by the title page of a printed text. It has up to four parts: a bibliographic description of the machine-readable text, a description of the way it has been encoded, a non-bibliographic description of the text (a text profile), and a revision history. The header is described in more detail in section The Electronic Title Page.
A TEI text may be unitary (a single work) or composite (a collection of single works, such as an anthology). In either case, the text may have an optional front or back. In between is the body of the text, which, in the case of a composite text, may consist of groups, each containing more groups or texts.
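The overall shape described above can be sketched as follows. This is a minimal outline, not a complete valid document: the root element name TEI.2 is the one used by TEI P4, the four header components named in the comment are the standard ones, and all content is elided.

```xml
<!-- A unitary text -->
<TEI.2>
  <teiHeader>
    <!-- fileDesc, encodingDesc, profileDesc, revisionDesc -->
  </teiHeader>
  <text>
    <front><!-- optional front matter --></front>
    <body><!-- the text proper --></body>
    <back><!-- optional back matter --></back>
  </text>
</TEI.2>

<!-- A composite text: group takes the place of body -->
<TEI.2>
  <teiHeader><!-- header for the whole collection --></teiHeader>
  <text>
    <front><!-- front matter for the collection --></front>
    <group>
      <text><body><!-- first work --></body></text>
      <text><body><!-- second work --></body></text>
    </group>
    <back><!-- back matter for the collection --></back>
  </text>
</TEI.2>
```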
In the remainder of this document, we discuss chiefly simple text structures. The discussion in each case consists of a short list of relevant TEI elements with a brief definition of each, followed by definitions for any attributes specific to that element. In most cases, short examples are also given.
Encoding the Body
- front
- contains any prefatory matter (headers, title page, prefaces, dedications, etc.) found before the start of a text proper.
- group
- contains a number of unitary texts or groups of texts.
- body
- contains the whole body of a single unitary text, excluding any front or back matter.
- back
- contains any appendixes, etc., following the main part of a text.
Text Division Elements
- p
- marks paragraphs in prose.
- div
- contains a subdivision of the front, body, or back of a text.
- div1
- contains a first-level subdivision of the front, body, or back of a text (the largest, if div0 is not used, the second largest if it is).
When structural subdivisions smaller than a div1 are necessary, a div1 may be divided into div2 elements, a div2 into smaller div3 elements, etc., down to the level of div7. If more than seven levels of structural division are present, one must either modify the TEI tag set to accept div8, etc., or else use the unnumbered div element: a div may be subdivided by smaller div elements, without limit to the depth of nesting.
- type
- This indicates the conventional name for this category of text division. Its value will typically be ‘Book’, ‘Chapter’, ‘Poem’, etc. Other possible values include ‘Group’ for groups of poems, etc., treated as a single unit, ‘Sonnet’, ‘Speech’, and ‘Song’. Note that whatever value is supplied for the type attribute of the first div, div1, div2, etc., in a text is assumed to apply for all subsequent div, div1, div2 (etc.) elements within the same body. This implies that a value must be given for the first division element of each type, or whenever the value changes.
- id
- This specifies a unique identifier for the division, which may be used for cross references or other links to it, such as a commentary, as further discussed in section Cross References and Links. It is often useful to provide an id attribute for every major structural unit in a text, and to derive the ID values in some systematic way, for example by appending a section number to a short code for the title of the work in question, as in the examples below.
- n
- The n attribute specifies a mnemonic short name or number for the division, which can be used to identify it in preference to the value given for the id attribute. If a conventional form of reference or abbreviation for the parts of a work already exists (such as the book/chapter/verse pattern of Biblical citations), the n attribute is the place to record it.
Headings and Closings
- head
- contains any heading, for example, the title of a section, or the heading of a list or glossary.
- trailer
- contains a closing title or footer appearing at the end of a division of a text.
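The division, heading, and closing elements described above might be combined as in the following sketch. The work, its headings, and the identifier scheme (a hypothetical short code XY followed by book and chapter numbers) are invented for illustration.

```xml
<body>
  <div1 type="book" n="I" id="XY0100">
    <head>Book I</head>
    <div2 type="chapter" n="1" id="XY0101">
      <head>Chapter 1</head>
      <p>Text of the first chapter ...</p>
      <trailer>End of Chapter 1</trailer>
    </div2>
    <!-- type is not repeated: the value "chapter" is assumed to carry over -->
    <div2 n="2" id="XY0102">
      <head>Chapter 2</head>
      <p>Text of the second chapter ...</p>
    </div2>
  </div1>
</body>
```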
Prose, Verse and Drama
- l
- contains a single, possibly incomplete, line of verse.
Attributes include:
- part
- specifies whether or not the line is metrically complete. Legal values are: F for the final part of an incomplete line, Y if the line is metrically incomplete, N if the line is complete, or if no claim is made as to its completeness, I for the initial part of an incomplete line, M for a medial part of an incomplete line.
- lg
- contains a group of verse lines functioning as a formal unit e.g. a stanza, refrain, verse paragraph, etc.
- sp
- contains an individual speech in a performance text, or a passage presented as such in a prose or verse text. Attributes include:
- who
- identifies the speaker of the part by supplying an ID.
- speaker
- contains a special form of heading or label, giving the name of one or more speakers in a performance text or fragment.
- stage
- contains any kind of stage direction within a performance text or fragment. Attributes include:
- type
- indicates the kind of stage direction. Suggested values include entrance, exit, setting, delivery, etc.
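A short sketch of these elements in use, based on an exchange from Macbeth (act 2, scene 2). Treating the four short speeches as parts of a single metrical line is an editorial assumption made for this example, and the speaker identifiers LM and M are hypothetical: the who attribute supplies IDs, which would have to be declared elsewhere in the document, for instance in a cast list.

```xml
<stage type="entrance">Enter Macbeth.</stage>
<sp who="LM"><speaker>Lady Macbeth</speaker>
  <l>I heard the owl scream and the crickets cry.</l>
  <l part="I">Did not you speak?</l>
</sp>
<sp who="M"><speaker>Macbeth</speaker>
  <l part="M">When?</l>
</sp>
<sp who="LM"><speaker>Lady Macbeth</speaker>
  <l part="M">Now.</l>
</sp>
<sp who="M"><speaker>Macbeth</speaker>
  <l part="F">As I descended?</l>
</sp>
```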
Note that the l element marks verse lines, not typographic lines: the original lineation of the first few lines above has not therefore been made explicit by this encoding, and may be lost. The lb element described in section Page and Line Numbers may be used to mark typographic lines if so desired.
Page and Line Numbers
- pb
- marks the boundary between one page of a text and the next in a standard reference system.
- lb
- marks the start of a new (typographic) line in some edition or version of a text.
- ed
- indicates the edition or version in which the page break is located at this point.
When working from a paginated original, it is often useful to record its pagination, if only to simplify later proof-reading. Recording the line breaks may be useful for the same reason; treatment of end-of-line hyphenation in printed source texts will require some consideration.
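For instance, page and line breaks might be recorded as in this sketch. The page number and the value of the ed attribute are hypothetical.

```xml
<p>... the closing words of one page
<pb n="12" ed="1st"/>
<lb/>and the first printed line of the next page,
<lb/>each typographic line being introduced by an lb element.</p>
```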
- milestone
- marks the boundary between sections of a text, as indicated by changes in a standard reference system. Attributes include:
- ed
- indicates the edition or version to which the milestone applies.
- unit
- indicates what kind of section is changing at this milestone.
The names used for types of unit and for editions referred to by the ed and unit attributes may be chosen freely, but should be documented in the header.
The milestone element may be used to replace the others, or the others may be used as a set; they should not be mixed arbitrarily.
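A sketch of the milestone-only approach, recording the pagination of two editions at once. The edition names A and B and the page numbers are hypothetical and would need to be documented in the header.

```xml
<p>
<milestone ed="A" unit="page" n="23"/>
Text at the top of page 23 in edition A ...
<milestone ed="B" unit="page" n="31"/>
... while page 31 of edition B begins at this point.
</p>
```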
Marking Highlighted Phrases
Changes of Typeface, etc.
Highlighted words or phrases are those made visibly different from the rest of the text, typically by a change of type font, handwriting style, or ink color, intended to draw the reader's attention to them.
The global rend attribute can be attached to any element, and used wherever necessary to specify details of the highlighting used for it. For example, a heading rendered in bold might be tagged <head rend="bold">, and one in italic <head rend="italic">.
- hi
- marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made.
- emph
- marks words or phrases which are stressed or emphasized for linguistic or rhetorical effect.
- foreign
- identifies a word or phrase as belonging to some language other than that of the surrounding text.
- mentioned
- marks words or phrases mentioned, not used.
- term
- contains a single-word, multi-word or symbolic designation which is regarded as a technical term.
- title
- contains the title of a work, whether article, book, journal, or series, including any alternative titles or subtitles. Attributes include:
- level
- indicates whether this is the title of an article, book, journal, series, or unpublished material. Legal values are: m for monographic title (book, collection, or other item published as a distinct item, including single volumes of multi-volume works); s (series title); j (journal title); u for title of unpublished material (including theses and dissertations unless published by a commercial press); a for analytic title (article, poem, or other item published as part of a larger item).
- type
- classifies the title according to some convenient typology. Sample values include: abbreviated, main, subordinate (for subtitles and titles of parts), and parallel (for alternate titles, often in another language, by which the work is also known).
Some features (notably quotations and glosses) may be found in a text either marked by highlighting, or with quotation marks. In either case, the elements q and gloss (as discussed in the following section) should be used. If the rendition is to be recorded, use the global rend attribute.
Interpreting the role of the highlighting, the sentence might look like this:
On the one hand the Nibelungenlied is associated with the new rise of romance of twelfth-century France, the romans d'antiquité, the romances of Chrétien de Troyes, and the German adaptations of these works by Heinrich van Veldeke, Hartmann von Aue, and Wolfram von Eschenbach.
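With markup made explicit, an interpretive encoding of this sentence could be sketched as below. Which words were highlighted in the source, and therefore which are tagged, is an assumption made for this example, as is the language identifier fr.

```xml
<p>On the one hand the <title>Nibelungenlied</title> is associated
with the new rise of romance of twelfth-century France, the
<foreign lang="fr">romans d'antiquité</foreign>, the romances of
Chrétien de Troyes, and the German adaptations of these works by
Heinrich van Veldeke, Hartmann von Aue, and Wolfram von
Eschenbach.</p>
```

An encoder unwilling to interpret the highlighting could instead tag both phrases with hi and a suitable rend value.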
Quotations and Related Features
- q
- contains a quotation or apparent quotation, that is, a representation of speech or thought marked as being quoted from someone else (whether in fact quoted or not); in narrative, the words are usually those of a character or speaker; in dictionaries, q may be used to mark real or contrived examples of usage. Attributes include:
- type
- may be used to indicate whether the quoted matter is spoken or thought, or to characterize it more finely. Sample values include: spoken (for representation of direct speech, usually marked by quotation marks) and thought (for representation of thought, e.g. internal monologue).
- who
- identifies the speaker of a piece of direct speech.
- mentioned
- marks words or phrases mentioned, not used.
- soCalled
- contains a word or phrase for which the author or narrator indicates a disclaiming of responsibility, for example by the use of scare quotes or italics.
- gloss
- marks a word or phrase which provides a gloss or definition for some other word or phrase. Attributes include:
- target
- identifies the associated word or phrase.
To record how a quotation was printed (for example, in-line or set off as a display or block quotation), the rend attribute should be used. This may also be used to indicate the kind of quotation marks used.
The creator of the electronic text must decide whether quotation marks are replaced by the tags or whether the tags are added and the quotation marks kept. If the quotation marks are removed from the text, the rend attribute may be used to record the way in which they were rendered in the copy text.
As with highlighting, it is not always possible and may not be considered desirable to interpret the function of quotation marks in a text in this way. In such cases, the tag <hi rend="quoted"> might be used to mark quoted text without making any claim as to its status.
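The elements of this section might be used as in the following sketch. The sentences, the speaker names given in the who attribute, the rend value, and the identifier t1 are all invented for illustration.

```xml
<p><q type="spoken" who="Mary" rend="inline">Have you read it?</q>
she asked. <q type="thought" who="John">Not yet,</q> he thought,
though he had heard it described as a
<soCalled>classic</soCalled>.</p>

<p>A <term id="t1">pointer</term> is
<gloss target="t1">an empty element marking the point from which
a link is made</gloss>.</p>

<!-- no claim is made about the status of the quoted words -->
<p>He called it <hi rend="quoted">a mere trifle</hi>.</p>
```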
Foreign Words or Expressions
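The following sketch shows the foreign element alongside more specific elements carrying the global lang attribute. The sentences are invented, and the language identifiers fr, de, and la are assumptions which would normally need to be documented in the header.

```xml
<p>John has real <foreign lang="fr">savoir-faire</foreign>.</p>
<p>Have you read <title lang="de">Die Dreigroschenoper</title>?</p>
<p><mentioned lang="fr">Savoir-faire</mentioned> is French for
know-how.</p>
<p>The court issued a writ of <term lang="la">mandamus</term>.</p>
```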
As these examples show, the foreign element should not be used to tag foreign words if some other more specific element such as title, mentioned, or term applies. The global lang attribute may be attached to any element to show that it uses some other language than that of the surrounding text.
Notes
- note
- contains a note or annotation. Attributes include:
- type
- describes the type of note.
- resp
- indicates who is responsible for the annotation: author, editor, translator, etc. The value might be author, editor, etc., or the initials of the individual who added the annotation.
- place
- indicates where the note appears in the source text. Sample values include inline, interlinear, left, right, foot, and end, for notes which appear as marked paragraphs in the body of the text, between the lines, in the left or right margin, at the foot of the page, or at the end of the chapter or volume, respectively.
- target
- indicates the point of attachment of a note, or the beginning of the span to which the note is attached.
- targetEnd
- points to the end of the span to which the note is attached, if the note is not embedded in the text at that point.
- anchored
- indicates whether the copy text shows the exact place of reference for the note.
The n attribute may be used to supply the number or identifier of a note if this is required. The resp attribute should be used consistently to distinguish between authorial and editorial notes, if the work has both kinds; otherwise, the TEI header should state which kind they are.
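A sketch of two ways of encoding notes: one embedded at its point of attachment, the other placed elsewhere and linked by the target attribute. The text, the identifier p12, and the attribute values chosen are hypothetical.

```xml
<p>The manuscript was copied at Oxford
<note n="1" place="foot" type="gloss" resp="editor" anchored="yes">The
place of copying is inferred from the colophon.</note>
in the following year.</p>

<p id="p12">A paragraph to which an end note is attached.</p>
<!-- elsewhere in the document -->
<note n="2" place="end" resp="author" target="p12">A note that
points back at the paragraph it annotates.</note>
```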
Cross References and Links
Explicit cross references or links from one point in a text to another in the same SGML document may be encoded using the elements described in section Simple Cross References. References or links to elements of some other SGML document, or to parts of non-SGML documents, may be encoded using the TEI extended pointers described in section Extended Pointers. Implicit links (such as the association between two parallel texts, or that between a text and its interpretation) may be encoded using the linking attributes discussed in section Linking Attributes.
Simple Cross References
- ref
- a reference to another location in the current document, in terms of one or more identifiable elements, possibly modified by additional text or comment.
- ptr
- a pointer to another location in the current document in terms of one or more identifiable elements.
- target
- specifies the destination of the pointer as one or more SGML identifiers
- type
- categorizes the pointer in some respect, using any convenient set of categories.
- targType
- specifies the type (or types) of element to which this pointer may point.
- crDate
- specifies when this pointer was made.
- resp
- specifies the creator of the pointer.
The difference between these two elements is that ptr is an empty element, simply marking a point from which a link is to be made, whereas ref may contain some text as well — typically the text of the cross-reference itself. The ptr element would be used for a cross reference which is to be indicated by some non-verbal means such as a symbol or icon, or in an electronic text by a button. It is also useful in document production systems, where the formatter can generate the correct verbal form of the cross reference.
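The difference might be sketched as follows. The identifier dspec and the headings are hypothetical. The last pointer also uses the targType attribute to state that its target is expected to be a div1 or div2 element.

```xml
<div1 id="dspec" type="section">
  <head>Design Specification</head>
  <p>...</p>
</div1>
<div1 id="concl" type="section">
  <head>Conclusions</head>
  <!-- ref carries the text of the cross reference itself -->
  <p>See especially <ref target="dspec">the section on design
  specification</ref>.</p>
  <!-- ptr is empty: the formatter or browser supplies the visible form -->
  <p>See especially section <ptr target="dspec"/>.</p>
  <p>See especially <ptr target="dspec" targType="div1 div2"/>.</p>
</div1>
```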
This reference should fail if the element with identifier dspec is neither a div1 nor a div2. Note however that this additional check cannot be carried out by an SGML or XML parser alone, since such parsers can only check that some element dspec exists.
- anchor
- specifies a location or point within a document so that it may be pointed to.
- seg
- identifies a span or segment of text within a document so that it may be pointed to. Attributes include:
- type
- categorizes the segment
The type attribute should be used (as above) to distinguish amongst different purposes for which these general purpose elements might be used in a text. Some other uses are discussed in section Linking Attributes below.
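A sketch of both elements serving as the targets of cross references. The identifiers ABCD and s42 and the type value are hypothetical.

```xml
<p>Returning to <ref target="ABCD">the point where I left
off</ref> ...</p>
<p>... some text <anchor id="ABCD"/> some more text ...</p>

<p>... <seg type="target" id="s42">a span of text that may be
pointed to</seg> ...</p>
<p>Compare <ptr target="s42"/>.</p>
```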
Extended Pointers
- xptr
- defines a pointer to another location in the current document or an external document.
- xref
- defines a pointer to another location in the current document or an external document, possibly modified by additional text or comment.
- doc
- specifies the document within which the required location is to be found, by default the current document.
- from
- specifies the start of the destination of the pointer as an expression in the TEI extended pointer syntax, by default the whole of the document indicated by the doc attribute.
- to
- specifies the endpoint of the destination of the pointer as an expression in the TEI extended pointer syntax; may only be specified if the from attribute has been.
A full specification of the language used to express the target of TEI extended pointers is beyond the scope of this document; we list here only a few of its more generally useful features. The full Guidelines should be consulted for more detail.
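As a tentative sketch, a reference into an external document might look like this. The entity name P3 and the identifier SA are hypothetical, and the notation used in the from attribute is only an approximation of the extended pointer syntax.

```xml
<p>See <xref doc="P3" from="id (SA)">the relevant chapter of the
TEI Guidelines</xref>.</p>
<p>See also <xptr doc="P3" from="id (SA)"/>.</p>
```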
This example assumes that some system or public entity with the name P3 has been declared. This declaration has to be included within the DTD in force when the document is parsed; the manner of doing so is specific to the authoring software in use (as further discussed in section Figures and Graphics).
The from attribute is used to specify some location within whatever document is specified by the doc attribute. The specification uses a special language, called the TEI extended pointer syntax, only some details of which are given here. In this language, locations are defined as a series of steps, each one identifying some part of the document, often in terms of the locations identified by the previous step. For example, you would point to the third sentence of the second paragraph of chapter two by selecting chapter two in the first step, the second paragraph in the second step, and the third sentence in the last step. A step can be defined in terms of the document tree itself, using such concepts as parent, descendant, preceding, etc. or, more loosely, in terms of text patterns, word or character positions. You can also use a foreign (non-SGML) notation, or specify a location within a graphic in terms of its co-ordinate system.
The from and to attributes use the same notation. Each points to some portion of the target document; the extended pointer as a whole points to the section beginning at the start of the from and running to the end of the to.
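Steps defined in terms of the document tree use keywords such as the following, each selecting elements by their relationship to the current location:
- child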
- elements contained by this one.
- ancestor
- elements which contain this one, directly or indirectly.
- previous
- elements with the same parent as this one but preceding it in the document.
- next
- elements with the same parent as this one and following it in the document.
- preceding
- elements in the document which start before this one does, irrespective of their parents.
- following
- elements in the document which start after this one does, irrespective of their parents.
Each of these keywords selects a set of elements; which of them is intended may be specified by up to three further parts:
- a positive or negative number, indicating which of the possibly many elements found is intended (+1 indicating the first element encountered, starting from the current location, and -1 indicating the last), or the keyword all, indicating that all the elements in the set are to be pointed at;
- a generic identifier, indicating the type of element required, or a star indicating that any element type will do;
- a set of attribute names and values, indicating that the element selected should have attributes with the names and values specified, if any.
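The stepwise notation described in this section might be sketched as follows. This is an approximation only: the exact spelling of the keywords and the arrangement of the parenthesized parts should be checked against the full Guidelines. The identifier ch2 and the attribute value section are hypothetical.

```xml
<!-- third sentence of the second paragraph of the element with id ch2 -->
<xptr from="id (ch2) child (2 p) child (3 s)"/>

<!-- a span running from the first paragraph of ch2 to its last -->
<xptr from="id (ch2) child (1 p)" to="id (ch2) child (-1 p)"/>

<!-- the first div2 child of ch2 whose type attribute is "section" -->
<xptr from="id (ch2) child (1 div2 type section)"/>
```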
The TEI Extended Pointer Syntax was defined before the more recent XLink specifications, which are however to some extent derived from it. Work is currently going on to harmonize the two specification languages.
Linking Attributes
- ana
- links an element with its interpretation.
- corresp
- links an element with one or more other corresponding elements.
- next
- links an element to the next element in an aggregate.
- prev
- links an element to the previous element in an aggregate.
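A sketch of two of these attributes in use: corresp aligning a sentence with its translation, and next and prev joining the two halves of an interrupted speech. The text and all identifiers are invented for illustration.

```xml
<p><seg id="fr1" corresp="en1">Jean aime Nancy.</seg></p>
<p><seg id="en1" corresp="fr1">John loves Nancy.</seg></p>

<p><q id="q1a" next="q1b">I never,</q> she said,
<q id="q1b" prev="q1a">expected to see you here.</q></p>
```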
Editorial Interventions
The process of encoding an electronic text has much in common with the process of editing a manuscript or other text for printed publication. In both cases a conscientious editor may wish to record both the original state of the source and any editorial correction or other change made in it. The elements discussed in this and the next section provide some facilities for meeting these needs.
- corr
- contains the correct form of a passage apparently erroneous in the copy text. Attributes include:
- sic
- gives the original form of the apparent error in the copy text.
- resp
- signifies the editor or transcriber responsible for suggesting the correction held as the content of the corr element.
- cert
- signifies the degree of certainty ascribed to the correction held as the content of the corr element.
- sic
- contains text reproduced although apparently incorrect or inaccurate. Attributes include:
- corr
- gives a correction for the apparent error in the copy text.
- resp
- signifies the editor or transcriber responsible for suggesting the correction.
- cert
- signifies the degree of certainty ascribed to the correction.
- orig
- contains the original form of a reading, for which a
regularized form is given in an attribute value. Attributes include:
- reg
- gives a regularized (normalized) form of the text.
- resp
- identifies the individual responsible for the regularization of the word or phrase.
- reg
- contains a reading which has been regularized or normalized in
some sense. Attributes include:
- orig
- gives the unregularized form of the text as found in the source copy.
- resp
- identifies the individual responsible for the regularization of the word or phrase.
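As a minimal sketch of how these four elements might be used, consider the fragment below. The sentences are invented for illustration, and ed1 is a hypothetical identifier assumed to be declared elsewhere in the document for the responsible editor.

```xml
<!-- Corrected form in the text; the misprint is kept in the sic attribute -->
<p>The ship sailed at <corr sic="dwan" resp="ed1" cert="high">dawn</corr>.</p>

<!-- The reverse choice: misprint in the text, correction in the corr attribute -->
<p>The ship sailed at <sic corr="dawn" resp="ed1">dwan</sic>.</p>

<!-- Source spelling in the text; regularized form in the reg attribute -->
<p>She had <orig reg="shown">shewn</orig> the letter to no one.</p>

<!-- Regularized form in the text; source spelling in the orig attribute -->
<p>She had <reg orig="shewn" resp="ed1">shown</reg> the letter to no one.</p>
```

Whichever of each pair is chosen, it is probably wise to apply the choice consistently throughout a text.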
Omissions, Deletions, and Additions
- add
- contains letters, words, or phrases inserted in the text by an
author, scribe, annotator, or corrector. Attributes include:
- place
- if the addition is written into the copy text, indicates where the additional text is written. Sample values include inline, supralinear, infralinear, left (in left margin), right (in right margin), top, bottom, etc.
- gap
- indicates a point where material has been omitted in a
transcription, whether for editorial reasons described in the TEI
header, as part of sampling practice, or because the material is
illegible or inaudible. Attributes include:
- desc
- gives a description of the omitted text.
- resp
- indicates the editor, transcriber or encoder responsible for the decision not to provide any transcription of the text and hence the application of the gap tag.
- del
- contains a letter, word or passage deleted, marked as deleted,
or otherwise indicated as superfluous or spurious in the copy text by
an author, scribe, annotator or corrector. Attributes include:
- type
- classifies the type of deletion using any convenient typology.
- status
- may be used to indicate faulty deletions, e.g. strikeouts which include too much or too little text.
- hand
- signifies the hand of the agent which made the deletion.
- unclear
- contains a word, phrase, or passage which cannot be transcribed
with certainty because it is illegible or inaudible in the source.
Attributes include:
- reason
- indicates why the material is hard to transcribe.
- resp
- indicates the individual responsible for the transcription of the letter, word or passage contained with the unclear element.
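The following sketch shows how these elements might combine in the transcription of a single manuscript sentence. The sentence is invented, and the values given for type, hand, reason, desc, and resp are assumptions chosen for illustration (auth and ed1 are hypothetical identifiers).

```xml
<p>I shall <del type="overstrike" hand="auth">never</del>
<add place="supralinear">always</add> remember the
<unclear reason="faded ink" resp="ed1">harbour</unclear> at
<gap desc="one illegible word" resp="ed1"/>.</p>
```

Note that gap is empty: it records that something has been left out, whereas unclear contains the encoder's best reading of the doubtful passage.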
Names, Dates, Numbers and Abbreviations
The TEI scheme defines elements for a large number of ‘data-like’ features which may appear almost anywhere within almost any kind of text. These features may be of particular interest in a range of disciplines; they all relate to objects external to the text itself, such as the names of persons and places, numbers and dates. They also pose particular problems for many natural language processing (NLP) applications because of the variety of ways in which they may be presented within a text. The elements described here, by making such features explicit, reduce the complexity of processing texts containing them.
Names and Referring Strings
- rs
- contains a general purpose name or referring string. Attributes
include:
- type
- indicates more specifically the object referred to by the referring string. Values might include person, place, ship, element, etc.
- name
- contains a proper noun or noun phrase. Attributes include:
- type
- indicates the type of the object which is being named by the phrase.
The name element, by contrast, is provided for the special case of referring strings which consist only of proper nouns; it may be used synonymously with the rs element, or nested within it if a referring string contains a mixture of common and proper nouns.
Simply tagging something as a name is generally not enough to enable automatic processing of personal names into the canonical forms usually required for reference purposes. The name as it appears in the text may be inconsistently spelled, partial, or vague. Moreover, name prefixes such as van or de la may or may not be included as part of the reference form of a name, depending on the language and country of origin of the bearer.
- key
- provides an alternative identifier for the object being named, such as a database record key.
- reg
- gives a normalized or regularized form of the name used.
More detailed tagging of the components of proper names is also possible, using the additional tag set for names and dates.
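By way of illustration, the fragment below sketches a name carrying key and reg attributes, followed by a referring string that mixes common and proper nouns. The sentence is invented and the key value P0042 is a hypothetical database record key.

```xml
<p>The letter was delivered to
<name type="person" key="P0042" reg="Gogh, Vincent van">V. van Gogh</name>
by <rs type="person">the <name type="place">Arles</name> postman</rs>.</p>
```

Here the reg attribute supplies a reference form in which the prefix van follows the surname; a different cataloguing convention might place it otherwise.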
Dates and Times
- date
- contains a date in any format. Attributes include:
- calendar
- indicates the system or calendar to which the date belongs.
- value
- gives the value of the date in some standard form, usually yyyy-mm-dd.
- time
- contains a phrase defining a time of day in any format.
Attributes include:
- value
- gives the value of the time in a standard form.
The value attribute specifies a normalized form for the date or time, using a recognized format such as ISO 8601. Partial dates or times (e.g. ‘1990’, ‘September 1990’, ‘twelvish’) can usually be expressed by simply omitting a part of the value supplied; alternatively, imprecise dates or times (for example ‘early August’, ‘some time after ten and before twelve’) may be expressed as date or time ranges. If either end of the date or time range is known to be accurate (for example, ‘at some time before 1230’, ‘a few days after Hallowe'en’), the exact attribute may be used to specify this.
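A brief sketch of normalized and partial values follows. The sentences are invented, and ISO 8601 forms are assumed for the value attribute.

```xml
<!-- A full date and a time of day, each normalized in the value attribute -->
<p>We met on <date value="1990-09-14">the fourteenth of September 1990</date>,
shortly after <time value="10:30">half past ten</time>.</p>

<!-- A partial date: the day is unknown, so it is simply omitted from the value -->
<p>The lease expired in <date value="1990-09">September 1990</date>.</p>
```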
Numbers
- num
- contains a number, written in any form. Attributes include:
- type
- indicates the type of numeric value. Suggested values include: fraction, ordinal (for ordinal numbers, e.g. ‘21st’), percentage, and cardinal (an absolute number, e.g. ‘21’, ‘21.5’, etc.)
- value
- supplies the value of the number in an application-dependent standard form.
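For instance, numbers written out in words might be tagged as follows. This is an invented sentence, and the forms chosen for value are only one possible application-dependent convention.

```xml
<p>She was <num type="ordinal" value="21">twenty-first</num> in line, and owned
<num type="fraction" value="0.5">one half</num> of the estate, or
<num type="percentage" value="50">fifty per cent</num> of its worth.</p>
```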
Abbreviations and their Expansion
- abbr
- contains an abbreviation of any sort. Attributes include:
- expan
- gives an expansion of the abbreviation.
- type
- allows the encoder to classify the abbreviation according to some convenient typology. Sample values include contraction, suspension, brevigraph, superscription, or acronym. The type attribute may also be given values like title (for titles of address), geographic, organization, etc., describing the nature of the object referred to.
This element is also particularly useful where manuscript materials in which abbreviation is very frequent are being transcribed.
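A minimal sketch, using an invented sentence, of abbreviations tagged with their expansions:

```xml
<p>The meeting of the
<abbr type="acronym" expan="Text Encoding Initiative">TEI</abbr>
was chaired by <abbr type="title" expan="Doctor">Dr.</abbr> Smith.</p>
```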
Addresses
- address
- contains a postal or other address, for example of a publisher, an organization, or an individual.
- addrLine
- contains one line of a postal or other address.
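An address might be encoded line by line, as in the following sketch (the address itself is fictitious):

```xml
<address>
  <addrLine>Example University Press</addrLine>
  <addrLine>12 Invented Street</addrLine>
  <addrLine>Anytown</addrLine>
</address>
```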
Lists
- list
- contains any sequence of items organized as a list. Attributes
include:
- type
- describes the form of the list. Suggested values include: ordered, bulleted (for lists with numbered or lettered items, and lists with bullet-marked items, respectively), gloss (for lists consisting of a set of technical terms, each marked with a label element and accompanied by a gloss or definition marked as an item), and simple (for lists with items not marked with numbers or bullets).
- item
- contains one component of a list.
- label
- contains the label associated with an item in a list; in glossaries, marks the term being defined.
Where the internal structure of a list item is more complex, it may be preferable to regard the list as a table, for which special-purpose tagging is defined below (Tables).
Lists of bibliographic items should be tagged using the listBibl element, described in the next section.
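The two commonest cases, an ordered list and a glossary list, might be sketched as follows. The content is invented; the optional head element is assumed to supply a heading for the list.

```xml
<list type="ordered">
  <item>Transcribe the text.</item>
  <item>Proofread the transcription.</item>
  <item>Add the markup.</item>
</list>

<list type="gloss">
  <head>Terms used in this chapter</head>
  <label>copy text</label>
  <item>the source from which an electronic text is transcribed</item>
  <label>encoder</label>
  <item>the person who adds markup to a text</item>
</list>
```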
Bibliographic Citations
- bibl
- contains a loosely-structured bibliographic citation of which the sub-components may or may not be explicitly tagged.
- author
- in a bibliographic reference, contains the name of the author(s), personal or corporate, of a work; the primary statement of responsibility for any bibliographic item.
- biblScope
- defines the scope of a bibliographic reference, for example as a list of page numbers, or a named subdivision of a larger work.
- date
- contains a date in any format.
- editor
- secondary statement of responsibility for a
bibliographic item, for example the name of an individual, institution
or organization, (or of several such) acting as editor, compiler,
translator, etc. Attributes include:
- role
- specifies the nature of the intellectual responsibility. Sample values include translator, compiler, illustrator, etc.; the default value is editor.
- imprint
- groups information relating to the publication or distribution of a bibliographic item.
- publisher
- provides the name of the organization responsible for the publication or distribution of a bibliographic item.
- pubPlace
- contains the name of the place where a bibliographic item was published.
- series
- contains information about the series in which a book or other bibliographic item has appeared.
- title
- contains the title of a work, whether article, book, journal, or series, including any alternative titles or subtitles. Attributes include:
- type
- categorizes the title in some way, for example as main, subordinate, etc.
- level
- indicates the bibliographic level or class of title. Legal values are described in section Changes of Typeface, etc.
For example, the following editorial note contains a citation which might be tagged as a bibl: He was a member of Parliament for Warwickshire in 1445, and died March 14, 1470 (according to Kittredge, Harvard Studies 5. 88ff).
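One plausible encoding of this note, tagging only those components of the citation that can be identified with confidence, is sketched here:

```xml
<p>He was a member of Parliament for Warwickshire in 1445, and died
March 14, 1470 (according to <bibl><author>Kittredge</author>,
<title>Harvard Studies</title> <biblScope>5. 88ff</biblScope></bibl>).</p>
```

Because bibl is loosely structured, the punctuation between the tagged components may simply be left in place.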
For lists of bibliographic citations, the listBibl element should be used; it may contain a series of bibl elements.
Tables
- table
- contains text displayed in tabular form, in rows and columns.
Attributes include:
- rows
- indicates the number of rows in the table.
- cols
- indicates the number of columns in each row of the table.
- row
- contains one row of a table. Attributes include:
- role
- indicates the kind of information held in the cells of this row. Suggested values include label for labels or descriptive information, and data for actual data values.
- cell
- contains one cell of a table. Attributes include:
- role
- indicates the kind of information held in the cell. Suggested values include label for labels or descriptive information, and data for actual data values.
- cols
- indicates the number of columns occupied by this cell.
- rows
- indicates the number of rows occupied by this cell.
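A small table might be sketched as follows. The figures are invented, and the use of a head element for the caption is an assumption of this example.

```xml
<table rows="3" cols="2">
  <head>Invented sample figures</head>
  <row role="label">
    <cell>Year</cell>
    <cell>Volumes printed</cell>
  </row>
  <row role="data">
    <cell role="label">1850</cell>
    <cell>120</cell>
  </row>
  <row role="data">
    <cell role="label">1851</cell>
    <cell>145</cell>
  </row>
</table>
```

The role attribute on the first row marks all of its cells as labels; the same attribute on an individual cell marks a row heading within a row of data.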
Figures and Graphics
Not all the components of a document are necessarily textual. The most straightforward text will often contain diagrams or illustrations, to say nothing of documents in which image and text are inextricably intertwined, or electronic resources in which the two are complementary.
- figure
- marks the spot at which a graphic is to be inserted in a
document. Attributes include:
- entity
- the name of a pre-defined system entity containing a digitized version of the graphic to be inserted.
- figDesc
- contains a textual description of the appearance or content of a graphic, for use when documenting an image without displaying it.
Any textual information accompanying the graphic, such as a heading and/or caption, may be included within the figure element itself, in a head and one or more p elements, as may also any text appearing within the graphic itself. It is strongly recommended that a prose description of the image be supplied, as the content of a figDesc element, for the use of applications which are not able to render the graphic, and to render the document accessible to vision-impaired readers. (Such text is not normally considered part of the document proper.)
When a digitized version of the graphic concerned is available, it is clearly preferable to embed it at the appropriate point within the document. Graphic elements such as pictures are typically stored in separate entities (files) from those containing the text of a document, and using a different notation (storage format). The TEI Lite DTD supports graphics encoded using the CGM, PNG, TIFF, GIF, or JPEG standards under the SGML notation names cgm, png, tiff, gif, and jpeg respectively.
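A figure might then be sketched as follows. The entity name frontis and the file name frontis.png are hypothetical; the entity is assumed to have been declared in the DTD subset of the document, as shown in the comment.

```xml
<!-- Assumed declaration in the DTD subset of the document:
     <!ENTITY frontis SYSTEM "frontis.png" NDATA png>
-->
<figure entity="frontis">
  <head>Frontispiece</head>
  <figDesc>An engraved portrait of a seated man holding a quill pen.</figDesc>
</figure>
```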
Interpretation and Analysis
It is often said that all markup is a form of interpretation or analysis. While it is certainly difficult, and may be impossible, to distinguish firmly between ‘objective’ and ‘subjective’ information in any universal way, it remains true that judgments concerning the latter are typically regarded as more likely to provoke controversy than those concerning the former. Many scholars therefore prefer to record such interpretations only if it is possible to alert the reader that they are considered more open to dispute than the rest of the markup. This section describes some of the elements provided by the TEI scheme to meet this need.
Orthographic Sentences
- s
- identifies an s-unit within a document, for purposes of establishing a simple canonical referencing scheme covering the entire text. Attributes include:
- type
- categorizes the unit (e.g. as declarative, interrogative, etc.)
General-Purpose Interpretation Elements
A more general purpose segmentation element, the seg element, has already been introduced for use in identifying otherwise unmarked targets of cross references and hypertext links (see section Cross References and Links); it identifies some phrase-level portion of text to which the encoder may assign a user-specified type, as well as a unique identifier; it may thus be used to tag textual features for which there is no provision in the published TEI Guidelines.
A seg element of one type (unlike the s element which it superficially resembles) can be nested within a seg element of the same or another type. This enables quite complex structures to be represented; some examples were given in section Linking Attributes above. However, because it must respect the requirement that elements be properly nested, and may not cut across each other, it cannot cope with the common requirement to associate an interpretation with arbitrary segments of a text which may completely ignore the document hierarchy. It also requires that the interpretation itself be represented by a single coded value in the type attribute.
- interp
- provides for an interpretive annotation which can be linked to
a span of text. Attributes include:
- value
- identifies the specific phenomenon being annotated.
- resp
- indicates who is responsible for the interpretation.
- type
- indicates what kind of phenomenon is being noted in the passage. Sample values include image, character, theme, allusion, or the name of a particular discourse type whose instances are being identified.
- inst
- points to instances of the analysis or interpretation represented by the current element.
- interpGrp
- collects together interp tags.
Moreover, interp is an empty element, which must be linked to the passage to which it applies either by means of the ana attribute discussed in section Linking Attributes above, or by means of its own inst attribute. This means that any kind of analysis can be represented, with no need to respect the document hierarchy, and also facilitates the grouping of analyses of a particular type together. A special purpose interpGrp element is provided for the latter purpose.
For example, suppose that you wish to mark such diverse aspects of a text as themes or subject matter, rhetorical figures, and the locations of individual scenes of the narrative. Different portions of our sample passage from Jane Eyre for example, might be associated with the rhetorical figures of apostrophe, hyperbole, and metaphor; with subject-matter references to churches, servants, cooking, postal service, and honeymoons; and with scenes located in the church, in the kitchen, and in an unspecified location (drawing room?).
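A much reduced sketch of such an analysis follows, using only the first two sentences of the chapter. The identifiers, the type values on interpGrp, and the placement of the interpretations in a division of their own are assumptions made for this illustration. The passage is linked to its interpretations by means of the ana attribute.

```xml
<p>
  <s id="s1" ana="fig-apos">Reader, I married him.</s>
  <s id="s2" ana="loc-church">A quiet wedding we had: he and I, the parson
  and clerk, were alone present.</s>
</p>

<div type="interpretations">
  <p>
    <interpGrp type="rhetorical-figure">
      <interp id="fig-apos" value="apostrophe"/>
      <interp id="fig-hyper" value="hyperbole"/>
      <interp id="fig-meta" value="metaphor"/>
    </interpGrp>
    <interpGrp type="scene-location">
      <interp id="loc-church" value="church"/>
      <interp id="loc-kitchen" value="kitchen"/>
    </interpGrp>
  </p>
</div>
```

Grouping the interp elements by type in this way keeps each kind of analysis together, independently of where in the text its instances occur.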
Technical Documentation
Although the focus of this document is on the use of the TEI scheme for the encoding of existing ‘pre-electronic’ documents, the same scheme may also be used for the encoding of new documents. In the preparation of new documents (such as this one), XML has much to recommend it: the document's structure can be clearly represented, and the same electronic text can be re-used for many purposes — to provide both online hypertext or browsable versions and well-formatted typeset versions from a common source for example.
To facilitate this, a small number of additional elements are included in TEI Lite as extensions of the main TEI DTD, for use in marking particular features of technical documents in general, and of XML-related documents in particular.
Additional Elements for Technical Documents
- eg
- contains a single short example of some technical topic being discussed, e.g. a code fragment or a sample of SGML encoding.
- code
- contains a short fragment of code in some formal language (often a programming language).
- ident
- contains an identifier of some kind, e.g. a variable name or the name of an XML element or attribute.
- gi
- contains a special type of identifier: an XML generic identifier, or element name.
- kw
- contains a keyword in some formal language.
- formula
- contains a mathematical or chemical formula, optionally
presented in some non-XML notation. Attributes include:
- notation
- specifies the notation used to represent the body of the formula. Default value is tex, meaning the formula is represented using the TeX typesetting system.
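A passage of technical documentation using several of these elements might look something like the following sketch. It is an illustration written for this purpose rather than an extract from any existing manual.

```xml
<p>A list is encoded with the <gi>list</gi> element, whose
<ident>type</ident> attribute may take the value <kw>ordered</kw>:
<eg><![CDATA[
<list type="ordered">
  <item>first item</item>
  <item>second item</item>
</list>
]]></eg>
Each entry in the list is tagged with the <gi>item</gi> element.</p>
```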
A formatting application, given a text like that above, can be instructed to format examples appropriately (e.g. to preserve line breaks, or to use a distinctive font). Similarly, the use of tags such as ident and kw greatly facilitates the construction of a useful index.
The TeX notation is not predefined in the TEI Lite DTD, and must therefore be defined by a notation declaration within the DTD subset.
The list element used within the example above will not be regarded as forming part of the document proper, because it is embedded within a marked section (beginning with the special markup declaration <![CDATA[ , and ending with ]]>).
Note also the use of the gi element to tag references to element names (or generic identifiers) within the body of the text.
Generated Divisions
- divGen
- indicates the location at which a textual division generated
automatically by a text-processing application is to appear.
Attributes include:
- type
- specifies what type of generated text division (e.g. index, table of contents, etc.) is to appear. Sample values include: index (an index is to be generated and inserted at this point), toc (a table of contents), figlist (a list of figures), and tablist (a list of tables).
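For instance, the skeleton of a manual with a generated table of contents in its front matter and a generated index in its back matter might be sketched as follows (titles and text are invented):

```xml
<front>
  <titlePage>
    <docTitle>
      <titlePart type="main">An Invented Manual</titlePart>
    </docTitle>
  </titlePage>
  <divGen type="toc"/>
</front>
<body>
  <div>
    <head>Introduction</head>
    <p>Text of the manual.</p>
  </div>
</body>
<back>
  <divGen type="index"/>
</back>
```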
This example also demonstrates the use of the type attribute to distinguish the different kinds of division to be generated: in the first case a table of contents (a toc) and in the second an index.
When an existing index or table of contents is to be encoded (rather than one being generated) for some reason, the list element discussed in section Lists should be used.
Index Generation
While production of a table of contents from a properly tagged document is generally unproblematic for an automatic processor, the production of a good quality index will often require more careful tagging. It may not be enough simply to produce a list of all parts tagged in some particular way, although extracting (for example) all occurrences of elements such as term or name will often be a good point of departure for an index.
- index
- marks a location to be indexed for some purpose. Attributes
include:
- level1
- gives the main form of the index entry.
- level2
- gives the second-level form, if any.
- level3
- gives the third-level form, if any.
- level4
- gives the fourth-level form, if any.
- index
- indicates which index (of several) the index entry belongs to.
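An index entry might be attached to a point in the text as in the sketch below. The sentence and headings are invented, and places is a hypothetical name for a second, separate index.

```xml
<p>The harvest of 1850 was poor throughout the county.<index
level1="harvests" level2="1850"/><index index="places"
level1="Warwickshire"/></p>
```

The index element is empty: it marks only the location to which the generated entry should refer.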
Character Sets, Diacritics, etc.
With the advent of XML and its adoption of Unicode as the required character set for all documents, most problems previously associated with the representation of the diverse languages and writing systems of the world are greatly reduced. For those working with standard forms of the European languages in particular, almost no special action is needed: any XML editor should enable you to input accented letters or other ‘non-ASCII’ characters directly, and they should be stored in the resulting file in a way which is transferable directly between different systems, whether as Unicode characters or as character entity references.
For compatibility with other older systems, however, the TEI Lite DTD includes declarations for a number of the most widely used character entities, so that such characters may be entered and saved as character mnemonics.
You may use your own entity names in TEI-conformant files, if you wish and if you provide entity declarations for them, mapping the name to the appropriate Unicode value. The standard names (though long-winded) have the advantage of clarity; the characters intended are reasonably clear to any speaker of English who recognizes that a character is being named, often even without recourse to any list. This is not true of many older schemes for representing accented characters.
- digraphs
- Form entity names for digraphs by appending the string lig to the letters forming the digraph. If a capitalized form is required, both letters are given in upper case (remember that case is usually significant in entity names). E.g.: aelig (æ), AElig (Æ), szlig (ß).
- diacritics and accents
- Form entity names for accented letters in most Western European languages by appending one of the following strings to the letter bearing the accent, which may be in upper or lower case.
- umlaut
- use uml for umlaut or trema: e.g. auml (ä), Auml (Ä), euml (ë), iuml (sic: ï), ouml (ö), Ouml (Ö), uuml (ü), Uuml (Ü).
- acute
- use acute for acute or stressed accent: e.g. aacute (á), eacute (é), Eacute (É), iacute (í), oacute (ó), uacute (ú).
- grave
- use grave for grave accent: e.g. agrave (à), egrave (è), igrave (ì), ograve (ò), ugrave (ù).
- circumflex
- use circ for circumflex: e.g. acirc (â), ecirc (ê), Ecirc (Ê), icirc (î), ocirc (ô), ucirc (û).
- tilde
- use tilde for tilde: e.g. atilde (ã), Atilde (Ã), ntilde (ñ), Ntilde (Ñ), otilde (õ), Otilde (Õ).
- consonants
- The following are recommended entity names for some special consonants found in Western European languages: ccedil (ç), Ccedil (Ç), eth (lowercase eth or Anglo-Saxon/Icelandic crossed d), ETH (uppercase eth), thorn (lowercase thorn), THORN (uppercase thorn), szlig (German s-z ligature or esszett, ß).
- punctuation marks
- The following are recommended entity names for some commonly found punctuation marks: ldquo (left double quotation mark, in shape of superscript 66), rdquo (right double quotation mark, superscript 99), mdash (one-em dash), hellip (horizontal ellipsis, three closely spaced dots), rsquo (right single quote, in shape of superscript 9).
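The two paragraphs sketched below should be equivalent, assuming a DTD in which the entity names above are declared: the first uses character mnemonics, the second enters the same characters directly. The sentence is invented, and e-acute is a hypothetical locally chosen entity name.

```xml
<p>&ldquo;Na&iuml;ve caf&eacute; society,&rdquo; wrote the cur&eacute;&hellip;</p>

<p>“Naïve café society,” wrote the curé…</p>

<!-- A locally chosen name, mapped to the Unicode value U+00E9, could be
     declared in the DTD subset like this:
     <!ENTITY e-acute "&#xE9;">
-->
```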
Front and Back Matter
Front Matter
For many purposes, particularly in older texts, the preliminary material such as title pages, prefatory epistles, etc., may provide very useful additional linguistic or social information. P3 provides a set of recommendations for distinguishing the textual elements most commonly encountered in front matter, which are summarized here.
Title Page
- titlePage
- contains the title page of a text, appearing within the front or back matter.
- docTitle
- contains the title of a document, including all its constituents, as given on a title page. Must be divided into titlePart elements.
- titlePart
- contains a subsection or division of the title of a work, as
indicated on a title page; also used for free-floating fragments of
the title page not part of the document title, authorship attribution,
etc. Attributes include:
- type
- specifies the role of this subdivision of the title. Suggested values include: main (main title), sub (subtitle), desc (a descriptive paraphrase of the work included in the title), and alt (alternative title).
- byline
- contains the primary statement of responsibility given for a work on its title page or at the head or end of the work.
- docAuthor
- contains the name of the author of the document, as given on the title page (often but not always contained in a byline).
- docDate
- contains the date of the document, as given (usually) on the title page.
- docEdition
- contains an edition statement as presented on a title page of a document.
- docImprint
- contains the imprint statement (place and date of publication, publisher name), as given (usually) at the foot of a title page.
- epigraph
- contains a quotation, anonymous or attributed, appearing at the start of a section or chapter, or on a title page.
Typeface distinctions should be marked with the rend attribute when necessary, as described above. Very detailed description of the letter spacing and sizing used in ornamental titles is not as yet provided for by the Guidelines. Changes of language should be marked by appropriate use of the lang attribute or the foreign element, as necessary. Names, wherever they appear, should be tagged using the name element, as elsewhere.
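A simple title page might be sketched as follows. The work, author, and publisher are all invented.

```xml
<front>
  <titlePage>
    <docTitle>
      <titlePart type="main">A Treatise on Invented Things</titlePart>
      <titlePart type="sub">with Remarks on their Uses</titlePart>
    </docTitle>
    <byline>By <docAuthor><name>A. N. Author</name></docAuthor></byline>
    <docEdition>Second edition</docEdition>
    <docImprint><pubPlace>London</pubPlace>:
      <publisher>Example Press</publisher>,
      <docDate>1850</docDate></docImprint>
  </titlePage>
</front>
```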
Prefatory Matter
- foreword
- a text addressed to the reader, by the author, editor or publisher, possibly in the form of a letter.
- preface
- a text addressed to the reader, by the author, editor or publisher, possibly in the form of a letter.
- dedication
- a text (often a letter) addressed to someone other than the reader in which the author typically commends the work in hand to the attention of the person concerned.
- abstract
- a prose argument summarizing the content of the work.
- ack
- Acknowledgements.
- contents
- a table of contents (typically this should be tagged as a list).
- frontispiece
- a pictorial frontispiece, possibly including some text.
- salute
- contains a salutation or greeting prefixed to a foreword, dedicatory epistle or other division of a text, or the salutation in the closing of a letter, preface, etc.
- signed
- contains the closing salutation, etc., appended to a foreword, dedicatory epistle, or other division of a text.
- byline
- contains the primary statement of responsibility given for a work on its title page or at the head or end of the work.
- dateline
- contains a brief description of the place, date, time, etc., of production of a letter, newspaper story, or other work, prefixed or suffixed to it as a kind of heading or trailer.
- argument
- A formal list or prose description of the topics addressed by a subdivision of a text.
- cit
- A quotation from some other document, together with a bibliographic reference to its source.
- opener
- groups together dateline, byline, salutation, and similar phrases appearing as a preliminary group at the start of a division, especially of a letter.
- closer
- groups together dateline, byline, salutation, and similar phrases appearing as a final group at the end of a division, especially of a letter.
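As a sketch, a dedicatory letter in the front matter might be encoded as below. The text is invented, and the division is assumed to be an ordinary div whose type attribute names the kind of prefatory matter it contains.

```xml
<div type="dedication">
  <opener>
    <dateline>London, <date value="1850-05-01">1 May 1850</date></dateline>
    <salute>To my honoured patron,</salute>
  </opener>
  <p>This little book is offered in gratitude for many kindnesses.</p>
  <closer>
    <salute>Your humble servant,</salute>
    <signed>The Author</signed>
  </closer>
</div>
```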
Back Matter
Structural Divisions of Back Matter
- appendix
- an appendix.
- glossary
- a list of words and definitions, typically in the form of a list element with the type value gloss.
- notes
- a series of notes.
- bibliography
- a series of bibliographic references, typically in the form of a special bibliographic-list element listBibl, whose items are individual bibl elements.
- index
- a set of index entries, possibly represented as a structured list or glossary list, with optional leading head and perhaps some paragraphs of introductory or closing text (TEI P3 defines other specialized elements for generating indices in document production, described above in section Index Generation).
- colophon
- a description at the back of the book describing where, when, and by whom it was printed; in modern books it also often gives production details and identifies the type faces used.
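The back matter of a short book might accordingly be sketched as follows. All content is invented, and each division is assumed to be a div whose type attribute takes one of the values just listed.

```xml
<back>
  <div type="appendix">
    <head>Appendix: A Note on the Text</head>
    <p>The copy text is the second edition.</p>
  </div>
  <div type="glossary">
    <head>Glossary</head>
    <list type="gloss">
      <label>colophon</label>
      <item>a statement at the back of a book describing its printing</item>
    </list>
  </div>
  <div type="bibliography">
    <head>Works Cited</head>
    <listBibl>
      <bibl><author>A. N. Author</author>,
        <title>An Invented Treatise</title>, <date>1850</date>.</bibl>
    </listBibl>
  </div>
</back>
```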
The Electronic Title Page
- fileDesc
- contains a full bibliographic description of an electronic file.
- encodingDesc
- documents the relationship between an electronic text and the source or sources from which it was derived.
- profileDesc
- provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting.
- revisionDesc
- summarizes the revision history for a file.
- Elements whose names end in Stmt (for statement) usually enclose a group of elements recording some structured information.
- Elements whose names end in Decl (for declaration) enclose information about specific encoding practices.
- Elements whose names end in Desc (for description) contain a prose description.
The File Description
- titleStmt
- groups information about the title of a work and those responsible for its intellectual content.
- editionStmt
- groups information relating to one edition of a text.
- extent
- describes the approximate size of the electronic text as stored on some carrier medium, specified in any convenient units.
- publicationStmt
- groups information concerning the publication or distribution of an electronic or other text.
- seriesStmt
- groups information about the series, if any, to which a publication belongs.
- notesStmt
- collects together any notes providing information about a text additional to that recorded in other parts of the bibliographic description.
- sourceDesc
- supplies a bibliographic description of the copy text(s) from which an electronic text was derived or generated.
The Title Statement
- title
- contains the title of a work, whether article, book, journal, or series, including any alternative titles or subtitles.
- author
- in a bibliographic reference, contains the name of the author(s), personal or corporate, of a work; the primary statement of responsibility for any bibliographic item.
- sponsor
- specifies the name of a sponsoring organization or institution.
- funder
- specifies the name of an individual, institution, or organization responsible for the funding of a project or text.
- principal
- supplies the name of the principal researcher responsible for the creation of an electronic text.
- respStmt
- supplies a statement of responsibility for someone responsible for the intellectual content of a text, edition, recording, or series, where the specialized elements for authors, editors, etc., do not suffice or do not apply.
- resp
- contains a phrase describing the nature of a person's intellectual responsibility.
- name
- contains a proper noun or noun phrase.
The Edition Statement
- edition
- describes the particularities of one edition of a text.
- respStmt
- supplies a statement of responsibility for someone responsible for the intellectual content of a text, edition, recording, or series, where the specialized elements for authors, editors, etc., do not suffice or do not apply.
Determining exactly what constitutes a new edition of an electronic text is left to the encoder.
The Publication Statement
- publisher
- provides the name of the organization responsible for the publication or distribution of a bibliographic item.
- distributor
- supplies the name of a person or other agency responsible for the distribution of a text.
- authority
- supplies the name of a person or other agency responsible for making an electronic file available, other than a publisher or distributor.
- pubPlace
- contains the name of the place where a bibliographic item was published.
- address
- contains a postal or other address, for example of a publisher, an organization, or an individual.
- idno
- supplies any standard or non-standard number used to identify a
bibliographic item. Attributes include:
- type
- categorizes the number, for example as an ISBN or other standard series.
- availability
- supplies information about the availability of a text, for
example any restrictions on its use or distribution, its copyright
status, etc. Attributes include:
- status
- supplies a code identifying the current availability of the text. Sample values include restricted, unknown, and free.
- date
- contains a date in any format.
Series and Notes Statements
The seriesStmt groups information about the series, if any, to which a publication belongs. It may contain title, idno, or respStmt elements.
The notesStmt, if used, contains one or more note elements which contain a note or annotation. Some information found in the notes area in conventional bibliography has been assigned specific elements in the TEI scheme.
The Source Description
- bibl
- contains a loosely-structured bibliographic citation of which the sub-components may or may not be explicitly tagged.
- biblFull
- contains a fully-structured bibliographic citation, in which all components of the TEI file description are present.
- listBibl
- contains a list of bibliographic citations of any kind.
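The difference between a loosely tagged and a more explicitly tagged bibl can be seen in the following sketch, which groups two citations of the same (invented) source in a listBibl. In practice only one form would normally be given.

```xml
<!-- Hypothetical source description: the work cited does not exist -->
<sourceDesc>
  <listBibl>
    <!-- Loose form: only the title is tagged -->
    <bibl>A. N. Author, <title>An Example Novel</title>
      (Exampletown: Example Press, 1900).</bibl>
    <!-- Tagged form: each sub-component is marked explicitly -->
    <bibl>
      <author>A. N. Author</author>
      <title>An Example Novel</title>
      <pubPlace>Exampletown</pubPlace>
      <publisher>Example Press</publisher>
      <date>1900</date>
    </bibl>
  </listBibl>
</sourceDesc>
```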
The Encoding Description
- projectDesc
- describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
- samplingDecl
- contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection.
- editorialDecl
- provides details of editorial principles and practices applied during the encoding of a text.
- tagsDecl
- provides detailed information about the tagging applied to an SGML document.
- refsDecl
- specifies how canonical references are constructed for this text.
- classDecl
- contains one or more taxonomies defining any classificatory codes used elsewhere in the text.
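The first two of these components are simple prose descriptions. A sketch of an encoding description using them, with an invented project and sampling policy, might be:

```xml
<!-- Hypothetical encoding description: the project and its sampling policy are invented -->
<encodingDesc>
  <projectDesc>
    <p>Texts collected for use in an undergraduate course on the
      nineteenth-century novel.</p>
  </projectDesc>
  <samplingDecl>
    <p>Samples of about 2000 words were taken from the beginning of
      each text.</p>
  </samplingDecl>
</encodingDesc>
```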
Editorial Declarations
- correction
- how and under what circumstances corrections have been made in the text.
- normalization
- the extent to which the original source has been regularized or normalized.
- quotation
- what has been done with quotation marks in the original -- have they been retained or replaced by entity references, are opening and closing quotes distinguished, etc.
- hyphenation
- what has been done with hyphens (especially end-of-line hyphens) in the original -- have they been retained, replaced by entity references, etc.
- segmentation
- how the text has been segmented, for example into sentences, tone-units, graphemic strata, etc.
- interpretation
- what analytic or interpretive information has been added to the text.
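An editorial declaration need not use all of these elements. The following sketch uses four of them, each containing a paragraph of prose; the policies described are invented for illustration and are not recommendations.

```xml
<!-- Hypothetical editorial declaration: the policies stated are examples only -->
<editorialDecl>
  <correction>
    <p>Apparent errors in the copy text are retained and marked with
      the sic element.</p>
  </correction>
  <normalization>
    <p>Spelling has not been regularized.</p>
  </normalization>
  <quotation>
    <p>Quotation marks in the original have been removed; quoted
      matter is marked with the q element.</p>
  </quotation>
  <hyphenation>
    <p>End-of-line hyphens have been removed and the hyphenated word
      rejoined.</p>
  </hyphenation>
</editorialDecl>
```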
Tagging, Reference, and Classification Declarations
- rendition
- supplies information about the intended rendition of one or more elements.
- tagUsage
- supplies information about the usage of a specific element within the outermost text of a TEI conformant document. Attributes include:
- gi
- the name (generic identifier) of the element indicated by the tag.
- occurs
- specifies the number of occurrences of this element within the text.
- ident
- specifies the number of occurrences of this element within the text which bear a distinct value for the global id attribute.
- render
- specifies the identifier of a rendition element which defines how this element is to be rendered.
The refsDecl element is used to document the way in which any standard referencing scheme built into the encoding works. In its simplest form, it consists of prose description.
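The following sketch shows a tagging declaration followed by a reference declaration in its simple prose form. The counts, identifiers, and the referencing scheme described are invented; the id on rendition is the global attribute, to which the render attribute of tagUsage points.

```xml
<!-- Hypothetical declarations: counts, identifiers, and the scheme described are invented -->
<tagsDecl>
  <rendition id="rend.it">italic</rendition>
  <tagUsage gi="hi" occurs="28" ident="2" render="rend.it">Used only
    to mark words italicized in the copy text.</tagUsage>
  <tagUsage gi="foreign" occurs="11" render="rend.it">Used for all
    non-English words.</tagUsage>
</tagsDecl>
<refsDecl>
  <p>A canonical reference is formed from the n attribute of a div1
    (the chapter number), a full stop, and the n attribute of a p
    within it (the paragraph number), for example 2.14.</p>
</refsDecl>
```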
- taxonomy
- defines a typology used to classify texts either implicitly, by means of a bibliographic citation, or explicitly by a structured taxonomy.
- bibl
- contains a loosely-structured bibliographic citation of which the sub-components may or may not be explicitly tagged.
- category
- contains an individual descriptive category, possibly nested within a superordinate category, within a user-defined taxonomy.
- catDesc
- describes some category within a taxonomy or text typology, in the form of a brief prose description.
Linkage between a particular text and a category within such a taxonomy is made by means of the catRef element within the textClass element, as further described below.
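A small taxonomy, and a catRef pointing into it, might be sketched as follows. The category identifiers and descriptions are invented, and the two fragments would appear in different parts of the header (classDecl within the encoding description, textClass within the profile description).

```xml
<!-- Hypothetical taxonomy, declared in the encoding description -->
<classDecl>
  <taxonomy id="genre">
    <category id="genre.prose">
      <catDesc>Prose</catDesc>
      <category id="genre.prose.fic">
        <catDesc>Prose fiction</catDesc>
      </category>
    </category>
    <category id="genre.verse">
      <catDesc>Verse</catDesc>
    </category>
  </taxonomy>
</classDecl>

<!-- Elsewhere, in the profile description, the text is assigned to a category -->
<textClass>
  <catRef target="genre.prose.fic"/>
</textClass>
```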
The Profile Description
- creation
- contains information about the creation of a text.
- langUsage
- describes the languages, sublanguages, registers, dialects, etc., represented within a text.
- textClass
- groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.
- keywords
- contains a list of keywords or phrases identifying the topic or nature of a text. Attributes include:
- scheme
- identifies the controlled vocabulary within which the set of keywords concerned is defined.
- classCode
- contains the classification code used for this text in some standard classification system. Attributes include:
- scheme
- identifies the classification system or taxonomy in use.
- catRef
- specifies one or more defined categories within some taxonomy or text typology. Attributes include:
- target
- identifies the categories concerned.
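Putting these elements together, a profile description might look like the following sketch. All values are invented: the scheme names, the classification code, and the category identifier are placeholders, and the use of a prose paragraph within langUsage is an assumption about what the DTD permits.

```xml
<!-- Hypothetical profile description: every value here is a placeholder -->
<profileDesc>
  <creation>Composed in <date value="1900">1900</date>.</creation>
  <langUsage>
    <p>The text is in English, with occasional phrases in French.</p>
  </langUsage>
  <textClass>
    <keywords scheme="example-subjects">
      <list>
        <item>Fiction</item>
        <item>Nineteenth century</item>
      </list>
    </keywords>
    <classCode scheme="example-codes">X 123</classCode>
    <catRef target="genre.prose.fic"/>
  </textClass>
</profileDesc>
```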
The Revision Description
- date
- contains a date in any format.
- respStmt
- supplies a statement of responsibility for someone responsible for the intellectual content of a text, edition, recording, or series, where the specialized elements for authors, editors, etc., do not suffice or do not apply.
- item
- contains one component of a list.
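A revision log using these three constituents might be sketched as follows. The wrapping change element is an assumption, since it is not among the elements listed above; the dates, initials, and descriptions are invented.

```xml
<!-- Hypothetical revision log: the change wrapper is assumed, and all entries are invented -->
<revisionDesc>
  <change>
    <date>2002-05-14</date>
    <respStmt>
      <name>ANE</name>
      <resp>ed.</resp>
    </respStmt>
    <item>Header revised to record the new distributor.</item>
  </change>
  <change>
    <date>2002-03-01</date>
    <respStmt>
      <name>ANE</name>
      <resp>ed.</resp>
    </respStmt>
    <item>File converted to XML.</item>
  </change>
</revisionDesc>
```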
Appendix A: List of Elements Described
Appendix A.1: Global Attributes
- ana
- links an element with its interpretation.
- corresp
- links an element with one or more other corresponding elements.
- id
- Unique identifier for the element; must begin with a letter, can contain letters, digits, hyphens, and periods.
- lang
- language of the text in this element; if not specified, language is assumed to be the same as in the surrounding context.
- n
- Name or number of this element; may be any string of characters. Often used for recording traditional reference systems.
- next
- links an element to the next element in an aggregate.
- prev
- links an element to the previous element in an aggregate.
- rend
- physical realization of the element in the copy text: italic, roman, display block, etc. Value may be any string of characters.
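The following sketch shows several of these attributes in use on an invented passage. The identifiers and the language codes are assumed values; the permitted form of a lang value depends on the DTD in use.

```xml
<!-- Hypothetical passage: identifiers and language codes are illustrative -->
<div1 id="ch2" n="2" lang="en">
  <head>Chapter 2</head>
  <p id="ch2.p1" n="1">She called it a <hi rend="italic">remarkable</hi>
    success, a real <foreign lang="fr">tour de force</foreign>.</p>
</div1>
```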
Appendix A.2: Elements in TEI Lite
- abbr
- contains an abbreviation of any sort; expansion may be given in the expan attribute.
- add
- contains letters, words, or phrases inserted in the text by an author, scribe, annotator, or corrector.
- address
- contains a postal or other address, for example of a publisher, an organization, or an individual.
- addrLine
- contains one line of a postal or other address.
- anchor
- specifies a location or point within a document so that it may be pointed to.
- argument
- A formal list or prose description of the topics addressed by a subdivision of a text.
- author
- in a bibliographic reference, contains the name of the author(s), personal or corporate, of a work; the primary statement of responsibility for any bibliographic item.
- authority
- supplies the name of a person or other agency responsible for making an electronic file available, other than a publisher or distributor.
- availability
- supplies information about the availability of a text, for example any restrictions on its use or distribution, its copyright status, etc.
- back
- contains any appendixes, etc., following the main part of a text.
- bibl
- contains a loosely-structured bibliographic citation of which the sub-components may or may not be explicitly tagged.
- biblFull
- contains a fully-structured bibliographic citation, in which all components of the TEI file description are present.
- biblScope
- defines the scope of a bibliographic reference, for example as a list of page numbers, or a named subdivision of a larger work.
- body
- contains the whole body of a single unitary text, excluding any front or back matter.
- byline
- contains the primary statement of responsibility given for a work on its title page or at the head or end of the work.
- catDesc
- describes some category within a taxonomy or text typology, in the form of a brief prose description.
- category
- contains an individual descriptive category, possibly nested within a superordinate category, within a user-defined taxonomy.
- catRef
- specifies one or more defined categories within some taxonomy or text typology.
- cell
- contains one cell of a table.
- cit
- A quotation from some other document, together with a bibliographic reference to its source.
- classCode
- contains the classification code used for this text in some standard classification system, which is identified by the scheme attribute.
- classDecl
- contains one or more taxonomies defining any classificatory codes used elsewhere in the text.
- closer
- groups together dateline, byline, salutation, and similar phrases appearing as a final group at the end of a division, especially of a letter.
- code
- contains a short fragment of code in some formal language (often a programming language).
- corr
- contains the correct form of a passage apparently erroneous in the copy text.
- creation
- contains information about the creation of a text.
- date
- contains a date in any format, with normalized value in the value attribute.
- dateline
- contains a brief description of the place, date, time, etc., of production of a letter, newspaper story, or other work, prefixed or suffixed to it as a kind of heading or trailer.
- del
- contains a letter, word or passage deleted, marked as deleted, or otherwise indicated as superfluous or spurious in the copy text by an author, scribe, annotator or corrector.
- distributor
- supplies the name of a person or other agency responsible for the distribution of a text.
- div
- contains a subdivision of the front, body, or back of a text.
- div1 ... div7
- contains a first-, second-, ..., seventh-level subdivision of the front, body, or back of a text.
- divGen
- indicates the location at which a textual division generated automatically by a text-processing application is to appear; the type attribute specifies whether it is an index, table of contents, or something else.
- docAuthor
- contains the name of the author of the document, as given on the title page (often but not always contained in a byline).
- docDate
- contains the date of the document, as given (usually) on the title page.
- docEdition
- contains an edition statement as presented on a title page of a document.
- docImprint
- contains the imprint statement (place and date of publication, publisher name), as given (usually) at the foot of a title page.
- docTitle
- contains the title of a document, including all its constituents, as given on a title page. Must be divided into titlePart elements.
- edition
- describes the particularities of one edition of a text.
- editionStmt
- groups information relating to one edition of a text.
- editor
- secondary statement of responsibility for a bibliographic item, for example the name of an individual, institution, or organization (or of several such) acting as editor, compiler, translator, etc.
- editorialDecl
- provides details of editorial principles and practices applied during the encoding of a text.
- eg
- contains a single short example of some technical topic being discussed, e.g. a code fragment or a sample of SGML encoding.
- emph
- marks words or phrases which are stressed or emphasized for linguistic or rhetorical effect.
- encodingDesc
- documents the relationship between an electronic text and the source or sources from which it was derived.
- epigraph
- contains a quotation, anonymous or attributed, appearing at the start of a section or chapter, or on a title page.
- extent
- describes the approximate size of the electronic text as stored on some carrier medium, specified in any convenient units.
- figure
- marks the spot at which a graphic is to be inserted in a document. Attributes may be used to indicate an SGML entity containing the image itself (in some non-SGML notation); paragraphs within the figure element may be used to transcribe captions.
- fileDesc
- contains a full bibliographic description of an electronic file.
- foreign
- identifies a word or phrase as belonging to some language other than that of the surrounding text.
- formula
- contains a mathematical or chemical formula, optionally presented in some non-SGML notation. The notation attribute is used to name the non-SGML notation used to transcribe the formula.
- front
- contains any prefatory matter (headers, title page, prefaces, dedications, etc.) found before the start of a text proper.
- funder
- specifies the name of an individual, institution, or organization responsible for the funding of a project or text.
- gap
- indicates a point where material has been omitted in a transcription, whether for editorial reasons described in the TEI header, as part of sampling practice, or because the material is illegible or inaudible.
- gi
- contains a special type of identifier: an SGML generic identifier, or element name.
- gloss
- marks a word or phrase which provides a gloss or definition for some other word or phrase.
- group
- contains a number of unitary texts or groups of texts.
- head
- contains any heading, for example, the title of a section, or the heading of a list or glossary.
- hi
- marks a word or phrase as graphically distinct from the surrounding text, for reasons concerning which no claim is made.
- ident
- contains an identifier of some kind, e.g. a variable name or the name of an SGML element or attribute.
- idno
- supplies any standard or non-standard number used to identify a bibliographic item; the type attribute identifies the scheme or standard.
- imprint
- groups information relating to the publication or distribution of a bibliographic item.
- index
- marks a location to be indexed for some purpose. Attributes are used to give the main form, and second- through fourth-level forms to be entered in the index indicated.
- interp
- provides for an interpretive annotation which can be linked to a span of text. Attributes include resp, type, and value.
- interpGrp
- collects together interp tags.
- item
- contains one component of a list.
- keywords
- contains a list of keywords or phrases identifying the topic or nature of a text; if the keywords come from a controlled vocabulary, it can be identified by the scheme attribute.
- kw
- contains a keyword in some formal language.
- l
- contains a single, possibly incomplete, line of verse.
- label
- contains the label associated with an item in a list; in glossaries, marks the term being defined.
- langUsage
- describes the languages, sublanguages, registers, dialects, etc., represented within a text.
- lb
- marks the start of a new (typographic) line in some edition or version of a text.
- lg
- contains a group of verse lines functioning as a formal unit e.g. a stanza, refrain, verse paragraph, etc.
- list
- contains any sequence of items organized as a list, whether of numbered, bulleted, or other type.
- listBibl
- contains a list of bibliographic citations of any kind.
- mentioned
- marks words or phrases mentioned, not used.
- milestone
- marks the boundary between sections of a text, as indicated by changes in a standard reference system. Attributes include ed (edition), unit (page, etc.), and n (new value).
- name
- contains a proper noun or noun phrase. Attributes can indicate its type, give a normalized form, or associate it with a specific individual or thing by means of a unique identifier.
- note
- contains a note or annotation, with attributes to indicate the type, location, and source of the note.
- notesStmt
- collects together any notes providing information about a text additional to that recorded in other parts of the bibliographic description.
- num
- contains a number, written in any form, with normalized value in the value attribute.
- opener
- groups together dateline, byline, salutation, and similar phrases appearing as a preliminary group at the start of a division, especially of a letter.
- orig
- contains the original form of a reading, for which a regularized form may be given in the attribute reg.
- p
- marks paragraphs in prose.
- pb
- marks the boundary between one page of a text and the next in a standard reference system.
- principal
- supplies the name of the principal researcher responsible for the creation of an electronic text.
- profileDesc
- provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting.
- projectDesc
- describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
- ptr
- a pointer to another location in the current document in terms of one or more identifiable elements.
- publicationStmt
- groups information concerning the publication or distribution of an electronic or other text.
- publisher
- provides the name of the organization responsible for the publication or distribution of a bibliographic item.
- pubPlace
- contains the name of the place where a bibliographic item was published.
- q
- contains a quotation or apparent quotation.
- ref
- a reference to another location in the current document, in terms of one or more identifiable elements, possibly modified by additional text or comment.
- refsDecl
- specifies how canonical references are constructed for this text.
- reg
- contains a reading which has been regularized or normalized in some sense; original reading may be given in the attribute orig.
- rendition
- supplies information about the intended rendition of one or more elements.
- resp
- contains a phrase describing the nature of a person's intellectual responsibility.
- respStmt
- supplies a statement of responsibility for someone responsible for the intellectual content of a text, edition, recording, or series, where the specialized elements for authors, editors, etc., do not suffice or do not apply.
- revisionDesc
- summarizes the revision history for a file.
- row
- contains one row of a table.
- rs
- contains a general purpose name or referring string. Attributes can indicate its type, give a normalized form, or associate it with a specific individual or thing by means of a unique identifier.
- s
- identifies an s-unit within a document, for purposes of establishing a simple canonical referencing scheme covering the entire text.
- salute
- contains a salutation or greeting prefixed to a foreword, dedicatory epistle or other division of a text, or the salutation in the closing of a letter, preface, etc.
- samplingDecl
- contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection.
- seg
- identifies a span or segment of text within a document so that it may be pointed to; the type attribute categorizes the segment.
- series
- contains information about the series in which a book or other bibliographic item has appeared.
- seriesStmt
- groups information about the series, if any, to which a publication belongs.
- sic
- contains text reproduced although apparently incorrect or inaccurate.
- signed
- contains the closing salutation, etc., appended to a foreword, dedicatory epistle, or other division of a text.
- soCalled
- contains a word or phrase for which the author or narrator indicates a disclaiming of responsibility, for example by the use of scare quotes or italics.
- sourceDesc
- supplies a bibliographic description of the copy text(s) from which an electronic text was derived or generated.
- sp
- contains an individual speech in a performance text, or a passage presented as such in a prose or verse text, with who attribute to identify speaker.
- speaker
- contains a special form of heading or label, giving the name of one or more speakers in a performance text or fragment.
- sponsor
- specifies the name of a sponsoring organization or institution.
- stage
- contains any kind of stage direction within a performance text or fragment.
- table
- contains text displayed in tabular form, in rows and columns.
- tagsDecl
- provides detailed information about the tagging applied to an SGML document.
- tagUsage
- supplies information about the usage of a specific element within the outermost text of a TEI conformant document.
- taxonomy
- defines a typology used to classify texts either implicitly, by means of a bibliographic citation, or explicitly by a structured taxonomy.
- term
- contains a single-word, multi-word or symbolic designation which is regarded as a technical term.
- textClass
- groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.
- time
- contains a phrase defining a time of day in any format, with normalized value in the value attribute.
- title
- contains the title of a work, whether article, book, journal, or series, including any alternative titles or subtitles.
- titlePage
- contains the title page of a text, appearing within the front or back matter.
- titlePart
- contains a subsection or division of the title of a work, as indicated on a title page; also used for free-floating fragments of the title page not part of the document title, authorship attribution, etc.
- titleStmt
- groups information about the title of a work and those responsible for its intellectual content.
- trailer
- contains a closing title or footer appearing at the end of a division of a text.
- unclear
- contains a word, phrase, or passage which cannot be transcribed with certainty because it is illegible or inaudible in the source.
- xptr
- defines a pointer to another location in the current document or an external document.
- xref
- defines a pointer to another location in the current document or an external document, possibly modified by additional text or comment.