1 The TEI Infrastructure
This chapter describes the infrastructure for the encoding scheme defined by these Guidelines. It introduces the conceptual framework within which the following chapters are to be understood, and the means by which that conceptual framework is implemented. It assumes some familiarity with XML and XML schemas (see chapter v. A Gentle Introduction to XML) but is intended to be accessible to any user of these Guidelines. Other chapters supply further technical details, in particular chapter 22 Documentation Elements which describes the XML schema used to express these Guidelines themselves, and chapter 23 Using the TEI which combines a discussion of modification and conformance issues with a description of the intended behaviour of an ODD processor; these chapters should be read by anyone intending to implement a new TEI-based system.
The TEI encoding scheme consists of a number of modules, each of which declares particular XML elements and their attributes. Part of an element's declaration includes its assignment to one or more element classes. Another part defines its possible content and attributes with reference to these classes. This indirection gives the TEI system much of its strength and its flexibility. Elements may be combined more or less freely to form a schema appropriate to a particular set of requirements. It is also easy to add new elements which reference existing classes or elements to a schema, as it is to exclude some of the elements provided by any module included in a schema.
In principle, a TEI schema may be constructed using any combination of modules. However, certain TEI modules are of particular importance, and should always be included in all but exceptional circumstances: the module tei described in the present chapter is of this kind because it defines classes, macros, and datatypes which are used by all other modules. The core module, defined in chapter 3 Elements Available in All TEI Documents contains declarations for elements and attributes which are likely to be needed in almost any kind of document, and is therefore recommended for global use. The header module defined in chapter 2 The TEI Header provides declarations for the metadata elements and attributes constituting the TEI header, a component which is required for TEI conformance, while the textstructure module defined in chapter 4 Default Text Structure declares basic structural elements needed for the encoding of most book-like objects. Most schemas will therefore need to include these four modules.
The specification for a TEI schema is itself a TEI document, using elements from the module described in chapter 22 Documentation Elements: we refer to such a document informally as an ODD document, from the design goal originally formulated for the system: ‘One Document Does it all’. Stylesheets for maintaining and processing ODD documents are maintained by the TEI, and these Guidelines are also maintained as such a document. As further discussed in 23.5 Implementation of an ODD System, an ODD document can be processed to generate a schema expressed using any of the three schema languages currently in wide use: the XML DTD language, the ISO RELAX NG language, or the W3C Schema language, as well as to generate documentation such as the Guidelines and their associated web site.
The bulk of this chapter describes the TEI infrastructure module itself. Although it may be skipped at a first reading, an understanding of the topics addressed here is essential for anyone planning to take full advantage of the TEI customization techniques described in chapter 23.3 Customization.
The chapter begins by briefly characterizing each of the modules available in the TEI scheme. Section 1.2 Defining a TEI Schema describes in general terms the method of constructing a TEI schema in a specific schema language such as XML DTD language, RELAX NG, or W3C Schema.
The next and largest part of the chapter introduces the attribute and element classes used to define groups of elements and their characteristics (section 1.3 The TEI Class System).
Finally, section 1.4 Macros introduces the concept of macros, which are used to express some commonly used content models, and lists the datatypes used to constrain the range of legal values for TEI attributes (section 1.4.2 Datatype Specifications).
1.1 TEI Modules
These Guidelines define several hundred elements and attributes for marking up documents of any kind. Each definition has the following components:
- a prose description
- a formal declaration, expressed using a special-purpose XML vocabulary defined by these Guidelines in combination with elements taken from the ISO schema language RELAX NG
- usage examples
Each chapter of these Guidelines presents a group of related elements, and also defines a corresponding set of declarations, which we call a module. All the definitions are collected together in the reference sections provided as an appendix. Formal declarations for a given chapter are collected together within the corresponding module. For convenience, each element is assigned to a single module, typically for use in some specific application area, or to support a particular kind of usage. A module is thus simply a convenient way of grouping together a number of associated element declarations. In the simple case, a TEI schema is made by combining together a small number of modules, as further described in section 1.2 Defining a TEI Schema below.
The following table lists the modules defined by the current release of these Guidelines:
Module name | Formal public identifier | Where defined |
analysis | Analysis and Interpretation | 17 Simple Analytic Mechanisms |
certainty | Certainty and Uncertainty | 21 Certainty, Precision, and Responsibility |
core | Common Core | 3 Elements Available in All TEI Documents |
corpus | Metadata for Language Corpora | 15 Language Corpora |
dictionaries | Print Dictionaries | 9 Dictionaries |
drama | Performance Texts | 7 Performance Texts |
figures | Tables, Formulae, Figures | 14 Tables, Formulæ, Graphics and Notated Music |
gaiji | Character and Glyph Documentation | 5 Characters, Glyphs, and Writing Modes |
header | Common Metadata | 2 The TEI Header |
iso-fs | Feature Structures | 18 Feature Structures |
linking | Linking, Segmentation, and Alignment | 16 Linking, Segmentation, and Alignment |
msdescription | Manuscript Description | 10 Manuscript Description |
namesdates | Names, Dates, People, and Places | 13 Names, Dates, People, and Places |
nets | Graphs, Networks, and Trees | 19 Graphs, Networks, and Trees |
spoken | Transcribed Speech | 8 Transcriptions of Speech |
tagdocs | Documentation Elements | 22 Documentation Elements |
tei | TEI Infrastructure | 1 The TEI Infrastructure |
textcrit | Text Criticism | 12 Critical Apparatus |
textstructure | Default Text Structure | 4 Default Text Structure |
transcr | Transcription of Primary Sources | 11 Representation of Primary Sources |
verse | Verse | 6 Verse |
For each module listed above, the corresponding chapter gives a full description of the classes, elements, and macros which it makes available when it is included in a schema. Other chapters of these Guidelines explore other aspects of using the TEI scheme.
1.2 Defining a TEI Schema
To determine that an XML document is valid (as opposed to merely well-formed), its structure must be checked against a schema, as discussed in chapter v. A Gentle Introduction to XML. For a valid TEI document, this schema must be a conformant TEI schema, as further defined in chapter 23.4 Conformance. Local systems may allow their schema to be implicit, but for interchange purposes the schema associated with a document must be made explicit. The method of doing this recommended by these Guidelines is to provide explicitly or by reference a TEI schema specification against which the document may be validated.
A TEI-conformant schema is a specific combination of TEI modules, possibly also including additional declarations that modify the element and attribute declarations contained by each module, for example to suppress or rename some elements. The TEI provides an application-independent way of specifying a TEI schema by means of the schemaSpec element defined in chapter 22 Documentation Elements. The same system may also be used to specify a schema which extends the TEI by adding new elements explicitly, or by reference to other XML vocabularies. In either case, the specification may be processed to generate a formal schema, expressed in a variety of specific schema languages, such as XML DTD language, RELAX NG, or W3C Schema. These output schemas can then be used by an XML processor such as a validator or editor to validate or otherwise process documents. Further information about the processing of a TEI formal specification is given in chapter 23 Using the TEI.
1.2.1 A Simple Customization
<schemaSpec ident="TEI-minimal" start="TEI">
<moduleRef key="tei"/>
<moduleRef key="header"/>
<moduleRef key="core"/>
<moduleRef key="textstructure"/>
</schemaSpec>
This schema specification contains references to each of four modules, identified by the key attribute on the moduleRef element. The schema specification itself is also given an identifier (TEI-minimal). An ODD processor will generate an appropriate schema from this set of declarations, expressed using the XML DTD language, the ISO RELAX NG language, the W3C Schema language, or in principle any other adequately powerful schema language. The resulting schema may then be associated with the document instance by one of a number of different mechanisms, as further described in chapter v. A Gentle Introduction to XML. The start point (or root element) of document instances to be validated against the schema is specified by means of the start attribute. Further information about the processing of an ODD specification is given in 23.5 Implementation of an ODD System.
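As a minimal sketch of how software might inspect such a specification, the following Python reads the moduleRef elements of a schemaSpec and reports which of the four recommended modules are absent. The function names, the value given for start, and the omission of namespace declarations are assumptions made for illustration; a real ODD document places these elements in the TEI namespace, and an actual ODD processor does considerably more than this.

```python
import xml.etree.ElementTree as ET

# The four modules that this chapter recommends for most schemas.
RECOMMENDED = {"tei", "header", "core", "textstructure"}

# Illustrative specification; namespace declarations are omitted for brevity
# and the value of the start attribute is an assumption.
SPEC = """
<schemaSpec ident="TEI-minimal" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
</schemaSpec>
"""


def referenced_modules(spec_xml):
    """Return the set of module names named by moduleRef/@key."""
    root = ET.fromstring(spec_xml)
    return {ref.get("key") for ref in root.iter("moduleRef")}


def missing_recommended(spec_xml):
    """Return the recommended modules that the specification leaves out."""
    return RECOMMENDED - referenced_modules(spec_xml)
```

Applied to the specification above, missing_recommended returns an empty set; dropping one of the moduleRef elements would cause the corresponding module name to be reported.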
1.2.2 A Larger Customization
These Guidelines introduce each of the modules making up the TEI scheme one by one, and therefore, for clarity of exposition, each chapter focusses on elements drawn from a single module. In reality, of course, the markup of a text will draw on elements taken from many different modules, partly because texts are heterogeneous objects, and partly because encoders have different goals. Some examples of this heterogeneity include:
- a text may be a collection of other texts of different types: for example, an anthology of prose, verse, and drama;
- a text may contain other smaller, embedded texts: for example, a poem or song included in a prose narrative;
- some sections of a text may be written in one form, and others in a different form: for example, a novel where some chapters are in prose, others take the form of dictionary entries, and still others the form of scenes in a play;
- an encoded text may include detailed analytic annotation, for example of rhetorical or linguistic features;
- an encoded text may combine a literal transcription with a diplomatic edition of the same or different sources;
- the description of a text may require additional specialized metadata elements, for example when describing manuscript material in detail.
The TEI provides mechanisms to support all of these and many other use cases. The architecture permits elements and attributes from any combination of modules to co-exist within a single schema. Within particular modules, elements and attributes are provided to support differing views of the ‘granularity’ of a text, for example:
- a definition of a corpus or collection as a series of TEI documents, sharing a common TEI header (see chapter 15 Language Corpora)
- a definition of composite texts which combine optional front- and back-matter with a group of collected texts, themselves possibly composite (see section 4.3.1 Grouped Texts)
- an element for the representation of embedded texts, where one narrative appears to ‘float’ within another (see section 4.3.2 Floating Texts)
Subsequent chapters of these Guidelines describe in detail markup constructs appropriate for these and many other possible features of interest. The markup constructs can be combined as needed for any given set of applications or project.
<moduleRef key="tei"/>
<moduleRef key="header"/>
<moduleRef key="core"/>
<moduleRef key="textstructure"/>
<moduleRef key="msdescription"/>
<!-- manuscript description -->
<moduleRef key="transcr"/>
<!-- transcription of primary sources -->
<moduleRef key="figures"/>
<!-- figures and tables -->
<moduleRef key="namesdates"/>
<!-- names, dates, people, and places -->
</schemaSpec>
<moduleRef key="tei"/>
<moduleRef key="core"/>
<moduleRef key="textstructure"/>
<moduleRef key="transcr"/>
</schemaSpec>
The TEI architecture also supports more detailed customization beyond the simple selection of modules. A schema may suppress elements from a module, suppress some of their attributes, change their names, or even add new elements and attributes. Detailed discussion of the kinds of modification possible in this way is provided in 23.3 Customization, and conformance rules relating to their application are discussed in 23.4 Conformance. These facilities are available for any schema language (though some features may not be available in all languages). The ODD language also makes it possible to combine TEI and non-TEI modules into a single schema, provided that the non-TEI module is expressed using the RELAX NG schema language (see further 22.8.2 Combining TEI and Non-TEI Modules).
1.3 The TEI Class System
The TEI scheme distinguishes about five hundred different elements. To aid comprehension, modularity, and modification, the majority of these elements are formally classified in some way. Classes are used to express two distinct kinds of commonality among elements. The elements of a class may share some set of attributes, or they may appear in the same locations in a content model. A class is known as an attribute class if its members share attributes, and as a model class if its members appear in the same locations. In either case, an element is said to inherit properties from any classes of which it is a member.
Classes (and therefore elements which are members of those classes) may also inherit properties from other classes. For example, supposing that class A is a member (or a subclass) of class B, any element which is a member of class A will inherit not only the properties defined by class A, but also those defined by class B. In such a situation, we also say that class B is a superclass of class A. The properties of a superclass are inherited by all members of its subclasses.
A basic understanding of the classes into which the TEI scheme is organized is strongly recommended and is essential for any successful customization of the system.
1.3.1 Attribute Classes
An attribute class groups together elements which share some set of common attributes. Attribute classes are given names composed of the prefix att., often followed by an adjective. For example, the members of the class att.canonical have in common a key and a ref attribute, both of which are inherited from their membership in the class rather than individually defined for each element. These attributes are said to be defined by (or inherited from) the att.canonical class. If another element were to be added to the TEI scheme for which these attributes were considered useful, the simplest way to provide them would be to make the new element a member of the att.canonical class. Note also that this method ensures that the attributes in question are always defined in the same way, taking the same default values etc., no matter which element they are attached to.
Some attribute classes are defined within the tei infrastructural module and are thus globally available. Other attribute classes are specific to particular modules and thus defined in other chapters. Attributes defined by such classes will not be available unless the module concerned is included in a schema.
The attributes provided by an attribute class are those specified by the class itself, either directly, or by inheritance from another class. For example, the attribute class att.pointing.group provides attributes domains and targFunc to all of its members. This class is however a subclass of the att.pointing class, from which its members also inherit the attributes target, targetLang and evaluate. Members of the class att.pointing will thus have these three attributes, while members of the class att.pointing.group will have all five.
Note that some modules define superclasses of an existing infrastructural class. For example, the global attribute class att.divLike makes attributes org and sample available, while the att.metrical class, which is specific to the verse module, provides attributes met, real, and rhyme. Because att.metrical is defined as a superclass of att.divLike, all five of these attributes are available to elements which are members of att.divLike; the declaration for att.metrical adds its three attributes to the two already defined by att.divLike when the verse module is included in a schema. If, however, this module is not included in a schema, then the att.divLike class supplies only the two attributes first mentioned.
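The inheritance rules described in the two preceding paragraphs can be sketched in a few lines of Python. The table below is a hand-written stand-in for the real class declarations, not something generated from the TEI source: the class names and attribute names come from the text above, but the data layout, the function name, and the module assigned to att.pointing and att.pointing.group are assumptions made for illustration.

```python
# Each class lists the module that declares it, the attributes it defines
# directly, and the classes of which it is itself a member (its superclasses).
# The "tei" module assignment for the two pointing classes is an assumption.
CLASSES = {
    "att.pointing": {
        "module": "tei",
        "attrs": ["target", "targetLang", "evaluate"],
        "member_of": [],
    },
    "att.pointing.group": {
        "module": "tei",
        "attrs": ["domains", "targFunc"],
        "member_of": ["att.pointing"],
    },
    "att.metrical": {
        "module": "verse",
        "attrs": ["met", "real", "rhyme"],
        "member_of": [],
    },
    "att.divLike": {
        "module": "tei",
        "attrs": ["org", "sample"],
        "member_of": ["att.metrical"],
    },
}


def attributes_of(class_name, modules):
    """Return the attributes a class supplies, given the modules in the schema.

    A class whose module is not included contributes nothing, which is how a
    module-specific superclass such as att.metrical drops out of the result.
    """
    spec = CLASSES[class_name]
    if spec["module"] not in modules:
        return []
    attrs = []
    # Inherited attributes first, then those defined by the class itself.
    for parent in spec["member_of"]:
        for attr in attributes_of(parent, modules):
            if attr not in attrs:
                attrs.append(attr)
    for attr in spec["attrs"]:
        if attr not in attrs:
            attrs.append(attr)
    return attrs
```

With only the tei module selected, attributes_of("att.divLike", {"tei"}) yields org and sample; adding verse to the set of modules yields all five attributes, and att.pointing.group yields its two attributes together with the three inherited from att.pointing.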
Attributes specific to particular modules are documented along with the relevant module rather than in the present chapter. One particular attribute class, known as att.global, is common to all modules, and is therefore described in some detail in the next section. A full list of all attribute classes is given in Appendix B Attribute Classes below.
1.3.1.1 Global Attributes
The following attributes are defined in the infrastructure module for every TEI element.
- att.global provides attributes common to all elements in the TEI encoding scheme.
  - xml:id (identifier) provides a unique identifier for the element bearing the attribute.
  - n (number) gives a number (or other label) for an element, which is not necessarily unique within the document.
  - xml:lang (language) indicates the language of the element content using a ‘tag’ generated according to BCP 47.
  - rend [att.global.rendition] (rendition) indicates how the element in question was rendered or presented in the source text.
  - style [att.global.rendition] contains an expression in some formal style definition language which defines the rendering or presentation used for this element in the source text.
  - rendition [att.global.rendition] points to a description of the rendering or presentation used for this element in the source text.
  - xml:base provides a base URI reference with which applications can resolve relative URI references into absolute URI references.
  - xml:space signals an intention about how white space should be managed by applications.
  - source [att.global.source] specifies the source from which some aspect of this element is drawn.
  - cert [att.global.responsibility] (certainty) signifies the degree of certainty associated with the intervention or interpretation.
  - resp [att.global.responsibility] (responsible party) indicates the agency responsible for the intervention or interpretation, for example an editor or transcriber.
Some of these attributes (specifically xml:id, n, xml:lang, xml:base and xml:space) are provided by the att.global attribute class itself. The others are provided by one of its subclasses: att.global.rendition, att.global.responsibility, or att.global.source. Their usage is discussed in the following subsections.
Several other globally-available attributes are defined by other subclasses of the att.global class. These are provided by other modules, and are therefore discussed in the chapter discussing that module. A brief summary table is provided in section 1.3.1.1.7 Other Globally Available Attributes below.
1.3.1.1.1 Element Identifiers and Labels
The value supplied for the xml:id attribute must be a legal name, as defined in the World Wide Web Consortium's XML Recommendation. This means that it must begin with a letter, or the underscore character (‘_’), and contain no characters other than letters, digits, hyphens, underscores, full stops, and certain combining and extension characters.
In XML names (and thus the values of xml:id in an XML TEI document) uppercase and lowercase letters are distinguished, and thus partTime and parttime are two distinctly different names, and could (though perhaps unwisely) be used to denote two different element occurrences.
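The following Python gives a rough, deliberately simplified check for the naming rules just described. The function name is hypothetical, and the regular expression only approximates the XML name productions: Python's notion of a ‘word’ character is not identical to the set of name characters defined by the XML Recommendation, so a validating XML parser remains the authority.

```python
import re

# First character: a letter or underscore (a word character that is not a digit).
# Remaining characters: letters, digits, underscores, full stops, and hyphens.
# This is an approximation of the XML rules, not a faithful implementation.
_NAME_PATTERN = re.compile(r"[^\W\d][\w.\-]*")


def looks_like_xml_id(value):
    """Return True if value plausibly satisfies the xml:id naming rules."""
    return _NAME_PATTERN.fullmatch(value) is not None
```

Under this check partTime and parttime are both acceptable and, being compared as ordinary case-sensitive strings, remain distinct; a value such as 1st or one containing a space is rejected.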
For a discussion of methods of providing unique identifiers for elements, see section 3.10.2 Creating New Reference Systems.
<item n="1">About These Guidelines</item>
<item n="2">A Gentle Introduction to XML</item>
<item n="9">Verse</item>
<item n="10">Drama</item>
<item n="11">Spoken Materials</item>
<item n="12">Dictionaries</item>
</list>
<!-- ... -->
<div type="stanza" n="xlii">
<!-- ... -->
</div>
</div>
<!-- ... -->
</l>
<l n="2">
<!-- ... -->
</l>
<l n="3">
<!-- ... -->
</l>
<!-- ... -->
<l n="100">
<!-- ... -->
</l>
1.3.1.1.2 Language Indicators
The xml:lang attribute indicates the natural language and writing system applicable to the content of a given element. If it is not specified, the value is inherited from that of the immediately enclosing element. As a rule, therefore, it is simplest to specify the base language of the text on the TEI element, and allow most elements to take the default value for xml:lang; the language of an element then need be explicitly specified only for elements in languages other than the base language. For this reason, it is recommended practice to supply a default value for the xml:lang attribute, either on the TEI root element, or on both the teiHeader and the text element. The latter is appropriate in the not uncommon case where the text element in a TEI document uses a different default language from that of the TEI header attached to it. Other language shifts in the source should be explicitly identified by use of the xml:lang attribute on an element at an appropriate level wherever possible.
<teiHeader>
<!-- ... -->
</teiHeader>
<text>
<!-- ... -->
</text>
</TEI>
<teiHeader xml:lang="en">
<!-- ... -->
</teiHeader>
<text xml:lang="en">
<!-- ... -->
</text>
</TEI>
<teiHeader xml:lang="en">
<!-- ... -->
</teiHeader>
<text xml:lang="fr">
<!-- ... -->
</text>
</TEI>
<teiHeader xml:lang="en">
<!-- ... -->
</teiHeader>
<text xml:lang="fr">
<body>
<div>
<!-- chapter one is in French -->
</div>
<div xml:lang="de">
<!-- chapter two is in German -->
</div>
<div>
<!-- chapter three is French -->
</div>
<!-- ... -->
</body>
</text>
</TEI>
constitution declares <q>that no bill of attainder or <term xml:lang="la">ex post
facto</term> law shall be passed.</q> ...</p>
The values used for the xml:lang and targetLang attributes must be constructed in a particular way, using values from standard lists. See further vi.1. Language Identification.
Additional information about a particular language may be supplied in the language element within the header (see section 2.4.2 Language Usage).
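The inheritance behaviour of xml:lang described in this section can be sketched as a simple tree walk. The sample document below is a cut-down, unnamespaced variant of the French and German example, and the function name is an assumption made for illustration; the n attributes on the div elements are present only to tell the chapters apart.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<TEI>
  <teiHeader xml:lang="en"/>
  <text xml:lang="fr">
    <body>
      <div n="1"/>
      <div n="2" xml:lang="de"/>
      <div n="3"/>
    </body>
  </text>
</TEI>"""


def effective_languages(element, inherited=None):
    """Yield (element, language) pairs, applying xml:lang inheritance.

    An element without its own xml:lang takes the value in force on its
    parent; None means that no language has been declared at all.
    """
    lang = element.get(XML_LANG, inherited)
    yield element, lang
    for child in element:
        yield from effective_languages(child, lang)


root = ET.fromstring(SAMPLE)
div_languages = [
    lang for element, lang in effective_languages(root) if element.tag == "div"
]
```

Here div_languages holds fr, de, and fr in that order: the first and third chapters inherit French from the text element, while the second overrides it with German.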
1.3.1.1.3 Rendition Indicators
and pious; but he was equally alarmed by his knowledge of the ambitious <name rend="italics">Bohemond</name>, and his ignorance of the Transalpine chiefs:
...</p>
the ambitious <name style="font-style: italic">Bohemond</name>, and his ignorance of
the Transalpine chiefs: ...</p>
The main difference between the rend attribute and the style attribute is that the value used for the former may contain one or more tokens from any vocabulary devised by the encoder, separated by space characters, whereas the value used for the latter must be a single string taken from a formally-defined style definition language such as CSS. The value of the rend attribute is a sequence-indeterminate set of whitespace-separated tokens, whereas style values allow whitespace and sequence relationships as part of the formally-defined style definition language.
<!-- define italic style using CSS, selecting it as default for emph and hi elements -->
<rendition xml:id="IT" scheme="css"
selector="emph, hi">font-style: italic;</rendition>
<!-- define a serif font family, selecting it as default for the text element -->
<rendition xml:id="FontRoman" scheme="css"
selector="text">font-family: serif;</rendition>
</tagsDecl>
<!-- ... -->
<text>
<body>
<div>
<p rendition="#IT">
<!-- this paragraph uses the seriffed font, but is in italic-->
</p>
<p>
<!-- this paragraph uses the seriffed font, but is not in italic -->
</p>
</div>
</body>
</text>
The rendition attribute always points to one or more rendition elements, each of which defines some aspect of the rendering or appearance of the text in its original form. These details may most conveniently be described using a formal style definition language, such as CSS (Lie and Bos (eds.) (1999)) or XSL-FO (Berglund (ed.) (2006)); in some other formal language developed for a specific project; or even informally in running prose. Although languages such as CSS and XSL-FO are generally used to describe document output to screen or print, they nonetheless provide formal and precise mechanisms for describing the appearance of source documents, especially print documents, but also many aspects of manuscript documents. For example, both CSS and XSL-FO provide mechanisms for describing typefaces, weight, and styles; character and line spacing; and so on.
As noted above, the style attribute is provided for encoders wishing to describe the appearance of individual source elements using a language such as CSS directly rather than by reference to a rendition element. Its value may be any expression in the chosen formal style definition language.
Formal definition languages such as CSS typically identify a series of properties (such as font-style or margin-left) for which values are specified. A sequence of such property-value pairs makes up a stylesheet. The TEI uses such languages simply to describe the appearance of a source document, rather than to control how it should be formatted.
In the TEI scheme, it is possible to supply information about the appearance of elements within a source document in the following distinct ways:
1. One or more properties may be specified as the default for a set of elements (based on an external scheme, by default CSS), using rendition elements and their selector attributes;
2. One or more properties may be specified for individual element occurrences, using the rend attribute with any convenient set of one or more sequence-indeterminate tokens;
3. One or more properties may be specified for individual element occurrences, using the rendition attribute to point to rendition elements;
4. One or more properties may be supplied explicitly for individual element occurrences, using the style attribute.
If the same property is specified in more than one of the above ways, the one with the highest number in the list above is understood to be applicable. The resulting properties from each way are then combined to provide the full set of property-value pairs applicable to the given element, and (by default) to all of its children.
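The precedence rule just stated can be sketched as a layered merge, in which each of the four ways contributes a set of property-value pairs and later ways override earlier ones. Everything named in the code is an assumption made for illustration: the function names, the naive CSS declaration parser, and in particular REND_VOCABULARY, which stands in for whatever project-specific meaning an encoder assigns to rend tokens.

```python
def parse_declarations(css_text):
    """Naively split 'name: value; name: value' into a dictionary."""
    properties = {}
    for declaration in css_text.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            properties[name.strip()] = value.strip()
    return properties


# Hypothetical mapping from a project's rend tokens to style properties.
REND_VOCABULARY = {
    "italics": {"font-style": "italic"},
    "bold": {"font-weight": "bold"},
}


def rend_properties(rend_value):
    """Translate whitespace-separated rend tokens; token order is irrelevant."""
    properties = {}
    for token in rend_value.split():
        properties.update(REND_VOCABULARY.get(token, {}))
    return properties


def combine(selector_defaults, rend, rendition, style):
    """Merge the four sources; a later source wins for a shared property."""
    combined = {}
    for layer in (selector_defaults, rend, rendition, style):
        combined.update(layer)
    return combined


example = combine(
    parse_declarations("font-family: serif; font-style: normal;"),  # way 1
    rend_properties("italics"),                                     # way 2
    parse_declarations("font-style: italic; font-weight: bold;"),   # way 3
    parse_declarations("font-weight: normal"),                      # way 4
)
```

In this example the serif font family survives from the selector-based default, the italic style supplied by rend and rendition replaces the default normal style, and the style attribute has the last word on font weight.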
For simplicity of processing, the same formal style definition language should be used throughout; however, the architecture does permit this to be varied, by using the scheme attribute to indicate a different language for one or more rendition elements. Care should be taken to ensure that such values can be meaningfully combined. Similar considerations apply to the use of the rend attribute, if this is used in combination with either rendition or style.
Note that these TEI attributes always describe the rendition or appearance of the source document, not intended output renditions, although often the two may be closely related.
1.3.1.1.4 Sources, certainty, and responsibility
The source attribute is used to indicate the source of an element and its content, for example by pointing to a bibliographic citation for a quotation to indicate the source from which it derives. The target of the pointer may be an entry in a bibliographic list of some kind, or a pointer to a digital version of the source itself.
<!-- ... -->
<quote source="#chicago-15_ed">Grammatical theories
are in flux, and the more we learn, the less we
seem to know.</quote>
<!-- ... -->
</p>
<!-- ... -->
<bibl xml:id="chicago-15_ed">
<title level="m">The Chicago Manual of Style</title>,
<edition>15th edition</edition>.
<pubPlace>Chicago</pubPlace>:
<publisher>University of Chicago Press</publisher>
(<date>2003</date>),
<biblScope unit="page">p.147</biblScope>.
</bibl>
<!-- ... -->
<quote source="http://www.chicagomanualofstyle.org/15/ch05/ch05_sec002.html">Grammatical theories
are in flux, and the more we learn, the less we
seem to know.</quote>
<!-- ... -->
</p>
source="http://www.tei-c.org/Vault/P5/2.0.1/xml/tei/odd/p5subset.xml"/>
<sic>cheesemakers</sic>
<corr cert="high">peacemakers</corr>
<corr cert="low">placemakers</corr>
</choice>:
for they shall be called the children of God.
<corr cert="low" resp="#ed2">placemakers</corr>...
<!-- in the <text> ... --><lg>
<!-- ... -->
<l>Punkes, Panders, baſe extortionizing
sla<choice>
<sic>n</sic>
<corr resp="#JENSJ">u</corr>
</choice>es,</l>
<!-- ... -->
</lg>
<!-- in the <teiHeader> ... -->
<!-- ... -->
<respStmt xml:id="JENSJ">
<resp>Transcriber</resp>
<name>Janelle Jenstad</name>
</respStmt>
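The examples in this section use source and resp values of the form #identifier to point at elements elsewhere in the same document. A minimal sketch of how such a pointer might be followed is given below; the function name and the enclosing doc element are assumptions made for illustration, and pointers to external resources are simply left unresolved.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Condensed from the examples above; <doc> is a hypothetical wrapper.
SAMPLE = """<doc>
  <quote source="#chicago-15_ed">Grammatical theories are in flux.</quote>
  <corr resp="#JENSJ">u</corr>
  <bibl xml:id="chicago-15_ed">
    <title level="m">The Chicago Manual of Style</title>
  </bibl>
  <respStmt xml:id="JENSJ">
    <resp>Transcriber</resp>
    <name>Janelle Jenstad</name>
  </respStmt>
</doc>"""


def resolve_local(root, pointer):
    """Return the element identified by a same-document pointer, or None.

    Only pointers of the form '#identifier' are handled; anything else
    (for example an http: URL) would need to be retrieved separately.
    """
    if not pointer.startswith("#"):
        return None
    wanted = pointer[1:]
    for element in root.iter():
        if element.get(XML_ID) == wanted:
            return element
    return None


root = ET.fromstring(SAMPLE)
cited = resolve_local(root, root.find("quote").get("source"))
responsible = resolve_local(root, root.find("corr").get("resp"))
```

Here cited is the bibl element describing the quoted work, and responsible is the respStmt naming the transcriber.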
1.3.1.1.5 Evaluation of Links
Several TEI elements carry attributes whose values are defined as anyURI, meaning that such attributes supply a link or pointer, typically expressed as a URL. Like other XML applications, the TEI allows use of a special attribute to set the context within which relative URLs are to be evaluated. The global attribute xml:base is defined as part of the XML specification and belongs to the XML namespace rather than the TEI namespace. We do not describe it in detail here: reference information about xml:base is provided by Marsh and Tobin (eds.) (2009).
<div xml:base="http://www.example.org/somewhere.xml">
<p>
<!--... -->
<ptr target="elsewhere.xml"/>
<!--... -->
</p>
</div>
<div>
<p>
<!--... -->
<ptr target="elsewhere.xml"/>
<!--... -->
</p>
</div>
</body>
The first ptr is within the scope of a div whose xml:base attribute changes the default context, so its target resolves to http://www.example.org/elsewhere.xml. The second ptr, however, is within the scope of a div which does not change the default context, and its target is therefore a document in the same directory as the current document.
The xml:base attribute is intended to enable the stable resolution of relative URIs in a document after that document's context may have changed (for example as a result of being embedded in another document via XInclude). Setting the xml:base simply as a way to allow encoders to write shorter URIs is not recommended. In particular, xml:base may cause ambiguity as to the referent of same-document references in the form #id (where id is an xml:id). RFC 3986 states that URIs of this type should not result in the loading of a different document. The RFC therefore assumes that such references are internal to the document in which they are located. Using xml:base to denote arbitrary external bases while also using same-document references may mean that software agents deal with these links in unexpected and inconsistent ways. Further discussion of this attribute and its effect on TEI linking methods is provided in chapter 16 Linking, Segmentation, and Alignment.
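The effect of xml:base on the two pointers in the example can be reproduced with the generic URI resolution routine in the Python standard library. The location given for the current document is hypothetical, since the example does not state one.

```python
from urllib.parse import urljoin

# Base set explicitly by xml:base on the first div in the example.
EXPLICIT_BASE = "http://www.example.org/somewhere.xml"

# Hypothetical location of the document containing the example.
DOCUMENT_URI = "http://www.example.com/texts/current.xml"

# First ptr: resolved against the base supplied by xml:base.
first_target = urljoin(EXPLICIT_BASE, "elsewhere.xml")

# Second ptr: no xml:base in scope, so the document's own URI is the base.
second_target = urljoin(DOCUMENT_URI, "elsewhere.xml")

# A same-document reference inside the first div: a generic resolver simply
# attaches the fragment to the xml:base value, which is the source of the
# ambiguity discussed above.
fragment_target = urljoin(EXPLICIT_BASE, "#id")
```

The first pointer resolves to http://www.example.org/elsewhere.xml and the second to a sibling of the current document. The last line shows why mixing xml:base with references of the form #id is risky: resolved mechanically, the reference appears to point into somewhere.xml rather than into the document in which it was written.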
1.3.1.1.6 XML Whitespace
The global attribute xml:space provides a mechanism for indicating to systems processing an XML file how they should treat whitespace, that is, any sequences of consecutive tab (#x09), space (#x20), carriage return (#x0D) or linefeed (#x0A) characters. Like xml:id this attribute is defined as part of the XML specification and belongs to the XML namespace rather than the TEI namespace. Complete information about this attribute is provided by section 2.10 of the XML Specification; here we provide a summary of how its use affects users of the TEI scheme.
The xml:space attribute has only two permitted values: preserve and default. The first indicates that whitespace in a text node—every carriage return, every tab, etc.—should be maintained as is when the document is processed. The second (which is implied when the attribute is not supplied), indicates that whitespace should be handled ‘as appropriate’. Exactly what is deemed appropriate is left unspecified by the XML Recommendation.
<choice>
<sic>1724</sic>
<corr>1728</corr>
</choice>
Similarly, the address element has a content model containing only elements: any punctuation or whitespace required between the lines of an address must therefore be supplied by the processor, as any whitespace present in the input document will be ignored.
Elements with content models of this type are comparatively unusual in the TEI: a list of them is provided in the TEI release file stripspace.xsl.model, formatted there for use as an <xsl:strip-space> command for XSL stylesheets.
<persName>
<forename>Edward</forename>
<forename>George</forename>
<surname type="linked">Bulwer-Lytton</surname>, <roleName>Baron Lytton of
<placeName>Knebworth</placeName>
</roleName>
</persName>
If the default treatment described above is not appropriate for a mixed content element, the processing required may be described in the encodingDesc element of the TEI header, but generic XML processing tools may not take note of this.
Alternatively, the xml:space attribute may be supplied with a value of preserve in order to indicate that every space, tab, carriage return and linefeed character found within that element in the document being processed is significant. Typically, the result of that processing will be to retain the whitespace characters in the output. Thus if the above example began <persName xml:space="preserve">, the resulting text would most likely be rendered over five lines, indented, and with a blank line following.
The xml:space="preserve" attribute is rarely used in TEI documents because such layout features are generally captured with less risk and more precision by using native TEI elements such as lb or space, or by using the renditional attributes described in section 1.3.1.1.3 Rendition Indicators.
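The difference between the two values of xml:space can be illustrated with a small script. This is a sketch under stated assumptions rather than a description of any particular TEI processor: it uses Python's standard `xml.etree.ElementTree`, it interprets the default treatment as "collapse every run of whitespace to a single space and trim the ends" (only one of many possible readings of "as appropriate"), and it inspects xml:space on the outermost element only, ignoring inheritance. The `text_of` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Expanded name of the xml:space attribute as reported by ElementTree.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

SAMPLE = """<persName>
 <forename>Edward</forename>
 <forename>George</forename>
 <surname type="linked">Bulwer-Lytton</surname>, <roleName>Baron Lytton of
 <placeName>Knebworth</placeName>
 </roleName>
</persName>"""


def text_of(xml_source: str) -> str:
    """Return the text of the outermost element.

    Assumption: without xml:space="preserve", whitespace runs are collapsed
    to single spaces; with it, every whitespace character is kept as is.
    """
    root = ET.fromstring(xml_source)
    raw = "".join(root.itertext())
    if root.get(XML_SPACE) == "preserve":
        return raw
    return " ".join(raw.split())


default_text = text_of(SAMPLE)
preserved_text = text_of(
    SAMPLE.replace("<persName>", '<persName xml:space="preserve">', 1)
)

print(repr(default_text))
print(repr(preserved_text))
```

With the collapsing rule assumed here, the first result is a single line of text, while the second retains the line breaks and indentation of the source.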
1.3.1.1.7 Other Globally Available Attributes
The following table lists for convenience other potentially available global attributes. The table specifies the name of the attribute class providing the attributes concerned, the module which must be included in a schema if the attributes are to be made available, and the section of these Guidelines where the class is discussed.
class name | module name | see further |
att.global.linking | linking | 16 Linking, Segmentation, and Alignment |
att.global.analytic | analysis | 17 Simple Analytic Mechanisms |
att.global.facs | transcr | 11.1 Digital Facsimiles |
att.global.change | transcr | 11.6 Identifying Changes and Revisions |
1.3.2 Model Classes
As noted above, the members of a given TEI model class share the property that they can all appear in the same location within a document. Wherever possible, the content model of a TEI element is expressed not directly in terms of specific elements, but indirectly in terms of particular model classes. This makes content models simpler and more consistent; it also makes them much easier to understand and to modify.
Like attribute classes, model classes may have subclasses or superclasses. Just as elements inherit from a class the ability to appear in certain locations of a document (wherever the class can appear), so all members of a subclass inherit the ability to appear wherever any superclass can appear. To some extent, the class system thus provides a way of reducing the whole TEI galaxy of elements into a tidy hierarchy. This is however not entirely the case.
In fact, the nature of a given class of elements can be considered along two dimensions: as noted, it defines a set of places where the class members are permitted within the document hierarchy; it also implies a semantic grouping of some kind. For example, the very large class of elements which can appear within a paragraph comprises a number of other classes, all of which have the same structural property, but which differ in their field of application. Some are related to highlighting, while others relate to names or places, and so on. In some cases, the ‘set of places where class members are permitted’ is very constrained: it may just be within one specific element, or one class of element, for example. In other cases, elements may be permitted to appear in very many places, or in more than one such set of places.
These factors are reflected in the way that model classes are named. If a model class has a name containing part, such as model.divPart or model.biblPart then it is primarily defined in terms of its structural location. For example, those elements (or classes of element) which appear as content of a div constitute the model.divPart class; those which appear as content of a bibl constitute the model.biblPart class. If, however, a model class has a name containing like, such as model.biblLike or model.nameLike, the implication is that its members all have some additional semantic property in common, for example containing a bibliographic description, or containing some form of name, respectively. These semantically-motivated classes often provide a useful way of dividing up large structurally-motivated classes: for example, the very general structural class model.pPart.data (‘data elements that form part of a paragraph’) has four semantically-motivated member classes (model.addressLike, model.dateLike, model.measureLike, and model.nameLike), the last of these being itself a superclass with several members.
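The naming convention just described is regular enough to be checked mechanically. The following sketch, in which the function name `naming_motivation` and its return labels are our own inventions, classifies a model class name according to whether it contains Part or Like. It reflects only the convention stated above and says nothing about classes whose names follow neither pattern.

```python
def naming_motivation(class_name: str) -> str:
    """Classify a model class name by the naming convention described above.

    Returns "structural" for names containing "Part", "semantic" for names
    containing "Like", and "unspecified" otherwise. The labels are
    illustrative, not TEI terminology.
    """
    prefix = "model."
    if not class_name.startswith(prefix):
        raise ValueError(f"not a model class name: {class_name!r}")
    local_name = class_name[len(prefix):]
    if "Part" in local_name:
        return "structural"
    if "Like" in local_name:
        return "semantic"
    return "unspecified"


for name in ("model.divPart", "model.biblLike", "model.pPart.data", "model.common"):
    print(name, "->", naming_motivation(name))
```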
Although most classes are defined by the tei infrastructure module, a class cannot be populated unless some other specific module is included in a schema, since element declarations are contained by modules. Classes are not declared ‘top down’, but instead gain their members as a consequence of individual elements' declaration of their membership. The same class may therefore contain different members, depending on which modules are active. Consequently, the content model of a given element (being expressed in terms of model classes) may differ depending on which modules are active.
Some classes contain only a single member, even when all modules are loaded. One reason for declaring such a class is to make it easier for a customization to add new member elements in a specific place, particularly in areas where the TEI does not make fully elaborated proposals. For example, the TEI class model.rdgLike, initially empty, is expanded by the textcrit module to include just the TEI rdg element. A project wishing to add an alternative way of structuring text-critical information could do so by defining its own elements and adding them to this class.
Another reason for declaring single-member classes is where the class members are not needed in all documents, but appear in the same place as elements which are very frequently required. For example, the specialized element g used to represent a non-Unicode character or glyph is provided as the only member of the model.gLike class when the gaiji module is added to a schema. References to this class are included in almost every content model, since if it is used at all the g must be available wherever text is available; however these references have no effect unless the gaiji module is loaded.
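The way in which classes gain their members "bottom up" can be sketched as follows. This is a simplified model, not the ODD processing actually used to build TEI schemas: the data structures and the `populate` function are hypothetical, and only the two memberships mentioned above (rdg from textcrit, g from gaiji) are represented.

```python
# Classes assumed to exist, initially empty, before any other module is loaded.
DECLARED_CLASSES = ("model.gLike", "model.rdgLike")

# (module, element, class) triples: each element declares its own membership.
ELEMENT_DECLARATIONS = (
    ("gaiji", "g", "model.gLike"),
    ("textcrit", "rdg", "model.rdgLike"),
)


def populate(active_modules: set[str]) -> dict[str, list[str]]:
    """Return the members of each class for a given selection of modules."""
    members: dict[str, list[str]] = {name: [] for name in DECLARED_CLASSES}
    for module, element, model_class in ELEMENT_DECLARATIONS:
        if module in active_modules:
            members[model_class].append(element)
    return members


print(populate({"tei"}))
print(populate({"tei", "gaiji"}))
print(populate({"tei", "gaiji", "textcrit"}))
```

A reference to model.gLike in a content model therefore contributes nothing in the first case and admits g in the second, without the content model itself being changed.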
At the other end of the scale, a few of the classes predefined by the tei module are subsequently populated with very many members. For example, the class model.pPart.edit groups all the classes of element for simple editorial correction and transcription which can appear within a p or paragraph element. The core module alone adds more than fifty elements to this class; the namesdates module adds another twenty, as does the tagdocs module. Since the p element is one of the basic building blocks of a TEI document it is not surprising that each module will need to add elements to it. The class system here provides a very convenient way of controlling the resulting complexity. Typically, elements are not added directly to these very general classes, but via some intermediate semantically-motivated class.
Just as there are a few classes which have a single member, so there are some classes which are used only once in the TEI architecture. These classes, which have no superclass and therefore do not fit into the class hierarchy defined here, are a convenient way of maintaining elements which are highly structured internally, but which appear from the outside to be uniform objects like others at the same level.[2] Members of such classes can only ever appear within one element, or one class of elements. For example, the class model.addrPart is used only to express the content model for the element address; it references some other classes of elements, which can appear elsewhere, and also some elements which can only appear inside an address.
1.3.2.1 Informal Element Classifications
Most TEI elements may also be informally classified as belonging to one of the following groupings:
- divisions
- high level, possibly self-nesting, major divisions of texts. These elements populate such classes as model.divLike or model.div1Like, and typically form the largest component units of a text.
- chunks
- elements such as paragraphs and other paragraph-level elements, which can appear directly within texts or within divisions of them, but not (usually) within other chunks. These elements populate the class model.divPart, either directly or by means of other classes such as model.pLike (paragraph-like elements), model.entryLike, etc.
- phrase-level elements
- elements such as highlighted phrases, book titles, or editorial corrections which can occur only within chunks, but not between them (and thus cannot appear directly within a division). These elements populate the class model.phrase.[3]
The TEI also identifies two further groupings derived from these three:
- inter-level elements
- elements such as lists, notes, quotations, etc. which can appear either between chunks (as children of a div) or within them; these elements populate the class model.inter. Note that this class is not a superset of the model.phrase and model.divPart classes but rather a distinct grouping of elements which are both chunk-like and phrase-like. However, the classes model.phrase, model.pLike, and model.inter are all disjoint.
- components
- elements which can appear directly within texts or text divisions; this is a combination of the inter- and chunk- level elements defined above. These elements populate the class model.common, which is defined as a superset of the classes model.divPart, model.inter, and (when the dictionary module is included in a schema) model.entryLike.
Broadly speaking, the front, body, and back of a text each comprises a series of components, optionally grouped into divisions.
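The relationships among these groupings can be made concrete with ordinary set operations. The sketch below is illustrative only: each class is given a tiny sample of members drawn from the kinds of element named above (paragraphs, lists, quotations, highlighted phrases, titles), the dictionary module is assumed not to be loaded, and the two helper functions are hypothetical names.

```python
# Small illustrative samples, not the full membership of each class.
MODEL_PLIKE = frozenset({"p"})               # chunks
MODEL_INTER = frozenset({"list", "quote"})   # inter-level elements
MODEL_PHRASE = frozenset({"hi", "title"})    # phrase-level elements

# In this sample, model.divPart is populated only via model.pLike.
MODEL_DIVPART = MODEL_PLIKE

# Components: chunks plus inter-level elements.
MODEL_COMMON = MODEL_DIVPART | MODEL_INTER


def allowed_directly_in_division(element: str) -> bool:
    """True if the element may appear between chunks, as a child of a div."""
    return element in MODEL_COMMON


def allowed_within_chunk(element: str) -> bool:
    """True if the element may appear inside a chunk such as a paragraph."""
    return element in (MODEL_PHRASE | MODEL_INTER)


for name in ("p", "list", "hi"):
    print(name, allowed_directly_in_division(name), allowed_within_chunk(name))
```

In this sample a list is accepted in both positions, a paragraph only directly within a division, and a highlighted phrase only within a chunk, while the three underlying sets remain disjoint.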
As noted above, some elements do not belong to any model class, and some model classes are not readily associated with any of the above informal groupings. However, over two-thirds of the 582 elements defined in the present edition of these Guidelines are classified in this way, and future editions of these recommendations will extend and develop this classification scheme.
A complete alphabetical list of all model classes is provided in Appendix A Model Classes.
1.4 Macros
The infrastructure module defined by this chapter also declares a number of macros, or shortcut names for frequently occurring parts of other declarations. Macros are used in two ways in the TEI scheme: to stand for frequently-encountered content models, or parts of content models (1.4.1 Standard Content Models); and to stand for attribute datatypes (1.4.2 Datatype Specifications).
1.4.1 Standard Content Models
As far as possible, the TEI schemas use the following set of frequently-encountered content models to help achieve consistency among different elements.
- macro.paraContent (paragraph content) defines the content of paragraphs and similar elements.
- macro.limitedContent (paragraph content) defines the content of prose elements that are not used for transcription of extant materials.
- macro.phraseSeq (phrase sequence) defines a sequence of character data and phrase-level elements.
- macro.phraseSeq.limited (limited phrase sequence) defines a sequence of character data and those phrase-level elements that are not typically used for transcribing extant documents.
- macro.specialPara ('special' paragraph content) defines the content model of elements such as notes or list items, which either contain a series of component-level elements or else have the same structure as a paragraph, containing a series of phrase-level and inter-level elements.
- macro.xtext (extended text) defines a sequence of character data and gaiji elements.
The present version of the TEI Guidelines includes some 582 different elements. Table 4 shows, in descending order of frequency, the six most commonly used content models.
Content model | Number of elements using this | Description |
macro.phraseSeq | 83 | defines a sequence of character data and phrase-level elements. |
macro.paraContent | 53 | defines the content of paragraphs and similar elements. |
macro.specialPara | 33 | defines the content model of elements such as notes or list items, which either contain a series of component-level elements or else have the same structure as a paragraph, containing a series of phrase-level and inter-level elements. |
macro.phraseSeq.limited | 25 | defines a sequence of character data and those phrase-level elements that are not typically used for transcribing extant documents. |
macro.xtext | 10 | defines a sequence of character data and gaiji elements. |
macro.limitedContent | 7 | defines the content of prose elements that are not used for transcription of extant materials. |
1.4.2 Datatype Specifications
The values which attributes may take in a TEI schema are defined, for the most part, by reference to a TEI datatype specification. Each such specification is defined in terms of other primitive datatypes, derived mostly from W3C Schema Datatypes, literal values, or other datatypes. This indirection makes it possible for a TEI application to set constraints either globally or in individual cases, by redefining the datatype definition or the reference to it respectively. In some cases, the TEI datatype includes additional usage constraints which cannot be enforced by existing schema languages, although a TEI-compliant processor should attempt to validate them (see further discussion in chapter 23.4 Conformance).
The following element is used to define a TEI datatype:
- dataSpec (datatype specification) documents a datatype.
TEI-defined datatypes may be grouped into those which define normalized values for numeric quantities, probabilities, or temporal expressions, those which define various kinds of shorthand codes or keys, and those which define pointers or links.
The following datatypes are used for attributes which are intended to hold normalized values of various kinds. First, expressions of quantity or probability:
- teidata.certainty defines the range of attribute values expressing a degree of certainty.
- teidata.probability defines the range of attribute values expressing a probability.
- teidata.numeric defines the range of attribute values used for numeric values.
- teidata.interval defines attribute values used to express an interval value.
- teidata.count defines the range of attribute values used for a non-negative integer value used as a count.
Examples of attributes using the teidata.probability datatype include degree on damage or certainty; examples of teidata.numeric include quantity on members of the att.measurement class or value on numeric; examples of teidata.count include cols on cell and table.
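As an informal illustration of how such constraints might be checked outside a schema processor, the following sketch tests candidate values for teidata.count and teidata.probability. It rests on assumptions that should be checked against the formal datatype specifications: a count is taken to be an optional plus sign followed by ASCII digits, and a probability to be a plain decimal number between 0 and 1 inclusive (exponent notation is not handled). The function names are hypothetical.

```python
import re

# Simplified lexical forms; the W3C datatypes permit more variants.
_COUNT = re.compile(r"\+?[0-9]+")
_DECIMAL = re.compile(r"[0-9]+(?:\.[0-9]*)?|\.[0-9]+")


def is_count(value: str) -> bool:
    """True if value looks like a non-negative integer used as a count."""
    return _COUNT.fullmatch(value) is not None


def is_probability(value: str) -> bool:
    """True if value is a decimal number in the assumed range 0 to 1."""
    if _DECIMAL.fullmatch(value) is None:
        return False
    return 0.0 <= float(value) <= 1.0


for candidate in ("3", "-1", "0.5", "1.5"):
    print(candidate, is_count(candidate), is_probability(candidate))
```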
Next, the datatypes used for attributes which are intended to hold normalized dates or times, durations, truth values, and language identifiers:
- teidata.duration.w3c defines the range of attribute values available for representation of a duration in time using W3C datatypes.
- teidata.temporal.w3c defines the range of attribute values expressing a temporal expression such as a date, a time, or a combination of them, that conform to the W3C XML Schema Part 2: Datatypes Second Edition specification.
- teidata.truthValue defines the range of attribute values used to express a truth value.
- teidata.xTruthValue (extended truth value) defines the range of attribute values used to express a truth value which may be unknown.
- teidata.language defines the range of attribute values used to identify a particular combination of human language and writing system.
Note that in each of these cases the values used are those recommended by existing international standards: ISO 8601 as profiled by XML Schema Part 2: Datatypes Second Edition in the case of durations, times, and dates; W3C Schema datatypes in the case of truth values; and BCP 47 in the case of language.
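The two truth-value datatypes lend themselves to a brief sketch. We assume here that teidata.truthValue accepts the four lexical forms of the W3C boolean datatype (true, false, 1, and 0), and that teidata.xTruthValue additionally accepts the values unknown and inapplicable; both assumptions should be verified against the datatype specifications. The function `parse_x_truth_value` is a hypothetical helper.

```python
from typing import Optional

_TRUE = frozenset({"true", "1"})
_FALSE = frozenset({"false", "0"})
# Assumed additional values for the extended datatype.
_UNDETERMINED = frozenset({"unknown", "inapplicable"})


def is_truth_value(value: str) -> bool:
    """True if value is one of the assumed W3C boolean lexical forms."""
    return value in _TRUE or value in _FALSE


def parse_x_truth_value(value: str) -> Optional[bool]:
    """Map an extended truth value to True, False, or None (undetermined)."""
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    if value in _UNDETERMINED:
        return None
    raise ValueError(f"not an extended truth value: {value!r}")


for candidate in ("true", "0", "unknown"):
    print(candidate, is_truth_value(candidate), parse_x_truth_value(candidate))
```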
The following datatypes have more specialized uses:
- teidata.namespace defines the range of attribute values used to indicate XML namespaces as defined by the W3C Namespaces in XML Technical Recommendation.
- teidata.namespaceOrName defines attribute values which contain either an absolute namespace URI or a qualified XML name.
- teidata.outputMeasurement defines a range of values for use in specifying the size of an object that is intended for display.
- teidata.pattern defines attribute values which are expressed as a regular expression.
- teidata.point defines the data type used to express a point in cartesian space.
- teidata.pointer defines the range of attribute values used to provide a single URI, absolute or relative, pointing to some other resource, either within the current document or elsewhere.
- teidata.authority defines attribute values which derive from an authority list, which may be an enumerated list defined in the document's schema, a list or taxonomy elsewhere in the document, or an online taxonomy, gazetteer, or other authority.
- teidata.version defines the range of attribute values which may be used to specify a TEI or Unicode version number.
- teidata.versionNumber defines the range of attribute values used for version numbers.
- teidata.replacement defines attribute values which contain a replacement template.
- teidata.xpath defines attribute values which contain an XPath expression.
By far the largest number of TEI attributes take values which are coded values or names of some kind. These values may be constrained or defined in a number of different ways, each of which is given a different name, as follows:
- teidata.word defines the range of attribute values expressed as a single word or token.
- teidata.text defines the range of attribute values used to express some kind of identifying string as a single sequence of Unicode characters possibly including whitespace.
- teidata.name defines the range of attribute values expressed as an XML Name.
- teidata.enumerated defines the range of attribute values expressed as a single XML name taken from a list of documented possibilities.
- teidata.sex defines the range of attribute values used to identify human or animal sex.
- teidata.xmlName defines attribute values which contain an XML name.
- teidata.prefix defines a range of values that may function as a URI scheme name.
Attributes of type teidata.word, such as age on person, are used to supply an identifier expressed as any kind of single token or word. The TEI places a few constraints on the characters which may be used for this purpose: only Unicode characters classified as letters, digits, punctuation characters, or symbols can appear in an attribute value of this kind. Note in particular that such values cannot include whitespace characters. Legal values include cholmondeley, été, 1234, e_content, or xml:id, but not grand wazoo. Attributes of this kind are sometimes used to associate (by co-reference) elements of different types.
Where identifiers are defined externally, for example as part of a database or file system, the inability to include whitespace or other special characters in a value may be problematic. In other cases, it may also be simply more convenient to supply a short sequence of natural language words including spaces as a single value. For these reasons, we also provide a datatype teidata.text which does permit whitespace and indeed any other Unicode character. Legal values include cholmondeley, été, 1234, e-content, xml:id, and grand wazoo. This datatype should be used with care since XML will not normalize whitespace characters within it: for example the values n="a  b" (two spaces) and n="a   b" (three spaces) would be considered distinct. This case should be distinguished from that of an attribute permitting multiple values, each of which may be separated by whitespace which will be normalized (see further 22.5.3.1 Datatypes).
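The distinction can be observed with any XML parser. The following sketch uses Python's standard `xml.etree.ElementTree` and a made-up two-item document: the two attribute values, differing only in the number of spaces, are reported as different strings, whereas treating each as a whitespace-separated list of tokens yields the same result.

```python
import xml.etree.ElementTree as ET

# Two values of n that differ only in the number of spaces between a and b.
doc = ET.fromstring('<list><item n="a  b"/><item n="a   b"/></list>')
first, second = (item.get("n") for item in doc)

# As single text values, the two are distinct.
values_distinct = first != second

# Read as lists of tokens, whitespace is only a separator.
tokens_equal = first.split() == second.split()

print(repr(first), repr(second), values_distinct, tokens_equal)
```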
Attributes of type teidata.name are similar to those of type teidata.word, but with the additional constraint that they must be legal XML identifiers, as defined by the XML 1.0 specification, or successors. Hence, they may not begin with digits or punctuation characters. Legal identifiers include cholmondeley, été, e_content, or xml:id, but not grand wazoo or 1234. Attributes of this kind are typically used to represent XML element or attribute names.
Attributes of type teidata.xmlName are similar to those of type teidata.name, but with the additional constraint that they must not contain a colon character (:, U+003A). Thus attributes of this kind are used to represent XML element or attribute names that do not have a namespace prefix.
Attributes of type teidata.prefix, such as ident of prefixDef, are restricted to strings that form legal URI prefixes.[4] Examples of valid values are http, https, tn3270, xmlrpc.beep, and view-source.
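The constraints described in the preceding paragraphs can be approximated in a few lines of code, which may help in comparing the datatypes with one another. The sketch below is deliberately simplified and all function names are hypothetical: the check for teidata.word follows the prose description (letters, digits, punctuation, and symbols) using Unicode general categories; the check for teidata.name is a rough approximation of the XML Name production, not a full implementation; and the check for teidata.prefix assumes a lower-case letter followed by lower-case letters, digits, plus signs, full stops, or hyphens.

```python
import re
import unicodedata


def is_word(value: str) -> bool:
    """Non-empty; only letters (L), numbers (N), punctuation (P), symbols (S)."""
    return bool(value) and all(
        unicodedata.category(ch)[0] in "LNPS" for ch in value
    )


def is_name(value: str) -> bool:
    """Rough approximation of an XML Name (simplified character ranges)."""
    if not value or not (value[0].isalpha() or value[0] in "_:"):
        return False
    return all(ch.isalnum() or ch in "_:.-" for ch in value[1:])


def is_xml_name(value: str) -> bool:
    """An XML name, as approximated above, without any colon."""
    return is_name(value) and ":" not in value


# Assumed pattern for a URI scheme name, lower case only.
_SCHEME = re.compile(r"[a-z][a-z0-9+.\-]*")


def is_prefix(value: str) -> bool:
    """True if value could function as a URI scheme name."""
    return _SCHEME.fullmatch(value) is not None


for candidate in ("cholmondeley", "1234", "xml:id", "grand wazoo"):
    print(candidate, is_word(candidate), is_name(candidate), is_xml_name(candidate))
```

Applied to the examples given above, these checks accept 1234 as a word but not as a name, accept xml:id as a name but not as a colon-free XML name, and reject grand wazoo in every case.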
Attributes of type teidata.enumerated, such as new on shift or evidence supplied by att.editLike, have the same definition as teidata.word above, with the added constraint that the word supplied is taken from a specific list of possibilities. In each case, the element or class specification which includes the definition for the attribute will also contain a list of possible values, together with a prose description of their intended significance. This list may be open (in which case the list is advisory), or closed (in which case it determines the range of legal values). In this latter case, the datatype will not be teidata.enumerated, but an explicit list of the possible values.
An attribute may, of course, take more than one value of a given type, for example a list of pointer values, or a list of words. In the TEI scheme, this information is regarded as a property of the datatype element used to document the attribute in question rather than as a distinct ‘datatype’, and is provided by the minOccurs or maxOccurs attribute. See further 22.5.3.1 Datatypes.
In a small number of cases, an attribute may take a value of either one datatype or another. These cases are considered as distinct datatypes:
- teidata.probCert defines a range of attribute values which can be expressed either as a numeric probability or as a coded certainty value.
- teidata.unboundedInt defines an attribute value which can be either any non-negative integer or the string "unbounded".
- teidata.nullOrName defines attribute values which contain either the null string or an XML name.
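Each of these alternation datatypes can be thought of as a test that succeeds if either of two simpler tests succeeds. The sketch below illustrates this under several assumptions: the coded certainty values are taken to be high, medium, low, and unknown; a probability is taken to be a plain decimal between 0 and 1; and an XML name is approximated by a simplified regular expression. All function names are hypothetical.

```python
import re

# Assumed set of coded certainty values.
_CERTAINTY = frozenset({"high", "medium", "low", "unknown"})
_DECIMAL = re.compile(r"[0-9]+(?:\.[0-9]*)?|\.[0-9]+")
_NON_NEGATIVE_INT = re.compile(r"[0-9]+")
# Simplified approximation of an XML Name.
_NAME = re.compile(r"(?:[^\W\d]|:)[\w.\-:]*")


def is_prob_cert(value: str) -> bool:
    """A numeric probability (assumed range 0 to 1) or a coded certainty."""
    if value in _CERTAINTY:
        return True
    return _DECIMAL.fullmatch(value) is not None and 0.0 <= float(value) <= 1.0


def is_unbounded_int(value: str) -> bool:
    """Any non-negative integer or the string "unbounded"."""
    return value == "unbounded" or _NON_NEGATIVE_INT.fullmatch(value) is not None


def is_null_or_name(value: str) -> bool:
    """The null (empty) string or something that looks like an XML name."""
    return value == "" or _NAME.fullmatch(value) is not None


print(is_prob_cert("0.75"), is_prob_cert("high"), is_prob_cert("maybe"))
print(is_unbounded_int("12"), is_unbounded_int("unbounded"), is_unbounded_int("-1"))
print(is_null_or_name(""), is_null_or_name("persName"), is_null_or_name("1abc"))
```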
1.5 The TEI Infrastructure Module
The tei module defined by this chapter is a required component of any TEI schema. It provides declarations for all datatypes, and initial declarations for the attribute classes, model classes, and macros used by other modules in the TEI scheme. Its components are listed below in alphabetical order:
- Module tei: Declarations for classes, datatypes, and macros available to all TEI modules
- Classes defined: att.ascribed att.ascribed.directed att.breaking att.cReferencing att.canonical att.citing att.damaged att.datable att.datable.w3c att.datcat att.declarable att.declaring att.dimensions att.divLike att.docStatus att.duration.iso att.duration.w3c att.editLike att.edition att.formula att.fragmentable att.global att.global.rendition att.global.responsibility att.global.source att.handFeatures att.internetMedia att.interpLike att.measurement att.media att.naming att.notated att.partials att.personal att.placement att.pointing att.pointing.group att.ranging att.resourced att.scoping att.segLike att.sortable att.spanning att.styleDef att.timed att.transcriptional att.translatable att.typed att.written model.addrPart model.addressLike model.annotationLike model.annotationPart.body model.applicationLike model.attributable model.availabilityPart model.biblLike model.biblPart model.castItemPart model.catDescPart model.certLike model.choicePart model.common model.correspActionPart model.correspContextPart model.correspDescPart model.dateLike model.descLike model.describedResource model.dimLike model.div1Like model.div2Like model.div3Like model.div4Like model.div5Like model.div6Like model.div7Like model.divBottom model.divBottomPart model.divGenLike model.divLike model.divPart model.divTop model.divTopPart model.divWrapper model.editorialDeclPart model.egLike model.emphLike model.encodingDescPart model.entryPart model.entryPart.top model.eventLike model.featureVal model.featureVal.complex model.featureVal.single model.frontPart model.frontPart.drama model.gLike model.global model.global.edit model.global.meta model.glossLike model.graphicLike model.headLike model.hiLike model.highlighted model.imprintPart model.inter model.lLike model.lPart model.labelLike model.limitedPhrase model.linePart model.listLike model.measureLike model.milestoneLike model.msItemPart model.msQuoteLike model.nameLike model.nameLike.agent model.noteLike model.objectLike model.oddDecl 
model.oddRef model.offsetLike model.orgPart model.orgStateLike model.pLike model.pLike.front model.pPart.data model.pPart.edit model.pPart.editorial model.pPart.msdesc model.pPart.transcriptional model.persStateLike model.personLike model.personPart model.phrase model.phrase.xml model.placeLike model.placeNamePart model.placeStateLike model.profileDescPart model.ptrLike model.publicationStmtPart.agency model.publicationStmtPart.detail model.quoteLike model.resource model.respLike model.segLike model.settingPart model.sourceDescPart model.specDescLike model.stageLike model.standOffPart model.teiHeaderPart model.textDescPart model.titlepagePart
- Macros defined: macro.limitedContent macro.paraContent macro.phraseSeq macro.phraseSeq.limited macro.specialPara macro.xtext teidata.authority teidata.certainty teidata.count teidata.duration.iso teidata.duration.w3c teidata.enumerated teidata.interval teidata.language teidata.name teidata.namespace teidata.namespaceOrName teidata.nullOrName teidata.numeric teidata.outputMeasurement teidata.pattern teidata.point teidata.pointer teidata.prefix teidata.probCert teidata.probability teidata.replacement teidata.sex teidata.temporal.iso teidata.temporal.w3c teidata.text teidata.truthValue teidata.unboundedInt teidata.version teidata.versionNumber teidata.word teidata.xTruthValue teidata.xmlName teidata.xpath
The order in which declarations are made within the infrastructure module is critical, since several class declarations refer to others, which must therefore precede them. Other constraints on the order of declarations derive from the way in which the modularity of the TEI scheme is implemented in different schema languages. The XML DTD fragment implementing this TEI module makes extensive use of parameter entities and marked sections to effect a kind of conditional construction; the RELAX NG schema fragment similarly predeclares a number of patterns with null (‘notAllowed’) values. These issues are further discussed in chapter 23.5 Implementation of an ODD System.
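The ordering constraint among class declarations amounts to a topological sort of a dependency graph. The following sketch uses Python's standard `graphlib` module (available from Python 3.9) and the four member classes of model.pPart.data mentioned earlier in this chapter. It assumes, for the purpose of illustration, that the declaration of a member class refers to its superclass, so that the superclass must be declared first; the actual dependencies in the TEI source may be more varied, and this is not how the TEI schemas are in fact generated.

```python
from graphlib import TopologicalSorter

# Each class is mapped to the set of classes its declaration refers to.
# Assumption: a member class refers to its superclass.
REFERS_TO = {
    "model.addressLike": {"model.pPart.data"},
    "model.dateLike": {"model.pPart.data"},
    "model.measureLike": {"model.pPart.data"},
    "model.nameLike": {"model.pPart.data"},
}

# static_order() yields each class only after everything it refers to.
declaration_order = list(TopologicalSorter(REFERS_TO).static_order())

for position, class_name in enumerate(declaration_order, start=1):
    print(position, class_name)
```

Any ordering produced in this way places model.pPart.data before its four member classes; the relative order of the member classes themselves is not significant.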