5 The TEI base tag sets

To construct a view of the TEI DTD, the user must always choose one base tag set. Six of these are currently defined, for documents which are predominantly prose, verse, drama, transcribed speech, dictionaries, or terminological databases. Two more are provided for use with texts which combine these basic tag sets.

The choice of a base tag set determines the basic structure of all the documents with which it is to be used, reflecting the fact that subelements likely to appear within a dictionary (for example) will be entirely different in kind from those likely to appear within a letter or a novel, and even more so from those likely to be found in a transcription of spoken language. To cater for this variety, the constituents of all divisions of a TEI <text> element are not defined explicitly, but in terms of parameter entities. The mechanism used is to provide definitions like the following within the DTD, one of which the user must over-ride by supplying an appropriate declaration in the DTD subset:

<!ENTITY % TEI.prose "IGNORE">
<!ENTITY % TEI.dictionary "IGNORE">

The body of the main DTD contains a series of alternative definitions, each enclosed within an SGML marked section named after the base which it defines, as in this simplified example:

<![ %TEI.prose [
<!-- This definition is in force when the prose base is selected -->
<!-- Its effect is to define component as either paragraph or list -->
<!ENTITY % component "p|list" >
]]>

<![ %TEI.dictionary [
<!-- This definition is in force when the dictionary base is selected -->
<!-- Its effect is to define component as entry alone -->
<!ENTITY % component "entry" >
]]>

<!-- This definition is always in force -->
<!-- Its effect is to define component.seq as one or more of -->
<!-- whatever definition of component is currently in force -->
<!ENTITY % component.seq "(%component)+">

Within the body of the DTD, elements are defined using these parameter entities only, for example:

<!ELEMENT div - - ((%component.seq)+)>

To select a base tag set, a declaration such as the following should be supplied within the DTD subset for the document:

<!ENTITY % TEI.prose "INCLUDE">

This will over-ride the declaration within the TEI DTD itself, because the DTD subset is read first and, in SGML, the first declaration of an entity is the one which takes effect. If no base is declared, the DTD will not compile.
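
As an illustration, a complete document prolog selecting the prose base might look like the following sketch. The document type name TEI.2 and the system file name tei2.dtd are assumptions made for the example; only the entity declaration inside the DTD subset is taken from the text above.

```sgml
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!-- DTD subset: read before the main DTD, so this declaration -->
<!-- takes precedence over the "IGNORE" default given there    -->
<!ENTITY % TEI.prose "INCLUDE">
]>
```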

The value of the parameter entity called component.seq will thus differ in different bases. In this way it is possible for the divisions of a text using the drama base (for example) to consist of speeches and stage directions, while those of a text using the dictionary base will consist of lexical entries.
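
By analogy with the simplified prose and dictionary definitions shown earlier, the drama case might be sketched as follows. The entity name TEI.drama and the element names sp (speech) and stage (stage direction) are assumed here for illustration; the real definition is certainly richer than this.

```sgml
<![ %TEI.drama [
<!-- This definition is in force only when the drama base is selected -->
<!-- Its effect is to define component as either speech or stage direction -->
<!ENTITY % component "sp|stage" >
]]>
```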

5.1 Textual Divisions

Although the actual components may differ, in almost any kind of text the components are themselves grouped into higher level `divisions'. These higher level units may be called variously `chapters', `sections', `subdivisions', `acts' or `parts', but all seem to behave in more or less the same way: they are incomplete in themselves, and nested hierarchically. In the TEI scheme all such objects are therefore regarded as the same kind of element, called here a division.

A type attribute may be used to distinguish amongst divisions in some respect other than their hierarchic position: the values for this attribute (as for several others in the TEI scheme) are not standardized, precisely because no consensus exists, or is likely to exist, as to a generic typology. A set of legal values should however be defined for a given application, either in the TEI Header or by a user-defined modification.
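
A hedged illustration of how typed divisions might appear in a document instance is given below. The values `part' and `chapter' are invented for the example, since no standard set exists, and the sketch assumes that the full scheme (unlike the simplified content model shown earlier) allows a division to contain a heading and further divisions.

```sgml
<div type="part">
<head>Part the First</head>
<!-- Same element at every level; only the type value differs -->
<div type="chapter">
<head>Chapter One</head>
<p>First paragraph of the chapter.</p>
</div>
</div>
```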

In the normal case, the components of all divisions in a particular base are homogeneous: they all use the same value for component.seq. However, the scheme also allows for two kinds of heterogeneity. If the general base is selected, together with two or more other bases, then different divisions of a text may have different constituents, though each division must itself be homogeneous. A mixed base is also defined, in which components from any selection of bases may be combined promiscuously across division boundaries.
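
For example, a text in which some divisions are prose and others verse might select the general base along these lines. The entity names TEI.general and TEI.verse are assumed by analogy with TEI.prose, and the document type name and file name are likewise hypothetical.

```sgml
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!-- The general base, plus each base whose components are wanted -->
<!ENTITY % TEI.general "INCLUDE">
<!ENTITY % TEI.prose   "INCLUDE">
<!ENTITY % TEI.verse   "INCLUDE">
]>
```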

This approach applies equally to the encoding of smaller units: rather than attempt to enumerate all the different analytic units which particular disciplines might find necessary, the TEI proposes two generic segmentation elements: one (<s>) for simple end-to-end segmentation, such as that commonly used in language corpora, roughly corresponding to the notion of orthographic sentence; the other (<seg>) for segments which can potentially self-nest. In either case, a type attribute may be used to distinguish different kinds of segment.
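
The difference between the two elements can be sketched as follows. The type values `clause' and `phrase' are invented for the example, and the sketch assumes that <seg> is permitted inside <s>.

```sgml
<p>
<!-- <s> elements follow one another end to end and do not nest -->
<s>The first orthographic sentence.</s>
<s>A second sentence, in which <seg type="clause">a clause
<seg type="phrase">and a phrase nested within it</seg></seg>
have been marked.</s>
</p>
```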

5.2 The TEI Class System and Modification Mechanisms

Textual features, and hence the elements which encode them, may be categorized or classified in a number of ways. The TEI scheme identifies two kinds of classification scheme: attribute classes and model classes; both are used for broadly similar purposes.

Members of an attribute class share the same set of attributes. For example, all elements which represent links or associations between one element and another do so using a common set of attributes, defined by the pointer attribute class.

Members of a model class share the same structural properties: that is, they may appear at the same position within the SGML document structure. For example, the class divtop contains all elements (headings, epigraphs etc.) which can appear at the start of a textual division; all elements used to mark editorial corrections or omissions are members of the class edit; elements marking bibliographic citations etc. are all members of the class bibl and so on.

Elements may of course be members of more than one class. Classes may have super- and sub-classes, and properties (notably associated attributes) may be inherited. Classes are defined in the TEI DTD by means of parameter entities, and used extensively for DTD maintenance, documentation, and extension.

The TEI scheme supports three kinds of user modification: new elements may be added into existing classes, and existing elements renamed or undefined. These operations are carried out in a controlled manner, using the class system and without any need for extensive revision of the TEI DTD itself.

The process of adding a new element to a class may be illustrated as follows. Consider the model class divtop mentioned above. Simplifying somewhat, this class is defined as follows:

<!ENTITY % x.divtop "">
<!ENTITY % m.divtop "%x.divtop head | byline | epigraph">

To add a new element (say, <keywords>) to this class, enabling it to appear anywhere in the content model that other members of the class do, all that is needed is to re-define the `x-entity' within the document type subset:

<!ENTITY % x.divtop "keywords |">

Note the trailing vertical bar, which is required. As it happens, the element <keywords> is already defined in the TEI scheme (within the header); if it were not, an element declaration would also be necessary.
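
Where the new element is not already part of the scheme, the same redefinition would be accompanied by an element declaration. In this sketch the element name synopsis and its content model are hypothetical, chosen only to show the shape of the two declarations.

```sgml
<!-- Add the hypothetical element <synopsis> to the divtop class -->
<!ENTITY % x.divtop "synopsis |">
<!-- <synopsis> is not defined elsewhere, so it must be declared too -->
<!ELEMENT synopsis - - (#PCDATA)>
```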

Parameter entities are also used to effect the two other kinds of modification mentioned above: the ability to undefine elements, and the ability to rename them.

Within the main TEI DTD, each element definition and its associated attribute list specification is enclosed by a marked section with the same name as the element, the default value for which is "INCLUDE". Thus, to undefine the element <mentioned>, all that is needed is a declaration like the following in the DTD subset:

<!ENTITY % mentioned "IGNORE">

A similar declaration may be used to rename any element; for example, to rename <p> as <para>:

<!ENTITY % n.p "para">

This works because all references to the <p> element throughout the TEI DTD are made indirectly, using the n.p entity. Furthermore, the original name for an element is recoverable by an SGML application, because it forms the value of a global attribute teiform whose value is declared as FIXED.
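
A simplified sketch of how declarations inside the main DTD could be arranged so that both operations work is given below. The entity name n.mentioned follows the pattern of n.p and is an assumption, as are the content model and the exact form of the teiform declaration.

```sgml
<!-- Defaults in the main DTD; either may be over-ridden in the DTD subset -->
<!ENTITY % mentioned   "INCLUDE">
<!ENTITY % n.mentioned "mentioned">

<![ %mentioned; [
<!-- Skipped entirely if %mentioned; has been declared as "IGNORE" -->
<!ELEMENT %n.mentioned; - - (#PCDATA)>
<!-- The original name survives any renaming as a fixed attribute value -->
<!ATTLIST %n.mentioned; teiform CDATA #FIXED "mentioned">
]]>
```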

All user-defined modifications of this kind are regarded as forming an additional tag set, which is embedded within the DTD in the same way as any other tag set, i.e. by enabling the TEI.extensions parameter entities. In this way a TEI document can make explicit the extent and nature of any modification required in the base TEI scheme for its processing. An auxiliary tag set is also provided for the documentation of additional SGML elements in a way compatible with that used for the rest of the scheme.
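
For instance, assuming that the relevant parameter entities are called TEI.extensions.ent and TEI.extensions.dtd (the exact names are not given above) and that a project keeps its modifications in two hypothetical files, the DTD subset might read:

```sgml
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!ENTITY % TEI.prose "INCLUDE">
<!-- Entity over-rides such as x.divtop or n.p are kept in the first file -->
<!ENTITY % TEI.extensions.ent SYSTEM "project.ent">
<!-- Declarations for any new elements are kept in the second -->
<!ENTITY % TEI.extensions.dtd SYSTEM "project.dtd">
]>
```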

5.3 The global attributes

One particularly important class is the global attribute class. By default the following attributes are members of this class and may therefore be supplied for all elements in the TEI scheme:

id provides an SGML identifier for an element
n provides a possibly non-unique name or number for an element
lang specifies the language and hence the writing system used for an element
rend provides information about the rendering of an element where this is not otherwise specified

This list may be extended: for example, selecting the additional tag set for analysis will add analytic attributes to the above list. The id and n attributes allow for the identification of any element occurrence within a TEI-conformant text. Elements carrying an id attribute value may be the object of a link or cross-reference, or any of the other re-structuring mechanisms proposed by the TEI for circumventing the rigidly hierarchic structure of a simple SGML DTD. The fact that the requirement for such links is usually unpredictable is one reason for making this attribute global.

Values on id attributes must be unique (their declared value is ID). Values on the n attribute however need not be; they may be used to carry a TEI canonical reference. A method for defining the structure of such canonical reference schemes is also provided, so that documents using it can be processed automatically.
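
The contrast between the two attributes might be illustrated as follows. The identifier values are invented, and the <ref> element with its target attribute is assumed here as one example of a linking element.

```sgml
<div id="ch3" n="3" type="chapter">
<!-- id values are unique in the document; n values may recur elsewhere -->
<p id="ch3p1" n="1">The opening paragraph of the third chapter.</p>
<p id="ch3p2" n="2">A cross-reference back to the
<ref target="ch3p1">opening paragraph</ref>.</p>
</div>
```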

The lang attribute indicates the language, and hence the writing system, applicable to the element's content, thus providing explicit support for polyglot or multiscript texts. If no value is given, that of the element's direct parent is assumed. (A number of TEI attributes have this characteristic, which is catered for by a TEI-defined keyword.) The value of this attribute identifies a special purpose <language> element which documents the language in use, optionally associating it with an external entity in which a formal writing system declaration (WSD) may be given.
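
A minimal sketch of this arrangement follows. The identifier values are invented, the wsd attribute naming the external entity is an assumption about how the association is expressed, and <foreign> is assumed as a suitable element for a phrase in another language.

```sgml
<!-- In the header: one <language> element for each language used -->
<language id="en">English</language>
<!-- wsd is assumed to name an external entity holding the WSD -->
<language id="grc" wsd="wsd.grc">Koine Greek, using TLG Beta Code</language>

<!-- In the text: lang points at a <language> element -->
<p lang="en">The paragraph is in English, except for
<foreign lang="grc">a phrase in Beta Code</foreign>; elements which
carry no lang value take that of their parent.</p>
```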

A WSD defines a language/writing system pair (for example, ``Koine Greek, using TLG Beta Code''), and is formally specified by an auxiliary DTD which allows each character to be systematically defined and documented, in terms of existing international or other standards, public or private entity sets, ad hoc transliteration schemes or explicit definitions, as well as combinations of all four.

Finally, the global rend attribute may be used to give information about the physical presentation of the text in the source, where this is not otherwise given. A default rendition may be specified for all elements of a given type. No specific set of values is defined for this attribute in the current draft, though it is probable that some suitable set of DSSSL primitives will be proposed in a later version.

It should be stressed that the rend attribute is not intended for use as a means of specifying the desired formatting of an element, except insofar as this may be determined by a desire to mimic the approximate appearance of the original text. Like other SGML applications, the TEI scheme attempts to provide elements for the encoding of those textual features deemed essential to a productive use of the encoded text; however, unlike most other SGML applications, the TEI scheme recognizes that for some, it is precisely the appearance of a text which is the object of research.
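
By way of illustration only, rend might be used as follows to record how the source looks. The values `centred' and `italic' are invented, since no set of values is defined, and <hi> is assumed as an element for text which is typographically distinct.

```sgml
<!-- rend records the appearance of the source, not the desired output -->
<head rend="centred">THE PREFACE</head>
<p>The word <hi rend="italic">must</hi> is italicised in the source.</p>
```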
