TEI P5: Guidelines for Electronic Text Encoding and Interchange

1.

XML was originally developed as a way of publishing on the World Wide Web richly encoded documents such as those for which the TEI was designed. Several TEI participants contributed heavily to the development of XML, most notably XML's senior co-editor C. M. Sperberg-McQueen, who served as the North American editor for the TEI Guidelines from their inception until 1999.

↵

2.

In the ‘continuous writing’ characteristic of manuscripts from the early classical period, words are written continuously with no intervening spaces or punctuation.

↵

3.

New textbooks about XML appear at regular intervals and to select any one of them would be invidious. A useful list of pointers to introductory web sites is available from http://www.xml.org/xml/resources_focus_beginnerguide.shtml; recommended online courses include http://www.w3schools.com/xml/default.asp and http://www.ibm.com/developerworks/edu/x-dw-xmlintro-i.html.

↵

4.

We do not here discuss in any detail the ways that a stylesheet can be used or defined, nor do we discuss the popular W3C Stylesheet Languages XSLT and CSS. See further Berglund (ed.) (2006), Clark (ed.) (1999), and Lie and Bos (eds.) (1999).

↵

5.

See Extensible Markup Language (XML) 1.0, available from http://www.w3.org/TR/REC-xml, Section 2.2 Characters.

↵

6.

ISO/IEC 10646-1993 Information Technology — Universal Multiple-Octet Coded Character Set (UCS)

↵

7.

See http://www.unicode.org/

↵

8.

Because the opening angle bracket has this special function in an XML document, special steps must be taken to use that character for other purposes (for example, as the mathematical less-than operator); see further section Character References.

↵

9.

The example is taken from William Blake's Songs of innocence and experience (1794).

↵

10.

The element names here have been chosen for clarity of exposition; there is, however, a TEI element corresponding to each, so that this example may be regarded as TEI conformable in the sense that this term is defined in 23.3 Conformance.

↵

11.

Note that this simple example has not addressed the problem of marking elements such as sentences explicitly; the implications of this are discussed in section v.4. Complicating the issue.

↵

12.

The older terms Document Type Declaration and Document Type Definition, both abbreviated as DTD, may also be encountered. Throughout these Guidelines we use the term schema for any kind of formal document grammar.

↵

13.

ISO/IEC FDIS 19757-2 Document Schema Definition Language (DSDL) -- Part 2: Regular-grammar-based validation -- RELAX NG

↵

14.

See further 22 Documentation Elements and 23.4 Implementation of an ODD System. In practice, the only part of a TEI element specification not expressed using TEI-defined syntax is the content model for an element, which is expressed using the RELAX NG schema language for reasons of processing convenience. RELAX NG uses its own XML vocabulary to define content models, which is adopted by the TEI for the same purpose.

↵

15.

For a good tutorial introduction to RELAX NG, see van der Vlist (2004).

↵

16.

In XML, a single colon may also appear in a GI, where it has a special significance related to the use of namespaces, as further discussed in section Namespaces. The characters defined by Unicode as combining characters and as extenders are also permitted, as are logograms such as Chinese characters.

↵

17.

It will not have escaped the astute reader that the fact that verse paragraphs need not start on a line boundary seriously complicates the issue; see further section v.4. Complicating the issue.

↵

18.

This is however a rather artificial example; XPath, for example, provides ways of distinguishing elements in an XML structure by their position without the need to give them distinct names.
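As a minimal sketch of selection by position, the following Python fragment uses the limited XPath subset supported by the standard library's xml.etree.ElementTree. The element names (poem, stanza, line), the placeholder text, and the helper select_line are assumptions made for illustration; they are not taken from the example in the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every line uses the same element name.
SAMPLE = """
<poem>
  <stanza><line>stanza 1, line 1</line><line>stanza 1, line 2</line></stanza>
  <stanza><line>stanza 2, line 1</line><line>stanza 2, line 2</line></stanza>
</poem>
"""


def select_line(xml_text: str, stanza: int, line: int) -> str:
    """Return the text of a line addressed purely by position (1-based)."""
    root = ET.fromstring(xml_text)
    # Positional predicates pick out elements that share a single name.
    found = root.find(f"./stanza[{stanza}]/line[{line}]")
    if found is None:
        raise LookupError("no such line")
    return found.text


print(select_line(SAMPLE, 2, 1))
```

Because the position is carried by the path expression, the markup itself does not need distinct names such as line1 or line2.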

↵

19.

The official specification is at Clark and DeRose (eds.) (1999); many introductory tutorials are available in the XML references cited above and elsewhere on the Web: good beginners' tutorials include http://www.w3schools.com/xpath/default.asp and http://www.zvon.org/xxl/XPathTutorial/, the latter being available in several languages.

↵

20.

See Renear et al. (1996).

↵

21.

In the unlikely event that both kinds of quotation marks are needed within the quoted string, either or both can also be presented in escaped form, using the predefined character entities &apos; or &quot;.
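The following Python sketch illustrates the escaped form, using the predefined entities &apos; and &quot; for the two kinds of quotation mark. It relies on the standard library's xml.sax.saxutils.escape; the helper name escape_attribute_value, the note element, and the sample string are assumptions made for illustration.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Map both kinds of quotation mark to the predefined XML entities.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_attribute_value(value: str) -> str:
    """Escape &, <, > and both quotation marks for use in an attribute value."""
    return escape(value, QUOTE_ENTITIES)


original = "She said \"it's here\""
escaped = escape_attribute_value(original)
print(escaped)

# Round trip: a parser turns the entity references back into the characters.
element = ET.fromstring(f'<note text="{escaped}"/>')
assert element.get("text") == original
```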

↵

22.

The word ‘anyURI’ is a predefined name, used in schema languages to mean that any Uniform Resource Identifier (URI) may be supplied here. The accepted syntax for URIs is an Internet Standard, defined in http://tools.ietf.org/html/rfc3986. anyURI is one of the datatypes defined by the W3C Schema datatype library.

↵

23.

The W3C Recommendation is defined at http://www.w3.org/Graphics/SVG/.

↵

24.

And, indeed, for those responsible for deciding the licensing conditions if they change their minds later.

↵

25.

DSDL is a project of ISO/IEC JTC 1/SC 34 WG 1, the object of which is to ‘bring together different validation-related tasks and expressions to form a single extensible framework that allows technologies to work in series or in parallel to produce a single or a set of validation results. The extensibility of DSDL accommodates validation technologies not yet designed or specified.’ (http://dsdl.org).

↵

26.

http://www.w3.org/TR/xinclude/.

↵

27.

Currently BCP 47 comprises two Internet Engineering Task Force documents, referred to separately as RFC 4646 and RFC 4647; over time, other IETF documents may succeed these as the best current practice.

↵

28.

This will exclude all attributes where a non-textual datatype has been specified, for example tokens, boolean values or predefined value lists.

↵

29.

Although only Unicode is mentioned here explicitly, it should be noted that the character repertoire and assigned code points of Unicode and the ISO standard 10646 are identical and maintained in a way that ensures this continues to be the case.

↵

30.

The World Wide Web Consortium provides recommendations for two standard stylesheet languages: either CSS or XSL could be used for this purpose.

↵

31.

In essence, when an SGML parser encounters a reference to an entity of type SDATA, it supplies the application it is servicing with the name of that entity, as found in the document, plus a pointer to a location somewhere on the local system. What is present at that location may in turn allow or instruct the application to do one of a number of things, including looking up the entity name in a table and deriving information about the referenced entity that can trigger behaviours appropriate to the processing of that abstract character. There is, however, no way to make an XML parser do anything of the kind in response to an entity reference.

↵

32.

Available at http://www.w3.org/TR/charmod.

↵

33.

Available at http://www.unicode.org/reports/tr15/.

↵

34.

http://www.unicode.org/ucd/

↵

35.

For further details, see The Unicode Character Property Model (Unicode Technical Report #23), at http://www.unicode.org/reports/tr23/.

↵

36.

The use of ‘surrogate’ values to represent code points beyond the 16-bit range is passed over here, since it adds a complication that does not affect the key points at issue.
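For readers who want to see what the surrogate mechanism involves, the following Python sketch applies the standard UTF-16 arithmetic to one code point. The function name to_surrogate_pair and the choice of U+1D11E as the sample character are assumptions made for illustration.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the 16-bit range have surrogates")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # carries the bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D11E)  # U+1D11E MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own UTF-16 encoder.
assert chr(0x1D11E).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```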

↵

1.

The colon is also by default a valid name character; however, it has a specific purpose in XML (to indicate namespace prefixes), and may not therefore be used in any other way within a name.

↵

2.

In former editions of these Guidelines, such elements were known metaphorically as ‘crystals’.

↵

3.

Note that in this context, phrase means any string of characters, and can apply to individual words, parts of words, and groups of words indifferently; it does not refer only to linguistically-motivated phrasal units. This may cause confusion for readers accustomed to applying the word in a more restrictive sense.

↵

4.

For more information on this highly influential family of standards, first proposed in 1969 by the International Federation of Library Associations, see http://www.ifla.org/VII/s13/pubs/isbd.htm. On the relation between the TEI proposals and other standards for bibliographic description, see further section 2.7 Note for Library Cataloguers.

↵

5.

Agencies compiling catalogues of machine-readable files are recommended to use available authority lists, such as the Library of Congress Name Authority List, for all common personal names.

↵

6.

This constraint is not however enforced by the current version of the TEI Guidelines.

↵

7.

In the case of a TEI corpus (15 Language Corpora), a tagsDecl in a corpus header will describe tag usage across the whole corpus, while one in an individual text header will describe tag usage for the individual text concerned.

↵

8.

On the milestone tag itself, what are here referred to as ‘variables’ are identified by the combination of the ed and unit attributes.

↵

9.

Although the way in which a spoken text is performed (for example, the voice quality, loudness, etc.) might be regarded as analogous to ‘highlighting’ in this sense, these Guidelines recommend distinct elements for the encoding of such ‘highlighting’ in spoken texts. See further section 8.3.6 Shifts.

↵

10.

The Oxford English Dictionary documents the phrase to come down in the sense ‘to bring or put down; esp. to lay down money; to make a disbursement’ as being in use, mostly in colloquial or humorous contexts, from at least 1700 to the latter half of the 19th century.

↵

11.

In some contexts, the term regularization has a narrower and more specific significance than that proposed here: the reg element may be used for any kind of regularization, including normalization, standardization, and modernization.

↵

12.

The datatypes are taken from the W3C Recommendation XML Schema Part 2: Datatypes Second Edition. The permitted datatypes are:

There is one exception: these Guidelines permit a time to be expressed as only a number of hours, or as a number of hours and minutes, as per ISO 8601:2004 sections 4.2.2.3 and 4.3.3. The W3C time and dateTime datatypes require that the minutes and seconds be included in the normalized value if they are to be correctly processed, for example when sorting.
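A minimal sketch of the normalization this implies: a time given as hours only, or as hours and minutes, is padded to the full hh:mm:ss form so that values compare and sort consistently. The function name normalize_time is hypothetical, and the sketch deliberately ignores time zones, fractional seconds, and range checking.

```python
import re

# Hours, optionally followed by minutes, optionally followed by seconds.
_TIME = re.compile(r"(\d{2})(?::(\d{2}))?(?::(\d{2}))?")


def normalize_time(value: str) -> str:
    """Pad 'hh' or 'hh:mm' to the full 'hh:mm:ss' form."""
    match = _TIME.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    # Components that were omitted default to zero.
    hours, minutes, seconds = match.groups(default="00")
    return f"{hours}:{minutes}:{seconds}"


for raw in ("14", "14:30", "14:30:15"):
    print(raw, "->", normalize_time(raw))
```

Sorting on the normalized value, for example with sorted(values, key=normalize_time), then treats the short forms and the full form uniformly.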

↵

13.

Many encoders find it convenient to retain the line breaks of the original during data entry, to simplify proofreading, but this may be done without inserting a tag for each line break of the original.

↵

14.

For example, to distinguish London as an author's name from London as a place of publication or as a component of a title.

↵

15.

Among the bibliographic software systems and subsystems consulted in the design of the biblStruct structure were BibTeX, Scribe, and ProCite. The distinctions made by all three may be preserved in biblStruct structures, though the nature of their design prevents a simple one-to-one mapping from their data elements to TEI elements. For further information, see section 3.11.4 Relationship to Other Bibliographic Schemes.

↵

16.

The analysis is not wholly unproblematic: as the text of the standard points out, the first subordinate title is subordinate only to the parallel title in French, while the second is subordinate to both the English main title and the French parallel title, without this relationship being made clear, either in the markup given in the example or in the reference structure offered by the standard.

↵

17.

The BibTeX scheme is intentionally compatible with that of Scribe, although it omits some fields used by Scribe. Hence only one list of fields is given here.

↵

18.

This decision should be recorded in the samplingDecl element of the header.

↵

19.

As with all lists of ‘suggested values’ for attributes, it is recommended that software written to handle TEI-conformant texts be prepared to recognize and handle these values when they occur, without limiting the user to the values in this list.

↵

20.

Specifically, characters in the Unicode blocks Alphabetic Presentation Forms, Arabic Presentation Forms-A, Arabic Presentation Forms-B, Letterlike Symbols, and Number Forms.

↵

21.

It should be kept in mind that any kind of text encoding is an abstraction and an interpretation of the text at hand, which will not necessarily be useful in reproducing an exact facsimile of the appearance of a manuscript.

↵

22.

For discussion of other attributes of this class, see 4.1.4 Partial and Composite Divisions.

↵

23.

As elsewhere in these Guidelines, this example has been formatted for clarity of exposition rather than correct display. Note in particular that whether an XML processor retains whitespace within the seg element or not (this can be configured by means of the xml:space attribute) this example will still require additional processing, since white space should be retained for the lower level seg elements (those of type syll) but not for the higher level one (those of type foot).

↵

24.

For a discussion of several of these see Edwards and Lampert (eds.) (1993); Johansson (1994); and Johansson et al. (1991).

↵

25.

The original is a conversation between two children and their parents, recorded in 1987, and discussed in MacWhinney (1988)

↵

26.

For the most part, the examples in this chapter use no sentence punctuation except to mark the rising intonation often found in interrogative statements; for further discussion, see section 8.4.3 Regularization of Word Forms.

↵

27.

The term was apparently first proposed by Loman and Jørgensen (1971), where it is defined as follows: ‘A text can be analysed as a sequence of segments which are internally connected by a network of syntactic relations and externally delimited by the absence of such relations with respect to neighbouring segments. Such a segment is a syntactic unit called a macrosyntagm’ (trans. S. Johansson).

↵

28.

We refer the reader to previous and current discussions of a common format for encoding dictionaries: for example, Amsler and Tompa (1988); Calzolari et al. (1990); Fought and Van Ess-Dykema; Ide and Veronis (1995); Ide et al. (1993); Ide et al. (1992); DANLEX Group (1987); Tutin and Veronis (1998); and Ide et al. (2000).

↵

29.

Tana de Gámez, ed., Simon and Schuster's International Dictionary (New York: Simon and Schuster, 1973).

↵

30.

Complications of sequence caused by marginal or interlinear insertions and deletions, which are frequent in manuscripts, or by unconventional page layouts, as in concrete poetry, magazines with imaginative graphic designers, and texts about the nature of typography as a medium, typically do not occur in dictionaries, and so are not discussed here.

↵

31.

This is a slight oversimplification. Even in conservative transcriptions, it is common to omit page numbers, signatures of gatherings, running titles and the like. The simple description above also elides, for the sake of simplicity, the difficulties of assigning a meaning to the phrase ‘original sequence’ when it is applied to the printed characters of a source text; the ‘original sequence’ retained or recovered from a conservative transcription of the editorial view is, of course, the one established during the transcription by the encoder.

↵

32.

The omission of rendition text is particularly common in systems for document production; it is considered good practice there, since automatic generation of rendition text is more reliable and more consistent than attempting to maintain it manually in the electronic text.

↵

33.

This chapter is based on the work of the European MASTER (Manuscript Access through Standards for Electronic Records) project, funded by the European Union from January 1999 to June 2001, and led by Peter Robinson, then at the Centre for Technology and the Arts at De Montfort University, Leicester (UK). Significant input also came from a TEI Workgroup headed by Consuelo W. Dutschke of the Rare Book and Manuscript Library, Columbia University (USA) and Ambrogio Piazzoni of the Biblioteca Apostolica Vaticana (IT) during 1998-2000.

↵

34.

The coordinate space may be thought of as a grid superimposed on a rectangular space. Rectangular areas of the grid are defined as four numbers a b c d: the first two identify the grid point which is at the upper left corner of the rectangle; the second two give the grid point located at the lower right corner of the rectangle. The grid point a b is understood to be the point which is located a points from the origin along the x (horizontal) axis, and b points from the origin along the y (vertical) axis.
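The four-number description of a rectangle can be sketched in Python as follows. The class and field names (Rect, ulx, uly, lrx, lry) and the function parse_rect are assumptions made for illustration, as is the convention that y values grow downward from an origin at the upper left, which is what makes the computed height positive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """A rectangular area of the grid, given by two corner points."""
    ulx: int  # a: upper left corner, distance along the x (horizontal) axis
    uly: int  # b: upper left corner, distance along the y (vertical) axis
    lrx: int  # c: lower right corner, distance along the x axis
    lry: int  # d: lower right corner, distance along the y axis

    @property
    def width(self) -> int:
        return self.lrx - self.ulx

    @property
    def height(self) -> int:
        return self.lry - self.uly


def parse_rect(value: str) -> Rect:
    """Parse four whitespace-separated numbers 'a b c d' into a Rect."""
    parts = value.split()
    if len(parts) != 4:
        raise ValueError("expected four numbers: a b c d")
    return Rect(*(int(p) for p in parts))


area = parse_rect("10 20 110 70")
print(area, area.width, area.height)
```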

↵

35.

The coordinate space used here is based on pixels, but the mapping between pixels and units in the coordinate space need not be one-to-one; it might be convenient to define a more delicate grid, to enable us to address much smaller parts of the image. This can be done simply by supplying appropriate values for the attributes which define the coordinate space; for example doubling them all would map each pixel to two grid points in the coordinate space.
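The effect of doubling can be sketched as a pair of conversions between pixel positions and grid points. The function names and the scale parameter are hypothetical; the sketch assumes the same scale factor on both axes and an origin shared by the image and the grid.

```python
def pixel_to_grid(x: int, y: int, scale: int = 2) -> tuple[int, int]:
    """Map a pixel position to its first grid point, for a grid that has
    `scale` grid points per pixel along each axis."""
    return x * scale, y * scale


def grid_to_pixel(gx: int, gy: int, scale: int = 2) -> tuple[int, int]:
    """Map a grid point back to the pixel that contains it."""
    return gx // scale, gy // scale


print(pixel_to_grid(100, 40))   # pixel position expressed on the finer grid
print(grid_to_pixel(201, 81))   # a grid point inside that same pixel
```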

↵

36.

The image is taken from the collection at http://ancilla.unice.fr/Illustr.html, and was digitized from a copy in the Bibliothèque Municipale de Lyon, by whose kind permission it is included here.

↵

37.

The manuscript contains several other substitutions, ignored here for the sake of clarity.

↵

38.

For the sake of legibility in the example, long marks over vowels are omitted.

↵

39.

In the module described by chapter 22 Documentation Elements a similar method is used to link element descriptions to the modules or classes to which they belong, for example.

↵

40.

Strictly, a suitable value such as figurative should be added to the two place names which are presented periphrastically in the second example here, in order to preserve the distinction indicated by the choice of rs rather than name to encode them in the first version.

↵

41.

See http://earth-info.nga.mil/GandG/wgs84/index.html. The most recent revision of this standard is known as the Earth Gravity Model 1996.

↵

42.

The OGC is an international voluntary consensus standards organization whose members maintain the Geography Markup Language standard. The OGC coordinates with the ISO TC 211 standards organization to maintain consistency between OGC and ISO standards work. GML is also an ISO standard (ISO 19136:2007).

↵

43.

See http://code.google.com/apis/kml/documentation/index.html

↵

44.

Since no special purpose element is provided for this purpose by the current version of the Guidelines, such information should be provided as one or more distinct paragraphs at the end of the encodingDesc element described in section 2.3 The Encoding Description.

↵

45.

Schemes similar to that proposed here were developed in the 1960s and 1970s by researchers such as Hymes, Halliday, and Crystal and Davy, but have rarely been implemented; one notable exception being the pioneering work on the Helsinki Diachronic Corpus of English, on which see Kytö and Rissanen (1988).

↵

46.

It is particularly useful to define participants in a dramatic text in this way, since it enables the who attribute to be used to link sp elements to definitions for their speakers; see further section 7.2.2 Speeches and Speakers.

↵

47.

See in particular chapters 16 Linking, Segmentation, and Alignment, 17 Simple Analytic Mechanisms, and 18 Feature Structures.

↵

48.

We use the term alignment for a special case of the more general notion of correspondence. Let A be short for ‘an element with its attribute xml:id set to the value A’, and suppose that elements A1, A2, and A3 occur in that order and form one group, while elements B1, B2, and B3 occur in that order and form another group. Then a relation in which A1 corresponds to B1, A2 corresponds to B2, and A3 corresponds to B3 is an alignment. On the other hand, a relation in which A1 corresponds to B2, B1 to C2, and C1 to A2 is not an alignment.
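The note defines alignment only by example. One possible reading, offered here as an assumption rather than as the Guidelines' definition, is that a correspondence between two ordered groups is an alignment when it preserves the order of both groups. The function is_alignment below is a hypothetical Python sketch of that reading.

```python
def is_alignment(pairs, group_a, group_b):
    """Return True if `pairs` links members of group_a to members of group_b
    without crossing, i.e. the pairing preserves the order of both groups."""
    pos_a = {name: i for i, name in enumerate(group_a)}
    pos_b = {name: i for i, name in enumerate(group_b)}
    try:
        indexed = sorted((pos_a[a], pos_b[b]) for a, b in pairs)
    except KeyError:
        # A pair that leaves the two groups cannot align them.
        return False
    # Sorted by position in group A, positions in group B must also increase.
    return all(p[0] < q[0] and p[1] < q[1] for p, q in zip(indexed, indexed[1:]))


GROUP_A = ["A1", "A2", "A3"]
GROUP_B = ["B1", "B2", "B3"]

print(is_alignment([("A1", "B1"), ("A2", "B2"), ("A3", "B3")], GROUP_A, GROUP_B))
print(is_alignment([("A1", "B2"), ("B1", "C2"), ("C1", "A2")], GROUP_A, GROUP_B))
```

Under this reading the first relation in the note is accepted and the second is rejected, which matches the two cases the note describes.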

↵

49.

The type attribute on the note is used to classify the notes using the typology established in the Advertisement to the work: ‘The Imitations of the Ancients are added, to gratify those who either never read, or may have forgotten them; together with some of the Parodies, and Allusions to the most excellent of the Moderns.’ In the source text, the text of the poem shares the page with two sets of notes, one headed ‘Remarks’ and the other ‘Imitations’.

↵

50.

Since no special element is provided for this purpose in the present version of these Guidelines, the information should be supplied as a series of paragraphs at the end of the encodingDesc element described in section 2.3 The Encoding Description.

↵

51.

The URI (Uniform Resource Identifier) is defined in RFC 3986.

↵

52.

As in other XPointer schemes, bare names (i.e. values of xml:id references) are permitted as pointer arguments to all TEI-defined XPointer pointer scheme parameters.

↵

53.

Bare names (i.e., xml:id values) are permitted as range() parameters, as in other XPointer schemes.

↵

54.

As always seems to be the case, no two regular expression languages are precisely the same. For those used to Perl regular expressions, be warned that while in Perl the pattern tei matches any string that contains tei, in the W3C language it only matches the string ‘tei’.
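The difference can be illustrated with Python's re module, which is neither Perl's engine nor the W3C language but offers both behaviours: re.search succeeds if the pattern occurs anywhere in the string, while re.fullmatch succeeds only if the pattern accounts for the whole string. The function names and the sample strings are assumptions made for illustration.

```python
import re

PATTERN = "tei"


def matches_anywhere(text: str) -> bool:
    """Perl-like behaviour: succeed if the pattern occurs anywhere in the text."""
    return re.search(PATTERN, text) is not None


def matches_whole(text: str) -> bool:
    """W3C-like behaviour: succeed only if the pattern matches the entire text."""
    return re.fullmatch(PATTERN, text) is not None


for candidate in ("tei", "protein"):
    print(candidate, matches_anywhere(candidate), matches_whole(candidate))
```

To obtain the ‘contains’ behaviour in a language that matches the whole string, the pattern itself has to allow for surrounding text, for example .*tei.* instead of tei.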

↵

55.

See section 17.3 Spans and Interpretations, where the text from which this fragment is taken is analyzed.

↵

56.

The corresp attribute is thus distinct from the target attribute in that it is understood to create a double, rather than a single, link. It is also distinct from the targets attribute in that the latter lists all the identifiers of the elements that are doubly linked, whereas the corresp doubly links the element that bears the attribute with the element(s) that make up the value of the attribute.

↵

57.

See Gale and Church (1993), from which the example in the text is taken.

↵

58.

This sample is taken from a conversation collected and transcribed for the British National Corpus.

↵

59.

See section 17.1 Linguistic Segment Categories for discussion of the w and c tags that can be used in the following examples instead of the <seg type="word"> and <seg type="character"> tags.

↵

60.

An alternative way of representing this problem is discussed in chapter 21 Certainty, Precision, and Responsibility.

↵

61.

In this example, we have placed the link next to the elements that represent the alternants. It could also have been placed elsewhere in the document, perhaps within a linkGrp.

↵

62.

The variant readings are found in the commercial sheet music, the performance score, and the Broadway cast recording.

↵

63.

The version on which this text is based is the W3C Recommendation dated 20 December 2004.

↵

64.

This corresponds to the observation that overlapping XML tags reflecting a textual version of such an inclusion would not even be well-formed XML. This kind of overlap in textual phenomena of interest is in fact the major reason that stand-off markup is needed.

↵

65.

Or, as they are widely known, attribute-value pairs; this term should not be confused, however, with SGML or XML attributes and their values, which are similar in concept but distinct in their formal definitions.

↵

66.

Neither this constraint nor the requirement that the whole of the text be segmented by s elements is enforced by the current TEI schemas; such constraints may however be introduced in a later version of these Guidelines.

↵

67.

The rule marks spaces left for the missing name in the manuscript.

↵

68.

For the word-class tagging method used by CLAWS see Marshall (1983); for an overview of the system see Garside et al. (1991). The example sentence was processed using an online version of the CLAWS tagger at http://www.comp.lancs.ac.uk/ucrel/claws/trial.html.

↵

69.

The recommendations of this chapter have been adopted as ISO Standard 24610-1 Language Resource Management — Feature Structures — Part One: Feature Structure Representation

↵

70.

Ways of pointing to components of a TEI document without using an XML identifier are discussed in 16.2.1 Pointing Elsewhere

↵

71.

The treatment here is largely based on the characterizations of graph types in Chartrand and Lesniak (1986)

↵

72.

That is, the three syntactic interpretations of the clause are mutually exclusive. The notion that the pertinents are in Argyll is clearly not inconsistent with the notion that both the land in Gallachalzie and the pertinents are in Argyll. The graph given here describes the possible interpretations of the clause itself, not the sets of inferences derivable from each syntactic interpretation, for which it would be convenient to use the facilities described in chapter 18 Feature Structures.

↵

73.

Jackendoff (1977)

↵

74.

The symbols e and t denote special theoretical constructs (empty category and trace respectively), which need not concern us here.

↵

75.

It has been shown, however, that it is possible to relate the different annotations in an indirect way: if the textual content of the annotations is identical, the very text can serve as a means for linking the different annotations, as described in Witt (2002).

↵

76.

Grammar-based schema languages (e.g., DTD, W3C Schema, and RELAX NG) are used to define markup languages (e.g., XHTML or TEI). Rule-based schema languages (e.g., Schematron) can be used to define further constraints. Such a rule-based schema language permits a sequence of certain elements between empty elements to be legitimized or prohibited.

↵

77.

A fake namespace is given for XInclude here, to avoid the markup being interpreted literally during processing.

↵

78.

ODD is short for ‘One Document Does it all’, and was the name invented by the original TEI Editors for the predecessor of the system currently used for this purpose. See further Burnard and Sperberg-McQueen (1995) and Burnard and Rahtz (2004).

↵

79.

Excluding model.gLike is generally inadvisable however, since without it the resulting schema has no way of referencing non-Unicode characters.

↵

80.

This is not strictly the case, since the element egXML used to represent TEI examples has its own namespace, http://www.tei-c.org/ns/Examples; this is the only exception however.

↵

81.

Full namespace support does not exist in the DTD language, and therefore these techniques are available only to users of more modern schema languages such as RELAX NG or W3C Schema.

↵

82.

This module can be used to document any XML schema, and has indeed been used to document several non-TEI schemas.

↵

83.

Here and elsewhere we use the word schema to refer to any formal document grammar language, irrespective of the formalism used to represent it.

↵

84.

An ODD processor should recognize as erroneous such obvious inconsistencies as an attempt to include an elementSpec in add mode for an element which is already present in an imported module.

↵

85.

The carthago program behind the Pizza Chef application, written by Michael Sperberg-McQueen for TEI P3 and P4, went to very great efforts to get this right. The XSLT transformations used by the P5 Roma application are not as sophisticated, partly because the RELAX NG language is more forgiving than DTDs.

↵

86.

Note that deletion of required elements will cause the schema specification to accept as valid documents which cannot be TEI Conformant, since they no longer conform to the TEI abstract model; conformance topics are addressed in more detail in 23.3 Conformance.

↵

1.

TEI ED W69, available from the TEI web site at http://www.tei-c.org/Vault/ED/edw69.htm.

↵

2.

This Workgroup was jointly sponsored by the Association for History and Computing.

↵
