Notes
1.
XML was originally developed as a way of publishing on
the World Wide Web richly encoded documents such as those for which
the TEI was designed. Several TEI participants contributed heavily to
the development of XML, most notably XML's senior co-editor
C. M. Sperberg-McQueen, who served as the North American editor for
the TEI Guidelines from their inception until 1999.
↵
2.
In the
‘continuous writing’ characteristic of manuscripts from the early
classical period, words are written continuously with no intervening
spaces or punctuation.
↵
3.
New
textbooks about XML appear at regular intervals and to select any one
of them would be invidious. A useful list of pointers to introductory
web sites is available from http://www.xml.org/xml/resources_focus_beginnerguide.shtml;
recommended online courses include http://www.w3schools.com/xml/default.asp and http://www.ibm.com/developerworks/edu/x-dw-xmlintro-i.html.
↵
4.
We do not here discuss in
any detail the ways that a stylesheet can be used or defined, nor do
we discuss the popular W3C Stylesheet Languages XSLT and CSS. See
further Berglund (ed.) (2006), Clark (ed.) (1999), and
Lie and Bos (eds.) (1999).
↵
5.
See Extensible Markup
Language (XML) 1.0, available from http://www.w3.org/TR/REC-xml, Section 2.2
Characters.
↵
8.
Because
the opening angle bracket has this special function in an XML
document, special steps must be taken to use that character for other
purposes (for example, as the mathematical less-than operator); see
further section Character References.
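
As a minimal sketch of the kind of special step meant here, the following Python fragment uses the standard library function xml.sax.saxutils.escape to replace a literal less-than sign with the predefined entity reference, and then checks that a parser restores the original character. The sample text and the element name formula are illustrative assumptions, not taken from these Guidelines.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Illustrative text containing the mathematical less-than operator.
raw_text = "a < b"

# escape() rewrites the characters &, < and > as entity references.
escaped_text = escape(raw_text)

# "formula" is a hypothetical element name used only for this demonstration.
document = f"<formula>{escaped_text}</formula>"

# Parsing the document turns the reference back into the original character.
parsed_text = ET.fromstring(document).text

print(escaped_text)
print(parsed_text)
```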
↵
10.
The element names here have been chosen for
clarity of exposition; there is, however, a TEI element corresponding to
each, so that this example may be regarded as TEI conformable in the
sense that this term is defined in 23.3 Conformance.
↵
11.
Note that this simple example has not
addressed the problem of marking elements such as sentences
explicitly; the implications of this are discussed in section v.4. Complicating the issue.
↵
12.
The older terms
Document Type Declaration and Document Type
Definition, both abbreviated as DTD, may also be
encountered. Throughout these Guidelines we use the term
schema for any kind of formal document grammar.
↵
13.
ISO/IEC FDIS 19757-2 Document
Schema Definition Language (DSDL) -- Part 2: Regular-grammar-based
validation -- RELAX NG
↵
14.
See further 22 Documentation Elements and 23.4 Implementation of an ODD System. In practice, the only part of a TEI element
specification not expressed using TEI-defined syntax is the content
model for an element, which is expressed using the RELAX NG schema
language for reasons of processing convenience. RELAX NG uses its own
XML vocabulary to define content models, which is adopted by the TEI
for the same purpose.
↵
16.
In XML, a single colon may also
appear in a GI, where it has a special significance related to the use
of namespaces, as further discussed in section Namespaces. The characters defined by Unicode as
combining characters and as extenders are
also permitted, as are logograms such as Chinese characters.
↵
17.
It will not have escaped the astute reader
that the fact that verse paragraphs need not start on a line boundary
seriously complicates the issue; see further section v.4. Complicating the issue.
↵
18.
This is
however a rather artificial example; XPath, for example, provides ways of distinguishing
elements in an XML structure by their position without the need to
give them distinct names.
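
The point can be sketched with the XPath subset supported by Python's xml.etree.ElementTree module. The stanza and line element names and their content are hypothetical, chosen only to show identically named siblings being selected by position.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: three sibling elements that share one name.
stanza = ET.fromstring(
    "<stanza><line>first</line><line>second</line><line>third</line></stanza>"
)

# Positional predicates pick out siblings without giving them distinct names.
second_line = stanza.find("line[2]").text
last_line = stanza.find("line[last()]").text

print(second_line, last_line)
```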
↵
19.
The official specification is at Clark and DeRose (eds.) (1999); many
introductory tutorials are available in the XML references cited above
and elsewhere on the Web: good beginners' tutorials include http://www.w3schools.com/xpath/default.asp and http://www.zvon.org/xxl/XPathTutorial/, the latter being
available in several languages.
↵
21.
In the unlikely event that both kinds of quotation marks are needed within the
quoted string, either or both can also be presented in escaped form, using the
predefined character entities &apos; or &quot;.
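
A minimal sketch of this escaping in Python follows, assuming the standard library function xml.sax.saxutils.escape with an extra mapping for the two quotation marks. The attribute value and the element and attribute names (note, text) are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Illustrative value containing both kinds of quotation mark.
value = "He said \"don't\""

# Ask escape() to replace both quote characters with the predefined entities.
escaped_value = escape(value, {'"': "&quot;", "'": "&apos;"})

# "note" and "text" are hypothetical names used only for this demonstration.
document = f'<note text="{escaped_value}"/>'

# The parser restores the original string from the escaped form.
round_tripped = ET.fromstring(document).get("text")

print(escaped_value)
print(round_tripped)
```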
↵
22.
The word ‘anyURI’ is a predefined name, used in
schema languages to mean that any Uniform Resource
Identifier (URI) may be supplied here. The accepted syntax for
URIs is an Internet Standard, defined in http://tools.ietf.org/html/rfc3986. anyURI
is one of the datatypes defined by the W3C
Schema datatype library.
↵
24.
And, indeed, for those
responsible for deciding the licensing conditions if they change their
minds later.
↵
25.
DSDL is
a project of ISO/IEC JTC 1/SC 34 WG 1, the object of which is to
‘bring together different validation-related tasks and expressions
to form a single extensible framework that allows technologies to work
in series or in parallel to produce a single or a set of validation
results. The extensibility of DSDL accommodates validation
technologies not yet designed or specified.’ (http://dsdl.org).
↵
27.
Currently
BCP 47 comprises two Internet Engineering Task Force documents,
referred to separately as RFC 4646 and RFC 4647; over time, other
IETF documents may succeed these as the best current
practice.
↵
28.
This will exclude all
attributes where a non-textual datatype has been specified, for
example tokens, boolean values or predefined value lists.
↵
29.
Although only Unicode
is mentioned here explicitly, it should be noted that the
character repertoire and assigned code points of Unicode and
the ISO standard 10646 are identical and maintained in a way
that ensures this continues to be the case.
↵
30.
The World Wide
Web Consortium provides recommendations for two standard
stylesheet languages: either CSS or
XSL could be used for this purpose.
↵
31.
In essence, when an SGML parser
encounters a reference to an entity of type SDATA, it supplies
to the application which it is servicing the name of that
entity, as found in the document, plus a pointer to a location
somewhere on the local system, and what is present at that
location may in turn allow or instruct the application to do
one of a number of things, including looking up the entity name
in a table and deriving information about the referenced entity
which can trigger specific behaviours in the application
appropriate to the processing of that abstract character. There
is however no way to make an XML parser do anything of the kind
in response to an entity reference.
↵
35.
For
further details, see The Unicode Character Property
Model (Unicode Technical Report #23), at http://www.unicode.org/reports/tr23/.
↵
36.
The use of ‘surrogate’ values to represent code points
beyond the 16-bit range is passed over here, since it adds a
complication that does not affect the key points at
issue.
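
For readers who do want to see the complication, the following sketch shows the arithmetic UTF-16 uses to split a code point beyond the 16-bit range into a pair of surrogate values. The function name to_surrogate_pair and the choice of U+1D11E as a sample are assumptions made for illustration.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the 16-bit range need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # carries the top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # carries the bottom 10 bits of the offset
    return high, low


# U+1D11E (MUSICAL SYMBOL G CLEF) is an arbitrary sample outside the 16-bit range.
high, low = to_surrogate_pair(0x1D11E)

# Python's own UTF-16 encoder produces the same two 16-bit units.
encoded = chr(0x1D11E).encode("utf-16-be")

print(hex(high), hex(low), encoded.hex())
```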
↵
1.
The
colon is also by default a valid name character; however, it has a
specific purpose in XML (to indicate namespace prefixes), and may
not therefore be used in any other way within a name.
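
The role of the colon can be sketched with Python's xml.etree.ElementTree, which resolves a prefixed name into a namespace name plus a local name. The prefix tei is an arbitrary choice for this illustration; only the namespace name it is bound to matters.

```python
import xml.etree.ElementTree as ET

# The prefix before the colon is bound to a namespace name by the xmlns:tei attribute.
document = '<tei:p xmlns:tei="http://www.tei-c.org/ns/1.0">text</tei:p>'

element = ET.fromstring(document)

# ElementTree reports the resolved name in the form {namespace-name}local-name.
resolved_name = element.tag

print(resolved_name)
```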
↵
3.
Note that in this
context, phrase means any string of characters, and can
apply to individual words, parts of words, and groups of words
indifferently; it does not refer only to linguistically-motivated
phrasal units. This may cause confusion for readers accustomed to
applying the word in a more restrictive sense.
↵
4.
For more information on this highly influential family of standards, first
proposed in 1969 by the International
Federation of Library Associations, see http://www.ifla.org/VII/s13/pubs/isbd.htm.
On the relation between the TEI proposals and other standards for
bibliographic description, see further section 2.7 Note for Library Cataloguers.
↵
5.
Agencies compiling catalogues of
machine-readable files are recommended to use available authority lists,
such as the Library of Congress Name Authority List, for all common
personal names.
↵
7.
In the case
of a TEI corpus (15 Language Corpora), a tagsDecl in a corpus
header will describe tag usage across the whole corpus, while one in
an individual text header will describe tag usage for the individual
text concerned.
↵
8.
On the
milestone tag itself, what are here referred to as
‘variables’ are identified by the combination of the
ed and unit attributes.
↵
9.
Although the way in which a spoken text is performed
(for example, the voice quality, loudness, etc.) might be regarded as
analogous to ‘highlighting’ in this sense, these
Guidelines recommend distinct elements for the encoding of such
‘highlighting’ in spoken texts. See further section
8.3.6 Shifts.
↵
10.
The
Oxford English Dictionary documents the phrase to come
down in the sense ‘to bring or put down; esp. to lay down money; to make a disbursement’ as being in use, mostly in colloquial or humorous contexts, from at
least 1700 to the latter half of the 19th century.
↵
11.
In some
contexts, the term regularization has a
narrower and more specific significance than that proposed here: the
reg element may be used for any kind of regularization,
including normalization, standardization, and
modernization.
↵
12.
The datatypes are taken from the W3C Recommendation XML Schema Part 2: Datatypes Second Edition.
The permitted datatypes are:
There
is one exception: these Guidelines permit a time to be expressed as only a number of hours, or as a number of hours and minutes,
as per ISO 8601:2004 section 4.2.2.3 and 4.3.3.
The W3C time and dateTime
datatypes require that the minutes and seconds be included in the
normalized value if they are to be correctly processed, for example
when sorting.
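
A minimal sketch of such normalization follows, assuming values of the form hh, hh:mm, or hh:mm:ss only. The function name normalize_time is hypothetical, and the sketch deliberately ignores time zones, fractional seconds, and range checking.

```python
import re


def normalize_time(value):
    """Pad an hours-only or hours-and-minutes time to the full hh:mm:ss form."""
    match = re.fullmatch(r"(\d{2})(?::(\d{2}))?(?::(\d{2}))?", value)
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    # Components absent from the input default to "00".
    hours, minutes, seconds = match.groups(default="00")
    return f"{hours}:{minutes}:{seconds}"


# Reduced-precision values of the kind these Guidelines permit.
normalized = [normalize_time(v) for v in ["14", "14:30", "14:30:15"]]

print(normalized)
```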
↵
13.
Many encoders find it convenient to retain the line
breaks of the original during data entry, to simplify proofreading,
but this may be done without inserting a tag for each line break of
the original.
↵
14.
For example, to distinguish
London as an author's name from
London as a place of publication or as a
component of a title.
↵
15.
Among the bibliographic software systems
and subsystems consulted in the design of the biblStruct
structure were BibTeX, Scribe, and ProCite. The distinctions made by
all three may be preserved in biblStruct structures, though
the nature of their design prevents a simple one-to-one mapping from
their data elements to TEI elements. For further information, see
section 3.11.4 Relationship to Other Bibliographic Schemes.
↵
16.
The analysis is not wholly unproblematic: as the text of the
standard points out, the first subordinate title is subordinate only to
the parallel title in French, while the second is subordinate to both
the English main title and the French parallel title, without this
relationship being made clear, either in the markup given in the example
or in the reference structure offered by the standard.
↵
17.
The BibTeX scheme is
intentionally compatible with that of Scribe, although it omits some
fields used by Scribe. Hence only one list of fields is given
here.
↵
19.
As with all lists of ‘suggested
values’ for attributes, it is recommended that software
written to handle TEI-conformant texts be prepared to recognize and
handle these values when they occur, without limiting the user to the
values in this list.
↵
20.
Specifically,
characters in the Unicode blocks Alphabetic Presentation Forms, Arabic
Presentation Forms-A, Arabic Presentation Forms-B, Letterlike Symbols,
and Number Forms.
↵
21.
It should be kept in mind that any kind of text
encoding is an abstraction and an interpretation of the text at
hand, which will not necessarily be useful in reproducing an exact
facsimile of the appearance of a manuscript.
↵
23.
As elsewhere in these
Guidelines, this example has been formatted for clarity of exposition
rather than correct display. Note in particular that whether an XML
processor retains whitespace within the seg element or not
(this can be configured by means of the
xml:space attribute) this example will still require
additional processing, since white space should be retained for the lower level seg elements
(those of type syll) but not for the higher level
ones (those of type foot).
↵
24.
For a
discussion of several of these see Edwards and Lampert (eds.) (1993); Johansson (1994); and
Johansson et al. (1991).
↵
25.
The original is a conversation between two children and
their parents, recorded in 1987, and discussed in
MacWhinney (1988).
↵
26.
For
the most part, the examples in this chapter use no sentence punctuation
except to mark the rising intonation often found in interrogative
statements; for further discussion, see section 8.4.3 Regularization of Word Forms.
↵
27.
The term was
apparently first proposed by Loman and Jørgensen (1971),
where it is defined as follows: ‘A text can be analysed as a sequence
of segments which are internally connected by a network of syntactic
relations and externally delimited by the absence of such relations with
respect to neighbouring segments. Such a segment is a syntactic unit
called a macrosyntagm’ (trans. S. Johansson).
↵
28.
We refer the reader to previous and
current discussions of a common format for encoding dictionaries, for
example Amsler and Tompa (1988); Calzolari et al. (1990); Fought and Van Ess-Dykema; Ide and Veronis (1995); Ide et al. (1993); Ide et al. (1992); DANLEX Group (1987); Tutin and Veronis (1998); and Ide et al. (2000).
↵
29.
Tana de Gámez, ed., Simon and Schuster's International Dictionary (New
York: Simon and Schuster, 1973).
↵
30.
Complications of sequence caused by marginal or interlinear
insertions and deletions, which are frequent in manuscripts, or by
unconventional page layouts, as in concrete poetry, magazines with
imaginative graphic designers, and texts about the nature of typography
as a medium, typically do not occur in dictionaries, and so are not
discussed here.
↵
31.
This is a slight oversimplification. Even in conservative
transcriptions, it is common to omit page numbers, signatures of gatherings,
running titles and the like. The simple description above also elides, for the
sake of simplicity, the difficulties of assigning a meaning to the phrase
‘original sequence’ when it is applied to the printed characters of a
source text; the ‘original sequence’ retained or recovered from a
conservative transcription of the editorial view is, of course, the one
established during the transcription by the encoder.
↵
32.
The omission of rendition text is particularly common in systems
for document production; it is considered good practice there, since automatic
generation of rendition text is more reliable and more consistent than
attempting to maintain it manually in the electronic text.
↵
33.
This chapter is based on the work of
the European MASTER (Manuscript Access through Standards for
Electronic Records) project, funded by the European Union from January
1999 to June 2001, and led by Peter Robinson, then at the Centre for
Technology and the Arts at De Montfort University, Leicester
(UK). Significant input also came from a TEI Workgroup headed by
Consuelo W. Dutschke of the Rare Book and Manuscript Library, Columbia
University (USA) and Ambrogio Piazzoni of the Biblioteca Apostolica
Vaticana (IT) during 1998-2000.
↵
34.
The coordinate space
may be thought of as a grid superimposed on a rectangular
space. Rectangular areas of the grid are defined as four numbers a b c d: the first two identify the grid point which
is at the upper left corner of the rectangle; the second two give the
grid point located at the lower right corner of the rectangle. The
grid point a b is understood to be the point
which is located a points from the origin along
the x (horizontal) axis, and b points from the origin along the y (vertical) axis.
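
The four-number notation can be sketched as follows. The class name Rectangle, the function parse_rectangle, and the sample values are assumptions made for this illustration; the sketch also assumes, as the description of upper left and lower right corners implies, that y increases downwards.

```python
from typing import NamedTuple


class Rectangle(NamedTuple):
    left: float    # a: x of the upper left corner
    top: float     # b: y of the upper left corner
    right: float   # c: x of the lower right corner
    bottom: float  # d: y of the lower right corner

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.bottom - self.top


def parse_rectangle(text):
    """Read the four whitespace-separated numbers a b c d."""
    a, b, c, d = (float(part) for part in text.split())
    return Rectangle(a, b, c, d)


# Hypothetical area: upper left corner at (10, 20), lower right at (110, 70).
area = parse_rectangle("10 20 110 70")

print(area.width, area.height)
```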
↵
35.
The coordinate space used here is based on pixels, but
the mapping between pixels and units in the coordinate space need not
be one-to-one; it might be convenient to define a more delicate grid,
to enable us to address much smaller parts of the image. This can be
done simply by supplying appropriate values for the attributes which
define the coordinate space; for example doubling them all would map
each pixel to two grid points in the coordinate space.
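
The effect of doubling can be sketched as a simple per-axis scaling, assuming that the image and the coordinate space share an origin and that the mapping is linear. The function name pixel_to_grid and the image and grid sizes are hypothetical.

```python
def pixel_to_grid(pixel_x, pixel_y, image_size, grid_size):
    """Map a pixel position to the coordinate space by scaling each axis."""
    scale_x = grid_size[0] / image_size[0]
    scale_y = grid_size[1] / image_size[1]
    return pixel_x * scale_x, pixel_y * scale_y


# Hypothetical 400 x 300 pixel image with a coordinate space doubled to 800 x 600.
grid_point = pixel_to_grid(10, 20, image_size=(400, 300), grid_size=(800, 600))

print(grid_point)
```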
↵
36.
The image is taken
from the collection at http://ancilla.unice.fr/Illustr.html, and was digitized from a copy
in the Bibliothèque Municipale de Lyon, by whose kind permission it is
included here.
↵
39.
In the module described by
chapter 22 Documentation Elements a similar method is used to link element
descriptions to the modules or classes to which they belong, for
example.
↵
40.
Strictly, a suitable
value such as figurative should be added to the two place
names which are presented periphrastically in the second example here,
in order to preserve the distinction indicated by the choice of
rs rather than name to encode them in the first
version.
↵
41.
See http://earth-info.nga.mil/GandG/wgs84/index.html. The most
recent revision of this standard is known as the Earth Gravity Model
1996.
↵
42.
The OGC is an international voluntary consensus
standards organization whose members maintain the Geography Markup
Language standard. The OGC coordinates with the ISO TC 211 standards
organization to maintain consistency between OGC and ISO standards
work. GML is also an ISO standard (ISO
19136:2007).
↵
44.
Since no special purpose element is
provided for this purpose by the current version of the Guidelines,
such information should be provided as one or more distinct paragraphs
at the end of the encodingDesc element described in section
2.3 The Encoding Description.
↵
45.
Schemes similar to that proposed here were developed
in the 1960s and 1970s by researchers such as Hymes, Halliday, and
Crystal and Davy, but have rarely been implemented; one notable
exception being the pioneering work on the Helsinki Diachronic Corpus
of English, on which see Kytö and Rissanen (1988).
↵
46.
It is particularly useful to
define participants in a dramatic text in this way, since it enables the
who attribute to be used to link sp elements to
definitions for their speakers; see further section 7.2.2 Speeches and Speakers.
↵
47.
See in particular chapters
16 Linking, Segmentation, and Alignment, 17 Simple Analytic Mechanisms, and 18 Feature Structures.
↵
48.
We use the term alignment as a
special case of the more general notion of correspondence. Let A
stand as a short form for ‘an element with its attribute xml:id
set to the value A’, and suppose elements A1, A2,
and A3 occur in that order and form one group, while elements B1,
B2, and B3 occur in that order and form another group. Then a
relation in which A1 corresponds to B1, A2 corresponds to B2, and
A3 corresponds to B3 is an alignment. On the other hand, a
relation in which A1 corresponds to B2, B1 to C2, and C1 to A2 is
not an alignment.
↵
49.
The type
attribute on the note is used to classify the notes using the
typology established in the Advertisement to the work: ‘The
Imitations of the Ancients are
added, to gratify those who either never read, or may have
forgotten them; together with some of the Parodies, and
Allusions to the most excellent of the Moderns.’ In the
source text, the text of the poem shares the page with two sets
of notes, one headed ‘Remarks’ and the other
‘Imitations’.
↵
50.
Since no special element is
provided for this purpose in the present version of these
Guidelines, the information should be supplied as a series of
paragraphs at the end of the encodingDesc element
described in section 2.3 The Encoding Description.
↵
52.
Like other XPointer schemes, bare names (i.e. values of
xml:id references) are permitted as pointer arguments to
all TEI-defined XPointer pointer scheme parameters.
↵
53.
Bare names (i.e., xml:id
values), like other XPointer schemes, are permitted as range() parameters.
↵
54.
As always
seems to be the case, no two regular expression languages are
precisely the same. For those used to Perl regular expressions,
be warned that while in Perl the pattern tei
matches any string that contains tei, in
the W3C language it only matches the string ‘tei’.
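
The difference can be approximated in Python, whose re module is itself yet another regular expression language: re.search behaves like the unanchored Perl match, while re.fullmatch imitates the implicit anchoring of a W3C pattern. The test strings are illustrative, and this is a sketch of the anchoring behaviour only, not of the W3C pattern syntax as a whole.

```python
import re

pattern = "tei"

# Unanchored, as in Perl: the pattern may occur anywhere in the string.
perl_like = re.search(pattern, "the tei guidelines") is not None

# Implicitly anchored, as in the W3C language: the whole string must match.
w3c_like_partial = re.fullmatch(pattern, "the tei guidelines") is not None
w3c_like_exact = re.fullmatch(pattern, "tei") is not None

print(perl_like, w3c_like_partial, w3c_like_exact)
```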
↵
55.
See
section 17.3 Spans and Interpretations, where the text from which this
fragment is taken is analyzed.
↵
56.
The corresp attribute is thus distinct
from the target attribute in that it is understood
to create a double, rather than a single, link. It is also
distinct from the targets attribute in that the
latter lists all the identifiers of the elements that are
doubly linked, whereas the corresp doubly links the
element that bears the attribute with the element(s) that make
up the value of the attribute.
↵
58.
This sample is taken from
a conversation collected and transcribed for the British National
Corpus.
↵
59.
See section 17.1 Linguistic Segment Categories for discussion of the
w and c tags that can be used in the following
examples instead of the <seg type="word"> and <seg
type="character"> tags.
↵
60.
An alternative way of
representing this problem is discussed in chapter 21 Certainty, Precision, and Responsibility.
↵
61.
In this example, we have
placed the link next to the elements that represent the
alternants. It could also have been placed elsewhere in the document,
perhaps within a linkGrp.
↵
62.
The variant readings are found in the commercial sheet
music, the performance score, and the Broadway cast recording.
↵
64.
This corresponds to the observation
that overlapping XML tags reflecting a textual version of such an
inclusion would not even be well-formed XML. This kind of overlap
in textual phenomena of interest is in fact the major reason that
stand-off markup is needed.
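
A minimal sketch of this observation, using Python's xml.etree.ElementTree as the parser: the element names a, b, and text are hypothetical, and the helper is_well_formed is introduced only for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Report whether the parser accepts the string as well-formed XML."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Properly nested elements: b opens and closes inside a.
nested = "<text><a>one <b>two</b></a> three</text>"

# Overlapping elements: a closes while b is still open.
overlapping = "<text><a>one <b>two</a> three</b></text>"

print(is_well_formed(nested), is_well_formed(overlapping))
```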
↵
65.
Or, as they are widely known,
attribute-value pairs; this term should not be confused,
however, with SGML or XML attributes and their values, which are similar in
concept but distinct in their formal definitions.
↵
66.
Neither this
constraint, nor the requirement that the whole of the text be
segmented by s elements is enforced by the current TEI
schemas; such constraints may however be introduced in a later version
of these Guidelines.
↵
68.
For the word-class tagging method used by CLAWS see
Marshall (1983);
for an overview of the system see Garside et al. (1991). The example sentence was processed
using an online version of the CLAWS tagger at http://www.comp.lancs.ac.uk/ucrel/claws/trial.html.
↵
69.
The recommendations of this chapter have
been adopted as ISO Standard 24610-1 Language Resource
Management — Feature Structures — Part One: Feature Structure Representation.
↵
70.
Ways of pointing to components of a TEI document without
using an XML identifier are discussed in 16.2.1 Pointing Elsewhere.
↵
71.
The treatment here is largely based on the
characterizations of graph types in Chartrand and Lesniak (1986).
↵
72.
That is, the three syntactic
interpretations of the clause are mutually exclusive. The notion that
the pertinents are in Argyll is clearly not inconsistent with the notion
that both the land in Gallachalzie and the pertinents are in Argyll.
The graph given here describes the possible interpretations of the
clause itself, not the sets of inferences derivable from each syntactic
interpretation, for which it would be convenient to use the facilities
described in chapter 18 Feature Structures.
↵
74.
The symbols
e and t denote
special theoretical constructs (empty category and
trace respectively), which need not concern us here.
↵
75.
It has been shown, however, that it
is possible to relate the different annotations in an indirect
way: if the textual content of the annotations is identical,
the very text can serve as a means for linking the different
annotations, as described in Witt (2002).
↵
76.
Grammar-based schema languages (e.g., DTD, W3C
Schema, and RELAX NG) are used to define markup languages
(e.g., XHTML or TEI). Rule-based schema languages (e.g.,
Schematron) can be used to define further constraints. Such a
rule-based schema language permits a sequence of certain
elements between empty elements to be legitimized or
prohibited.
↵
77.
A fake namespace is
given for XInclude here, to avoid the markup being interpreted
literally during processing.
↵
78.
ODD
is short for ‘One Document Does it all’, and was the name
invented by the original TEI Editors for the predecessor of the system
currently used for this purpose. See further Burnard and Sperberg-McQueen (1995) and Burnard and Rahtz (2004).
↵
79.
Excluding model.gLike is
generally inadvisable however, since without it the resulting schema
has no way of referencing non-Unicode characters.
↵
80.
This is not strictly the case, since the element
egXML used to represent TEI examples has its own namespace,
http://www.tei-c.org/ns/Examples; this is the only
exception however.
↵
81.
Full namespace support does not
exist in the DTD language, and therefore these techniques are
available only to users of more modern schema languages such as RELAX
NG or W3C Schema.
↵
82.
This module can be used to document any XML schema, and
has indeed been used to document several non-TEI schemas.
↵
83.
Here and elsewhere we use the word
schema to refer to any formal document grammar
language, irrespective of the formalism used to represent it.
↵
84.
An ODD processor should recognize
as erroneous such obvious inconsistencies as an attempt to include an
elementSpec in add mode for an element which is already present
in an imported module.
↵
85.
The carthago program behind the Pizza Chef application,
written by Michael Sperberg-McQueen for TEI P3 and P4, went to very
great efforts to get this right. The XSLT transformations used by the
P5 Roma application are not as sophisticated, partly because the RELAX
NG language is more forgiving than DTDs.
↵
86.
Note that
deletion of required elements will cause the schema specification to
accept as valid documents which cannot be TEI Conformant, since they
no longer conform to the TEI abstract model; conformance topics are
addressed in more detail in 23.3 Conformance.
↵