v. A Gentle Introduction to XML
Table of contents
The encoding scheme defined by these Guidelines is formulated as an application of the Extensible Markup Language (XML) (Bray et al. (eds.) (2006)). XML is widely used for the definition of device-independent, system-independent methods of storing and processing texts in electronic form. It is now also the interchange and communication format used by many applications on the World Wide Web. In the present chapter we informally introduce some of its basic concepts and attempt to explain to the reader encountering them for the first time how and why they are used in the TEI scheme. More detailed technical accounts of TEI practice in this respect are provided in chapters 23 Using the TEI, 1 The TEI Infrastructure, and 22 Documentation Elements of these Guidelines.
Strictly speaking, XML is a metalanguage, that is, a language used to describe other languages, in this case, markup languages. Historically, the word markup has been used to describe annotation or other marks within a text intended to instruct a compositor or typist how a particular passage should be printed or laid out. Examples include wavy underlining to indicate boldface, special symbols for passages to be omitted or printed in a particular font, and so forth. As the formatting and printing of texts was automated, the term was extended to cover all sorts of special codes inserted into electronic texts to govern formatting, printing, or other processing.
Generalizing from that sense, we define markup, or (synonymously) encoding, as any means of making explicit an interpretation of a text. Of course, all printed texts are implicitly encoded (or marked up) in this sense: punctuation marks, capitalization, disposition of letters around the page, even the spaces between words all might be regarded as a kind of markup, the purpose of which is to help the human reader determine where one word ends and another begins, or how to identify gross structural features such as headings or simple syntactic units such as dependent clauses or sentences. Encoding a text for computer processing is, in principle, like transcribing a manuscript from scriptio continua 2; it is a process of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be (or has been) interpreted.
By markup language we mean a set of markup conventions used together for encoding texts. A markup language must specify how markup is to be distinguished from text, what markup is allowed, what markup is required, and what the markup means. XML provides the means for doing the first three; documentation such as these Guidelines is required for the last.
The present chapter attempts to give an informal introduction to those parts of XML of which a proper understanding is necessary to make best use of these Guidelines. The interested reader should also consult one or more of the many excellent introductory textbooks and web sites now available on the subject. 3
v.1. What's special about XML?TEI: What's special about XML?¶
- it places emphasis on descriptive rather than procedural markup;
- it distinguishes the concepts of syntactic correctness and of validity with respect to a document type definition;
- it is independent of any one hardware or software system.
- XML is extensible: it does not consist of a fixed set of tags;
- XML documents must be well-formed according to a defined syntax;
- an XML document can be formally validated against a schema of some kind;
- XML is more interested in the meaning of data than in its presentation.
Descriptive markupTEI: Descriptive markup¶
In a descriptive markup system, the markup codes used do little more than categorize parts of a document. Markup codes such as <para> or \end{list} simply identify a portion of a document and assert of it that ‘the following item is a paragraph’, or ‘this is the end of the most recently begun list’, etc. By contrast, a procedural markup system defines what processing is to be carried out at particular points in a document: ‘call procedure PARA with parameters 42, b, and x here’ or ‘move the left margin 2 quads left, move the right margin 2 quads right, skip down one line, and go to the new left margin,’ etc. In XML, the instructions needed to process a document for some particular purpose (for example, to format it) are sharply distinguished from the markup used to describe it.
Usually, the markup or other information needed to process a document will be maintained separately from the document itself, typically in a distinct document called a stylesheet, though it may do much more than simply define the rendition or visual appearance of a document. 4
When descriptive markup is used, the same document can readily be processed in many different ways, using only those parts of it which are considered relevant. For example, a content analysis program might disregard entirely the footnotes embedded in an annotated text, while a formatting program might extract and collect them all together for printing at the end of each chapter. Different kinds of processing can be carried out with the same part of a file. For example, one program might extract names of persons and places from a document to create an index or database, while another, operating on the same text, but using a different stylesheet, might print names of persons and places in a distinctive typeface.
Types of documentTEI: Types of document¶
A second key aspect of XML is its notion of a document type: documents are regarded as having types, just as other objects processed by computers do. The type of a document is formally defined by its constituent parts and their structure. The definition of a ‘report’, for example, might be that it consisted of a ‘title’ and possibly an ‘author’, followed by an ‘abstract’ and a sequence of one or more ‘paragraphs’. Anything lacking a title, according to this formal definition, would not formally be a report, and neither would a sequence of paragraphs followed by an abstract, whatever other report-like characteristics these might have for the human reader.
If documents are of known types, a special-purpose program (called a parser), once provided with an unambiguous definition of a document type, can check that any document claiming to be of that type does in fact conform to the specification. A parser can check that all elements specified for a particular document type are present and no others, that they are combined in appropriate ways, correctly ordered, and so forth. More significantly, different documents of the same type can be processed in a uniform way. Programs can be written which take advantage of the knowledge encapsulated in the document type information, and which can thus behave in a more ‘intelligent’ fashion.
Data independenceTEI: Data independence¶
A basic design goal of XML is to ensure that documents encoded according to its provisions can move from one hardware and software environment to another without loss of information. The two features discussed so far both address this requirement at an abstract level; the third feature addresses it at the level of the strings of data characters that make up a document. All XML documents, whatever languages or writing systems they employ, use the same underlying character encoding (that is, the same method of representing as binary data those graphic forms making up a particular writing system). 5 This encoding is defined by an international standard, 6 which is implemented by a universal character set maintained by an industry group called the Unicode Consortium, and known as Unicode. 7 Unicode provides a standardised way of representing any of the many thousands of discrete symbols making up the world's writing systems, past and present.
Most modern computing systems now support Unicode directly; for those which do not, XML provides a mechanism for the indirect representation of single characters by means of their character number, known as character references; see further Character References.
v.2. Textual structuresTEI: Textual structures¶
A text is not an undifferentiated sequence of words, much less of bytes. For different purposes, it may be divided into many different units, of different types or sizes. A prose text such as this one might be divided into sections, chapters, paragraphs, and sentences. A verse text might be divided into cantos, stanzas, and lines. Once printed, sequences of prose and verse might be divided into volumes, gatherings, and pages.
Structural units of this kind are most often used to identify specific locations or refer to points within a text (‘the third sentence of the second paragraph in chapter ten’; ‘canto 10, line 1234’; ‘page 412’, etc.) but they may also be used to subdivide a text into meaningful fragments for analytic purposes (‘is the average sentence length of section 2 different from that of section 5?’ ‘how many paragraphs separate each occurrence of the word nature? how many pages?’). Other structural units are more clearly analytic, in that they characterize a section of a text. A dramatic text might regard each speech by a different character as a unit of one kind, and stage directions or pieces of action as units of another kind. Such an analysis is less useful for locating parts of the text (‘the 93rd speech by Horatio in Act 2’) than for facilitating comparisons between the words used by one character and those of another, or those used by the same character at different points of the play.
In a prose text one might similarly wish to regard as units of different types passages in direct or indirect speech, passages employing different stylistic registers (narrative, polemic, commentary, argument, etc.), passages of different authorship and so forth. And for certain types of analysis (most notably textual criticism) the physical appearance of one particular printed or manuscript source may be of importance: paradoxically, one may wish to use descriptive markup to describe presentational features such as typeface, line breaks, use of whitespace and so forth.
These textual structures overlap with one other in complex and unpredictable ways. Particularly when dealing with texts as instantiated by paper technology, the reader needs to be aware of both the physical organization of the book and the logical structure of the work it contains. Many great works (Sterne's Tristram Shandy for example) cannot be fully appreciated without an awareness of the interplay between narrative units (such as chapters or paragraphs) and presentational ones (such as page divisions). For many types of research, the interplay among different levels of analysis is crucial: the extent to which syntactic structure and narrative structure mesh, or fail to mesh, for example, or the extent to which phonological structures reflect morphology.
v.3. XML structuresTEI: XML structures¶
This section describes the simple and consistent mechanism for the markup or identification of textual structure provided by XML. It also describes the methods XML provides for the expression of rules defining how units of textual structure can meaningfully be combined in a text.
ElementsTEI: Elements¶
The technical term used in XML for a textual unit, viewed as a structural component, is element. Different types of elements are given different names, but XML provides no way of expressing the meaning of a particular type of element, other than its relationship to other element types. That is, all one can say about an element called (say) <blort> is that instances of it may (or may not) occur within elements of type <farble>, and that it may (or may not) be decomposed into elements of type <blortette>. It should be stressed that XML is entirely unconcerned with the semantics of textual elements, because these are considered to be application dependent. It is up to the creators of XML vocabularies (such as these Guidelines) to choose intelligible element names and to define their intended use in text markup. That is the chief purpose of documents such as the TEI Guidelines. From the need to choose element names indicative of function comes the technical term for the name of an element type, which is generic identifier, or GI.
Content models: an exampleTEI: Content models: an example¶
An element may be empty, that is, it may have no content at all, or it may contain just a sequence of characters with no other elements. Often, however, elements of one type will be embedded (contained entirely) within elements of a different type.
- there is a single element enclosing the whole document: this is known as the root element (<anthology> in our case);
- each element is completely contained by the root element, or by an element that is so contained; elements do not partially overlap one another;
- a tag explicitly marks the start and end of each element.
A well-formed XML document can be processed in a number of useful ways. A simple indexing program could extract only the relevant text elements in order to make a list of headings, first lines, or words used in the poem text; a simple formatting program could insert blank lines between stanzas, perhaps indenting the first line of each, or inserting a stanza number. Different parts of each poem could be typeset in different ways. A more ambitious analytic program could relate the use of punctuation marks to stanzaic and metrical divisions. 11 Scholars wishing to see the implications of changing the stanza or line divisions chosen by the editor of this poem can do so simply by altering the position of the tags. And of course, the text as presented above can be transported from one computer to another and processed by any program (or person) capable of making sense of the tags embedded within it with no need for the sort of transformations and translations needed for files which have been saved in one or other of the proprietary formats preferred by most word-processing programs.
Namespaces allow us to represent the fact that a name belongs to a group of names, but don't allow us to do much more by way of checking the integrity or accuracy of our tagging. Simple well-formedness alone is not enough for the full range of what might be useful in marking up a document. It might well be useful if, in the process of preparing our digital anthology, a computer system could check some basic rules about how stanzas, lines, and headings can sensibly co-occur in a document. It would be even more useful if the system could check that stanzas are always tagged <stanza> and not occasionally <canto> or <Stanza>. An XML document in which such rules have been checked is technically known as a valid document, and the ability to perform such validation is one of the key advantages of using XML. To carry this out, some way of formally stating the criteria for successful validation is necessary: in XML this formal statement is provided by an additional document known as a schema. 12
Validating a document's structureTEI: Validating a document's structure¶
The design of a schema may be as lax or as restrictive as the occasion warrants. A balance must be struck between the convenience of following simple rules and the complexity of handling real texts. This is particularly the case when the rules being defined relate to texts that already exist: the designer may have only the haziest of notions as to an ancient text's original purpose or meaning and hence find it very difficult to specify consistent rules about its structure. On the other hand, where a new text is being prepared to an exact specification, for entry into a textual database of some kind for example, the more precisely stated the rules, the better they can be enforced. Even in the case where an existing text is being marked up, it may be beneficial to define a restrictive set of rules relating to one particular view or hypothesis about the text — if only as a means of testing the usefulness of that view or hypothesis. A schema designed for use by a small project or team is likely to take a different position on such issues than one intended for use by a large and possibly fragmented community. It is important to remember that every schema results from an interpretation of a text. There is no single schema encompassing the absolute truth about any text, although it may be convenient to privilege some schemas above others for particular types of analysis.
XML is widely used in environments where uniformity of document structure is a major desideratum. In the production of technical documentation, for example, it is of major importance that sections and subsections should be properly nested, that cross-references should be properly resolved and so forth. In such situations, documents are seen as raw material to match against predefined sets of rules. As discussed above, however, the use of simple rules can also greatly simplify the task of tagging accurately elements of less rigidly constrained texts. By making these rules explicit, the scholar reduces his or her own burdens in marking up and verifying the electronic text, while also being forced to make explicit an interpretation of the structure and significant particularities of the text being encoded.
An example schemaTEI: An example schema¶
A schema can be expressed in a number of different ways; frequently-encountered methods include the Document Type Definition (DTD) language which XML inherited from SGML; the XML Schema language (http://www.w3.org/XML/Schema) defined by the W3C; and the RELAX NG language (http://relaxng.org/) originally developed within the OASIS Technical Committee and now an ISO standard 13. In this chapter, and throughout these Guidelines, we give examples using the ‘compact syntax’ of RELAX NG, but the specifications within these Guidelines are expressed in a way that is largely independent of the specific language in which a schema generated from them is expressed. 14 Although we will use the RELAX NG compact syntax for illustration in what follows, the reader should bear in mind that analogous concepts are expressed differently in other schema languages.
Note that this is not the only way in which a RELAX NG schema might be written; 15 we have adopted this idiom, however, because it matches that used throughout the rest of the Guidelines.
A RELAX NG schema expresses rules about the possible structure of a document in terms of patterns; that is, it defines a number of named patterns, each of which acts as a kind of template against which an input document can be matched. The meaning of a pattern is expressed in a schema by reference to other patterns, or to a small number of built-in fundamental concepts, as we shall see. In the example above, the word to the left of the equals sign is the pattern's name, and the material following it declares a meaning for the pattern. Patterns may also be of particular types; the ones that interest us here are called element patterns and attribute patterns. In this example we see definitions for five element patterns. Note that we have used similar names for the pattern and the element which the pattern describes: so, for example, the line anthology_p = element anthology {poem_p+} defines an element pattern called anthology_p, the value of which defines an element called anthology. These naming conventions are arbitrary; we could use the same name for the pattern as for the element, since the two are syntactically quite distinct. The name, or generic identifier, of the element follows the word ‘element’, and the content model for the element is given within the curly braces following that. Each of these parts is discussed further below.
The last line of the schema above tells a RELAX NG validator which element (or elements) in a document can be used as the root element: in our case only <anthology>. This enables the validator to detect whether a particular document is well-formed but incomplete; it also simplifies the processing task by providing an ‘entry point’.
Generic identifierTEI: Generic identifier¶
Following the word ‘element’ each pattern declaration gives the generic identifier (often abbreviated to GI) of the element being defined, for example poem, heading, etc. A GI may contain letters, digits, hyphens, underscore characters, or full stops, but must begin with a letter. 16 Uppercase and lowercase letters are quite distinct: an element with the GI <foo> is not the same as an element with the GI <Foo>; the root element of a TEI-conformant document is TEI, not <tei>.
Content modelTEI: Content model¶
The second part of each declaration, enclosed in curly braces, is called the content model of the element being defined, because it specifies what may legitimately be contained within it. In RELAX NG, the content model is defined in terms of other patterns, either by embedding them, or (as in our examples above) by naming or referring to them. The RELAX NG compact syntax also uses a small number of reserved words to identify other possible contents for an element, of which by far the most commonly encountered is text, as in this example: it means that the element being defined may contain any valid character data, but no elements. If an XML document is thought of as a structure like a family tree, with a single ancestor at the top (in our case, this would be <anthology>), then almost always, following the branches of the tree downwards (for example, from <anthology> to <poem> to <stanza> to <line> and <heading>) will lead eventually to text. In our example, <heading> and <line> are so defined, since their content models say text only and name no embedded elements.
Occurrence indicatorsTEI: Occurrence indicators¶
The declaration for <stanza> in the example above states that a stanza consists of one or more lines. It uses an occurrence indicator (the plus sign) to indicate how many times something matching the pattern line_p may be repeated. There are three occurrence indicators: the plus sign, the question mark, and the asterisk or star. The plus sign means that the pattern can match one or more times; the question mark means that it may match at most once but is not mandatory; the star means that the pattern concerned is not mandatory, but may match more than once. Thus, if the content model for <stanza> were {line_p*}, stanzas with no lines would be possible as well as those with more than one line. If it were {line_p?}, again empty stanzas would be countenanced, but no stanza could have more than a single line. The declaration for <poem> in the example above thus states that a <poem> cannot have more than one heading, but may have none, and that it must have at least one <stanza> and may have several.
ConnectorsTEI: Connectors¶
The content model {heading_p?, stanza_p+} contains more than one component, and thus needs additionally to specify the order in which these patterns (<heading_p> and <stanza_p>) may appear. This ordering is determined by the connector (the comma) used between its components. The comma connector indicates that the patterns concerned must appear in the sequence given. Another commonly encountered connector is the vertical bar, representing alternation. If the comma in this example were replaced by a vertical bar, then a <poem> would consist of either a heading or just stanzas – but not both!
GroupsTEI: Groups¶
Some XML schema languages place no constraints on the way that mixed content models may be defined, but in the XML DTD language, when text appears with other elements in a content model: it must always appear as the first option in an alternation; it may appear once only, and in the outermost model group; and if the group containing it is repeated, the star operator must be used. Although these constraints do not apply to (for example) schemas expressed in the RELAX NG language, all TEI content models currently obey them.
v.4. Complicating the issueTEI: Complicating the issue¶
In the simple cases described so far, we have assumed that one can identify the immediate constituents of every element in a textual structure. A poem consists of stanzas, and an anthology consists of poems. Stanzas do not float around unattached to poems or combined into some other unrelated element; a poem cannot contain an anthology. All the elements of a given document type may be arranged into a hierarchic structure like a family tree, with a single ancestor at one end and many children (mostly the elements containing simple text) at the other. For example, we could represent an anthology containing two poems, the first of which contains two four-line stanzas and the second a single stanza, by a tree structure like the following figure:
This graphic representation of the structure of an XML document is close to the abstract model implicit in most XML processing systems. Most such systems now use a standardized way of accessing parts of an XML document called XPath. 19 XPath gives us a non-graphical way of referring to any part of an XML document: for example, we might refer to the last line of Blake's poem as /anthology/poem[1]/stanza[2]/line[4]. The square brackets here indicate a numerical selection: we are talking about the fourth line in the second stanza of the first poem in the anthology. If we left out all the square-bracketted selections, the corresponding XPath expression would refer to all lines contained by stanzas contained by poems contained by anthologies. An XPath expression can refer to any collection of elements: for example, the expression /anthology/poem refers to all poems in an anthology and the expression /anthology/poem/heading refers to all their headings.
The solidus within an XPath expression behaves in much the same way as the solidus or backslash in a filename specification: it indicates that the item to the left directly contains the item to the right of it. In XPath it is also possible to indicate that any number of other items may intervene by repeating the solidus. For example, the XPath expression /anthology/poem//line[1] will refer to the first line of each poem in the anthology, irrespective of whether it is in a stanza.
Clearly, there are many such trees that might be drawn to describe the structure of this or other anthologies. Some of them might be representable as further subdivisions of this tree: for example, we might subdivide the lines into individual words, since in our simple example no word crosses a line boundary. Surprisingly perhaps, this grossly simplified view of what text is (memorably termed an ordered hierarchy of content objects (OHCO) view of text by Renear et al. 20) turns out to be very effective for a large number of purposes. It is not, however, adequate for the full complexity of real textual structures, for which more complex mechanisms need to be employed. There are many other trees that might be drawn which do not fit within the anthology model which we have presented so far. We might, for example, be interested in syntactic structures or other linguistic constructs, which rarely respect the formal boundaries of verse. Or, to take a simpler example, we might want to represent the pagination of different editions of the same text.
In the OHCO model of text, representation of cases where different elements overlap so that several different trees may be identified in the same document is generally problematic. All the elements marked up in a document, no matter what namespace they belong to, must fit within a single hierarchy. To represent overlapping structures, therefore, a single hierarchy must be chosen, and the points at which other hierarchies intersect with it marked. For example, we might choose the verse structure as our primary hierarchy, and then mark the pagination by means of empty elements inserted at the boundary points between one page and the next. Or we could represent alternative hierarchies by means of the pointing and linking mechanisms described in chapter 16 Linking, Segmentation, and Alignment of the Guidelines. These mechanisms all depend on the use of attributes, which may be used both to identify particular elements within a document and to point to, link, or align them into arbitrary structures.
v.5. AttributesTEI: Attributes¶
In the XML context, the word attribute, like some other words, has a specific technical sense. It is used to describe information that is in some sense descriptive of a specific element occurrence but not regarded as part of its content. For example, you might wish to add a status attribute to occurrences of some elements in a document to indicate their degree of reliability, or to add an identifier attribute so that you could refer to particular element occurrences from elsewhere within a document. Attributes are useful in precisely such circumstances.
Although different elements may have attributes with the same name (for example, in the TEI scheme, every element is defined as having an attribute named n), they are always regarded as different, and may have different values assigned to them. If an element has been defined as having attributes, the attribute values are supplied in the document instance as attribute-value pairs inside the start-tag for the element occurrence. An end-tag cannot contain an attribute-value specification, since it would be redundant.
The order in which attribute-value pairs are supplied inside a tag has no significance; they must, however, be separated by at least one whitespace (blank, newline, or tab) character. The value part must always be given inside matching quotation marks, either single or double 21.
Declaring attributesTEI: Declaring attributes¶
Attributes are declared in a schema in the same way as elements. As well as specifying an attribute's name and the element to which it is to be attached, it is possible to specify (within limits) what kind of value is acceptable for an attribute.
A pattern defining the possible values for this attribute is given within the curly braces, in just the same way as a content model is given for an element pattern. In this case, the attribute's value must be one of the strings presented explicitly above.
- boolean
- values must be either true or false
- numeric
- values must represent a numeric quantity of some kind
- date
- values must represent a possible date and time in some calendar
Two further datatypes of particular usefulness in managing XML documents are commonly known as ID — for identifier — and URI — for Universal Resource Indicator, or pointer for short. These are discussed in the next section.
Identifiers and indicatorsTEI: Identifiers and indicators¶
It is often necessary to refer to an occurrence of one textual element from within another, an obvious example being phrases such as ‘see note 6’ or ‘as discussed in chapter 5’. When a text is being produced the actual numbers associated with the notes or chapters may not be certain. If we are using descriptive markup, such things as page or chapter numbers, being entirely matters of presentation, will not in any case be present in the marked-up text: they will be assigned by whatever processor is operating on the text (and may indeed differ in different applications). XML therefore predefines an attribute that may be used to provide any element occurrence with a special identifier, a kind of label, which may be used to refer to it from anywhere else: since it is defined in the XML namespace, the name of this attribute is xml:id and it is used throughout the TEI schema. Because it is intended to act as an identifier, its values must be unique within a given document. The cross-reference itself will be supplied by an element bearing an attribute of a specific kind, which must also be declared in the schema.
The <poemRef> element has no content, but a single attribute called target. The value of this attribute must be a pointer or web reference of type anyURI; 22 furthermore, because there is no indication of optionality on the attribute pattern, it must be supplied on each occurrence — a <poemRef> with no referent is an impossibility.
A processor may take any number of actions when it encounters a link encoded in this way: a formatter might construct an exact page and line reference for the location of the poem in the current document and insert it, or just quote the poem's title or first lines. A hypertext style processor might use this element as a signal to activate a link to the poem being referred to, for example by displaying it in a new window. Note, however, that the purpose of the XML markup is simply to indicate that a cross-reference exists: it does not necessarily determine what the processor is to do with it.
Attributes form part of the structure of an XML document in the same way as elements, and can therefore be accessed using XPath. For example, to refer to all the poems in our anthology whose status attribute has the value draft, we might use an XPath such as /anthology/poem[@status='draft']. To find the headings of all such poems, we would use the XPath /anthology/poem[@status='draft']/heading.
v.6. Other components of an XML documentTEI: Other components of an XML document¶
In addition to the elements and attributes so far discussed, an XML document can contain a few other formally distinct things. An XML document may contain references to predefined strings of data that a validator must resolve before attempting to validate the document's structure; these are called entity references. They may be useful as a means of providing ‘boilerplate’ text or representing character data which cannot easily be keyboarded. An XML document may also contain arbitrary signals or flags for use when the document is processed in a particular way by some class of processor (a common example in document production is the need to force a formatter to start a new page at some specific point in a document); such flags are called processing instructions. And, as noted earlier, an XML document may also contain instances of elements taken from some other namespace. We discuss each of these three cases in the rest of this section.
Character ReferencesTEI: Character References¶
As mentioned above, all XML documents use the same internal character encoding. Since not all computer systems currently support this encoding directly, a special syntax is defined that can be used to represent individual characters from the Unicode character set in a portable way by providing their numeric value, in decimal or hexadecimal notation.
For example, the character é is represented within an XML document as the Unicode character with hexadecimal value 00E9. If such a document is being prepared on (or exported to) a system using a different character set in which this character is not available, it may instead be represented by the character reference é (the x indicating that what follows is a hexadecimal value) or é (its decimal equivalent). References of this type do not need to be predefined, since the underlying character encoding for XML is always the same.
To aid legibility, however, it is also possible to use a mnemonic name (such as eacute) for such character references, provided that each such name is mapped to the required Unicode value by means of a construct known as an entity declaration. A reference to a named character entity always takes the form of an ampersand, followed by the name, followed by a semicolon. For example an XML document containing the string ‘T&C’ might be encoded as T&C.
There is a small set of such character entity references that do not have to be declared because they form part of the definition of XML. These include the names used for characters such as the ampersand (amp) and the open angle bracket or less-than sign (lt), which could not easily otherwise be included in an XML document without ambiguity. Other predeclared entity names are those for quotation marks (quot and apos for double and single respectively), and for completeness the closing angle bracket or greater-than sign (gt).
Processing instructionsTEI: Processing instructions¶
Although one of the aims of using XML is to remove any information specific to the processing of a document from the document itself, it is occasionally very convenient to be able to include such information — if only so that it can be clearly distinguished from the structure of the document. As suggested above, one common example is the need, when processing an XML document for printed output, to include a suggestion that the formatting processor might use to determine where to begin a new page of output. Page-breaking decisions are usually best made by the formatting engine alone, but there will always be occasions when it may be necessary to override these. An XML processing instruction inserted into the document is one very simple and effective way of doing this without interfering with other aspects of the markup.
NamespacesTEI: Namespaces¶
A valid XML document necessarily specifies the schema in which its constituent elements are defined. However, a well-formed XML document is not required to specify its schema (indeed, it may not even have a schema). It would still be useful to indicate that the element names used in it have some defined provenance. Furthermore, it might be desirable to include in a document elements that are defined (possibly differently) in different schemas. A cabinet-maker's schema might well define an element called <table> with very different characteristics from those of a documentalist's.
The concept of namespace was introduced into the XML language as a means of addressing these and related problems. If the markup of an XML document is thought of as an expression in some language, then a namespace may be thought of as analogous to the lexicon of that language. Just as a document can contain words taken from different languages, so a well-formed XML document can include elements taken from different namespaces. A namespace resembles a schema in that we may say that a given set of elements ‘belongs to’ a given namespace, or are ‘defined by’ a given schema. However, a schema is a set of element definitions, whereas a namespace is really only a property of a collection of elements: the only tangible form it takes in an XML document is its distinctive prefix and the identifying name associated with it.
Suppose for example that we wish to extend our anthology to include a complex diagram. We might start by considering whether or not to extend our simple schema to include XML markup for such features as arcs, polygons, and other graphical elements. XML can be used to represent any kind of structure, not simply text, and there are clear advantages to having our text and our diagrams all expressed in the same way.
Fortunately we do not need to invent a schema for the representation of graphical components such as diagrams; it already exists in the shape of the Scalable Vector Graphics (SVG) language defined by the W3C. 23 SVG is a widely used and rich XML vocabulary for representing all kinds of two-dimensional graphics; it is also well supported by existing software. Using an SVG-aware drawing package, we can easily draw our diagram and save it in XML format for inclusion within our anthology. When we do so, we need to indicate that this part of the document contains elements taken from the SVG namespace, if only to ensure that processing software does not confuse our <line> element with the SVG <line>, which means something quite different.
Marked SectionsTEI: Marked Sections¶
We mentioned above that the syntax of XML requires the encoder to take special action if characters with a syntactic meaning in XML (such as the left angle bracket or ampersand) are to be used in a document to stand for themselves, rather than to signal the start of a tag or an entity reference respectively. The predefined entities &, <, and > provide one method of dealing with this problem, if the number of occurrences of such things is small. Other methods may be considered when the number is large, as in an XML document like the present Guidelines, which contains hundreds of examples of XML markup. One is to label the XML examples as belonging to a different namespace from that of the document itself, which is the approach taken in the present Guidelines. Another and simpler approach is provided by one of the features inherited by XML from its parent SGML: the ‘marked section’.
v.7. Putting it all togetherTEI: Putting it all together¶
- how does a processor determine the schema (or schemas) that should be used to validate a given XML document instance?
- if a document contains entity references that must be processed before the document can be validated, where are those entities defined?
- an XML document instance may be stored in a number of different operating system files; how should they be assembled together?
- how does a processor determine which stylesheets it should use when processing an XML document, or how to interpret any processing instructions it contains?
- how does a processor enforce more exact validation than simple datatypes permit (for example of element content)?
Different schema languages and different XML processing systems take very different positions on all of these topics, since none of them is explicitly addressed in the XML specification itself. Consequently, the best answer is likely to be specific to a particular software environment and schema language. Since this chapter is concerned with XML considered independently of its processing environment, we only address them in summary detail here.
Associating entity definitions with a document instanceTEI: Associating entity definitions with a document instance¶
In Character References we introduced the syntax used for the definition of named character entities such as eacute, which XML inherited from SGML. Different schema languages vary in the ways they make a collection of such definitions available to an XML processor, but fortunately there is one method that all current schema languages support.
As well as, and following, the XML declaration (Processing instructions), an XML document instance may be prefixed with a special DOCTYPE statement. This declarative statement has been inherited by XML from SGML; in its full form it provides a large number of facilities, but we are here concerned only with the small subset of those facilities recognized by all schema languages.
Associating a document instance with its schemaTEI: Associating a document instance with its schema¶
Different schema languages adopt entirely different attitudes to this question. A document instance may be valid according to many different schemas, each appropriate to a different processing task. In RELAX NG therefore no facility for associating a particular schema with a particular instance exists: the task is regarded as a specific case of the more general issues addressed by the general architectural framework within which RELAX NG is defined: the ISO draft standard for Document Schema Definition Languages (DSDL). 25
In W3C Schema and in the DTD schema language inherited by XML from SGML, however, a document instance can point directly to the resource or resources that may be used to validate it. In W3C Schema Language, this is usually done by means of an attribute on the root element of the document instance; for XML DTDs the DOCTYPE statement introduced in Associating entity definitions with a document instance is used for this purpose.
Fortunately, any modern XML processing software tool will provide clear ways of carrying out this task appropriate to the particular language chosen. In the interests of maximizing portability of document instances, they should contain as little processing-specific information as possible.
Assembling multiple resources into a single documentTEI: Assembling multiple resources into a single document¶
As we have already indicated, a single XML document may be made up of several different operating system files that need to be pulled together by a processor before the whole document can be validated. The XML DTD language defines a special kind of entity (a system entity) that can be used to embed references to whole files into a document for this purpose, in much the same way as the character or string entities discussed in Character References. Neither RELAX NG nor W3C Schema directly supports this mechanism, however, and we do not discuss it further here.
An alternative way of achieving the same effect is to use a special kind of pointer element to refer to the resources that need to be assembled, in exactly the same way as we proposed for the illustration in our anthology. The W3C Recommendation XML Inclusions (XInclude) 26 defines a generic mechanism for this purpose, which is supported by an increasing number of XML processors.
Stylesheet association and processingTEI: Stylesheet association and processing¶
As mentioned above, the processing of an XML document will usually involve the use of one or more stylesheets, often but not exclusively to provide specific details of how the document should be displayed or rendered. In general, there is no reason to associate a document instance with any specific stylesheet and the schema languages we have discussed so far do not therefore make any special provision for such association. The association is made when the stylesheet processor is invoked, and is thus entirely application-specific.
However, since one very common application for XML documents is to serve them as browsable documents over the Web, the W3C has defined a procedure and a syntax for associating a document instance with its stylesheet (see http://www.w3.org/TR/xml-stylesheet/). This Recommendation allows a document to supply a link to a default stylesheet and also to categorize the stylesheet according to its MIME type, for example to indicate whether the stylesheet is written in CSS or XSLT, using a specialized form of processing instruction.
Most modern web browsers support CSS (although the extent of their implementation varies), and some of them support XSLT.
Content validationTEI: Content validation¶
As we noted above, most schema languages provide some degree of datatype validation for attribute values (Declaring attributes). They vary greatly in the validation facilities they offer for the content of elements, other than the syntactic constraints already discussed. Thus, while we may very easily check that our <stanza> elements contain only <line> elements, we cannot easily check that <line> elements contain between five and 500 correctly-spelled English words, should we wish to constrain our poetry in such a way. Also, because attributes and elements are treated differently, it is difficult or impossible to express co-occurrence constraints: for example, if the status of a poem is draft we might wish to permit elements such as <editorialQuery> within its content, but not otherwise.
The XML DTD language offers very little beyond syntactic checking of element content. By contrast, a major impetus behind the design and development of the W3C schema language was the addition of a much more general and powerful constraint language to the existing structural constraints of XML DTDs. In RELAX NG the opposite approach was taken, in that all datatype validation, whether of attributes or element content, is regarded as external to the schema language. For attributes, as we have seen, RELAX NG makes use of the W3C Schema Datatype Library (but permits use of others). Because RELAX NG treats both elements and attributes as special cases of patterns, the same datatype validation facilities are available for element content as for attribute values; it is unlike other schema languages in this respect. In addition, for content validation, a different component of DSDL known as Schematron can be used. Schematron is a pattern matching (rather than a grammar-based) language, which allows us to test the components of a document against templates that express constraints such as those mentioned above.
Like other XML processors, Schematron uses XPath to identify parts of an XML document; in addition, it provides elements that describe assertions to be tested and conditions which must be validated, as well as elements to report the results of the test.
↑ Contents « iv. About These Guidelines » vi. Languages and Character Sets