Contents
- Introduction
- Pointers and links in the TEI
- Changes to Attributes
- Simple use of XPointer schemes
- Replacing IDREFS
- Proposed additions to W3C Pointing schemes
Introduction
This document defines the linking methods to be used in TEI P5 including creating files that conform to the ODD tag-set. It particularly addresses the issues of the name of the linking attribute and how IDREFS attributes will be handled, as requested by Council during its meeting of Fri 14 May.
Pointers and links in the TEI
XML documents represent hierarchical markup structures with elements all of whose contents are contiguous. As not all phenomena that the TEI Guidelines represent conform to that model, cross-references between XML elements are used to represent some textual and editorial structures. For purposes of the TEI we distinguish two applications of pointing between TEI elements. One is called linking. This is the now well-known notion of the user-selectable hypertext link, whose predecessors are the cross-references in print or manuscript codices. Links are created by authors or editors to indicate possible navigational options to a reader of an encoded text. Because a link refers to another document or another portion of the document within which it resides, a link must indicate at least one address, the location of another relevant textual object.
<ref target="#chap2">See chapter two</ref>
Cross-references and other explicitly authored hypertext links are not the only uses of addresses needed in TEI text encoding, however. Some XML cross references are used to construct representations of document structures (like the continuations of elements interrupted by other elements), that do not fit the XML model of contiguous hierarchical containment. In P5, all such addresses, which in P4 are represented using either ID/IDREF attributes or TEI Extended Pointers, will be represented by the use of the standard URI reference mechanism used on the world wide web. (Note that use of the part attribute for this purpose is not affected, as it does not use a pointer.)
<sp xml:id="king1" next="#king2" who="#King">Speak, knave or</sp> <sp who="#Fool">What would you have me say?</sp> <sp xml:id="king2" prev="#king1" who="#King">— or lose your tongue, but if it be not civil it will be forfeit in any case</sp>
Because of the wide acceptance of URI-based addressing in all sorts of software systems and the weakening of support for the traditional ID/IDREF mechanism in XML schema languages, all TEI addresses use URI references as defined in the URI specification (RFC2396). URI references are the same mechanism used by HTML to create hypertext links, and are capable of addressing single XML resources identified by a URI. In addition to identifying a resource, a URI reference can indicate a sub-portion of that resource by means of an appropriate fragment identifier (the ‘Fragment-ID’: the portion of a URI reference following the first unescaped ‘#’ character). The details of the interpretation of the Fragment-ID depend on the Internet MIME-type of the resource identified by the URI.
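As a rough illustration of the split between resource and Fragment-ID described above, the following Python sketch uses the standard library function urllib.parse.urldefrag. The URI http://example.org/doc.xml and the helper name split_reference are invented for this example.

```python
from urllib.parse import urldefrag


def split_reference(uri_reference):
    """Split a URI reference into (resource, fragment_id).

    The fragment is everything after the first '#'; it is empty when
    the reference carries no Fragment-ID.
    """
    resource, fragment = urldefrag(uri_reference)
    return resource, fragment


# Hypothetical references, for illustration only.
external = split_reference("http://example.org/doc.xml#chap2")
internal = split_reference("#chap2")
```

A same-document reference such as #chap2 yields an empty resource part, which is why it is resolved against the base URI of the containing document.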
The W3C has defined an extensible syntax for addressing within XML content, called the XPointer Framework. Within this framework, multiple addressing methods are distinguished by a named XPointer Scheme. The W3C has defined a useful but limited set of schemes for pointing within XML documents.
The TEI will define a MIME-Type for TEI documents. (The actual type will be determined during the application process, but is likely to be something like application/tei+xml.) In addition to providing useful protocol support for the publication of TEI documents on the web, the TEI Consortium is then free to define rules for the interpretation of the Fragment-ID portion of a URI reference to a TEI document. The TEI will use the W3C XPointer framework for its Fragment-ID syntax. The ability to define rules for xml/TEI documents allows the TEI to adopt the W3C's XPointer Framework, but to simultaneously use the syntax extension mechanisms of the framework to support additional pointing schemes which do not already exist, but are needed for TEI applications — for instance, arbitrary ranges within text, location by means of pointing languages like XPath, and perhaps location by means of regular expressions at some point in the future.
Each pointer scheme used in a Fragment-ID is identified by a scheme name. A single Fragment-ID may contain several alternative ways of addressing a sub-part of an XML resource. These variants are interpreted in order, and applications are free to ignore unknown schemes. Alternative (possibly less precise) location methods can therefore be given in any address, so that applications unable to interpret a given scheme (like a TEI extension) can use alternate means to resolve the address. The TEI accepts the W3C's recommendations, as issued in the three XPointer Recommendations (the XPointer Framework, element(), and xmlns()), and the XLink recommendation.
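The fallback behaviour described above can be sketched in Python as follows. This is a simplified reading of the framework syntax, not a conforming parser: it assumes parentheses inside the scheme data are balanced and ignores the framework's circumflex escaping. The function names parse_pointer_parts and first_supported are hypothetical.

```python
def parse_pointer_parts(fragment):
    """Split a scheme-based Fragment-ID into (scheme, data) pairs.

    Simplification: parentheses inside the data must be balanced, and
    the framework's circumflex escaping is not handled.
    """
    parts = []
    i = 0
    n = len(fragment)
    while i < n:
        if fragment[i].isspace():
            i += 1
            continue
        open_paren = fragment.find("(", i)
        if open_paren == -1:
            raise ValueError("expected scheme(data) at position %d" % i)
        scheme = fragment[i:open_paren]
        # Walk forward to the parenthesis that closes this pointer part.
        depth = 0
        j = open_paren
        while j < n:
            if fragment[j] == "(":
                depth += 1
            elif fragment[j] == ")":
                depth -= 1
                if depth == 0:
                    break
            j += 1
        if depth != 0 or j >= n:
            raise ValueError("unbalanced parentheses in part %r" % scheme)
        parts.append((scheme, fragment[open_paren + 1:j]))
        i = j + 1
    return parts


def first_supported(fragment, supported):
    """Return the first pointer part whose scheme is in `supported`."""
    for scheme, data in parse_pointer_parts(fragment):
        if scheme in supported:
            return scheme, data
    return None


# A TEI-style part followed by a less precise W3C fallback.
EXAMPLE = "string-range(xpath(//div[2]/p[4]),20) element(/1/2/4)"
```

An application that knows only element() skips the first part of EXAMPLE and resolves the second; one that knows neither scheme gets no result.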
Adoption of the URI Reference and its associated infrastructure offers significant advantages in terms of implementability, since many tools already exist for managing and using URIs. There are some implications for TEI markup as well.
Two critical W3C constructs will be adopted by the TEI. The xml:base attribute defined by the W3C (in XML Base) allows encoders to control the interpretation of relative URIs. This attribute will be available to TEI encoders globally.
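To make the effect of xml:base concrete, here is a minimal Python sketch that resolves target values against a base URI using urllib.parse.urljoin. The sample fragment, its URIs, and the function name resolved_targets are invented for illustration, and only an xml:base on the root element is considered; nested xml:base attributes are not handled.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Hypothetical document fragment; URIs and identifiers are invented.
SAMPLE = """<div xml:base="http://example.org/P5/ST/st.xml">
  <ptr target="../CO/co.xml#COXR"/>
  <ptr target="#STIN"/>
</div>"""


def resolved_targets(xml_text):
    """Resolve each ptr/@target against the root element's xml:base.

    Simplification: xml:base attributes on descendants are ignored.
    """
    root = ET.fromstring(xml_text)
    base = root.get(XML_BASE, "")
    return [urljoin(base, ptr.get("target")) for ptr in root.iter("ptr")]
```

Under these assumptions the relative reference is resolved into the sibling directory, while the bare fragment stays within the base document.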
The other construct is the xml:id attribute currently being defined by the W3C (in xml:id). This global attribute allows schema-language independent marking of XML ID attributes in the document instance. Since the simplest form of XPointer allowed in the XPointer framework is a bare name identifying an ID attribute in the XML file, all the capabilities of the current ID/IDREF-based pointing method will be supported, albeit with syntactic differences. Users who wish to retain the ID/IDREF mechanism used in P4 will have to use the TEI's customization mechanism to redefine the Address class. The TEI-C might provide a ready-made customization module for this purpose (perhaps easily selectable from Roma).
The use of URI references (with XPointer schemes in fragment identifiers) has several advantages:
- Unlike IDs, URI references are independent of parsing context as they can uniformly make reference to any data having a URI. ID/IDREF only works when all data is parsed at a single time.
- URIs and fragment identifiers are now very much more commonly used and understood by implementers than unparsed external entities and strings referring to IDs in those entities (which is the mechanism P4 officially recommends, although some users, including the TEI-C itself, use Sebastian's hack of adding a url attribute to <xptr> and <xref> ).
- URI references are also independent of DTD declarations, which is convenient for users of non-DTD schema languages, who are likely to include most P5 users. (Although schemas for P5 will be available in multiple schema languages including DTDs, it is reasonable to believe that most users will use the canonical RelaxNG schemas.)
- In P4, references using IDs to point to an element in the same document must use different mechanisms and attribute conventions than those referring to external elements; URI references handle both cases uniformly.
Changes to Attributes
This section discusses the intended changes to the main pointing attributes of <ptr> and <ref> for P5, and reviews the rationale behind these changes.
The global xml:base attribute will support encoder control of the interpretation of URIs within documents. It will be added to tei.global.attributes.
The current global id attribute will be replaced by the xml:id attribute.
<list> <oneOrMore> <data type="anyURI"/> </oneOrMore> </list>
Attributes that are declared in P4 as IDREF will become tei.pointer; those declared in P4 as IDREFS will become tei.pointers.
A new attribute, cref, will be added to <ptr> and <ref> . This attribute and the target attribute are mutually exclusive — i.e. specifying both is an error. This attribute's value is a canonical reference which is to be transformed into an XPointer by applying the TEI algorithm for canonical references. (See SO W 08, currently being updated.)
The resp, crdate, targType, and targOrder attributes will be removed from the a.pointer class.
Rationale for Dropping resp, crdate, targType, and targOrder
resp and crdate in P4 provide a rudimentary way of assigning some metadata to the pointing element. We believe these should be dropped in the interests of keeping things as simple as possible, and under the ‘do it right or don't bother to do it at all’ doctrine. Perhaps the TEI would want to develop a ‘metadata for complicated encoding’ module sometime in the future.
targType provides a simple kind of pointer validation capability. We are not at all sure that anyone uses it in any meaningful way, especially since it is so easy to imagine situations where the little constraint it provides is not at all useful (e.g., it cannot say ‘should point to an element with TEIForm='div'’; and in many linguistic applications almost any element it would point at would be <seg> anyway).
targOrder specifies whether the order in which multiple targets are given is significant. This typically depends on the type of the link, and need not be separately specified. The significance of such orderings may itself differ in different uses of the same document. For instance, textual applications like parsers or speech synthesis may need to process links in a particular order, while information-retrieval or archiving applications could use any convenient ordering.
On Attribute Name
Before Council's Fri 14 May meeting in Ghent, the work group had not come to any consensus on the name of the pointing attribute known as target in P4. Since then, we have settled on leaving the attribute name as it is, target. The main competitor name was href, chosen to match the name used by HTML. It was rejected once it was realized that in TEI this attribute can point to multiple places (i.e., is of type tei.pointers), unlike the HTML attribute, which can only point to one element.
Simple use of XPointer schemes
The simplest XPointer fragment points to an element labeled with an ID attribute. Because of the popularity of HTML, the syntax is defined so that a Fragment-ID without an explicit pointer scheme is interpreted as an IDREF. Thus, in the example above, each of the ID references is preceded by a '#' character. While these look like IDREFs with an extra character in front, they are actually relative URIs with empty paths, and thus indicate the base URI of the XML file within which they are contained.
When dealing with a document that is stored in multiple parts, best practice in pointing dictates that the relative links used to refer to IDs should include path information for all IDs stored in a separate file. In this way addresses will be correctly resolved when parsing and processing a single resource within a large document stored in several parts. For some large documents, relative URIs may indicate locations within several directories of a file system. While there are many management issues implicit in this practice, they are well addressed in many works on the management of HTML linking.
For example, a P4 reference such as
<p>In section <ptr target="COXR"/> we introduced the simplest pointer elements, <gi>ptr</gi> and <gi>ref</gi>.
would be replaced by a reference such as
<p>In section <ptr target="../CO/co.odd#COXR"/> we introduced the simplest pointer elements, <gi>ptr</gi> and <gi>ref</gi>.
A reference like this can be resolved appropriately, regardless of whether or not the entire P5 document is being processed. A reference to an ID within the same file needs only the fragment identifier:
<p>In section <ptr target="#COXR"/> we introduced the simplest pointer elements, <gi>ptr</gi> and <gi>ref</gi>.
This standard practice corresponds to that used in HTML, which also makes the linking and addressing features of the TEI familiar and easy to understand. Within large documents like the Guidelines themselves, the use of relative URIs in this way also makes it easier for authors and maintainers of the documents to resolve an address without searching or processing the entire document.
Replacing IDREFS
The WG considered three options for attributes declared as IDREFS in P4:
- keep IDREFS
- use child elements
- separate URIs in a single attribute value
The WG feels that keeping IDREFS is, simply put, a bad idea. There seems to be little point in having one part of the TEI (the attributes that in P4 are IDREFs) use XPointers and another part (the attributes that in P4 are IDREFS) use an entirely different, older mechanism, even though RelaxNG has a compatibility mode. First of all, users would find it at least annoying, if not confusing. Secondly, different kinds of software would be needed to process similar kinds of links.
Using child elements to replace IDREFS attributes may make good sense in certain specific cases, particularly in those cases where a new element is being invented for P5 anyway; however, trying to change the 20 attributes in P4 that use IDREFS (5 of which are global) would be an enormous task that would change the very nature of the TEI encoding scheme, without much obvious gain.
Using multiple URIs separated by whitespace in a single attribute value seems to make the most sense. In all cases currently encoded in P4, and in lots of future cases, the URIs will in fact be nothing more than simple name fragment-IDs, i.e. things like #duck #quack #foo #bar. These are as easy to parse and process as IDREFS are. Even many URIs that point outside the current document will be as simple as doc2#foo doc4#bar, etc. URIs with internal whitespace require escaping. For projects where many complex URIs are needed, and space-separated attributes would be inappropriate, an extension using elements may be used.
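To show how little processing a whitespace-separated list of URI references needs, here is a short Python sketch. The function names split_targets, classify, and escape_path are hypothetical, and the file name in the escaping example is invented.

```python
from urllib.parse import quote, urldefrag


def split_targets(value):
    """Split an attribute value into its whitespace-separated URI references."""
    return value.split()


def classify(value):
    """Pair each reference with its (resource, fragment) components."""
    return [tuple(urldefrag(ref)) for ref in split_targets(value)]


def escape_path(path):
    """Percent-escape a path so that it contains no literal whitespace."""
    return quote(path)


local = split_targets("#duck #quack  #foo\n#bar")
remote = classify("doc2#foo doc4#bar")
escaped = escape_path("my notes.xml")
```

Because a literal space separates references, a path such as the invented "my notes.xml" has to be escaped before it can appear in the list.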
Proposed additions to W3C Pointing schemes
Overview
There are several features of TEI Extended Pointers that are not supported by the W3C's XPointer schemes. Currently the W3C has defined 3 pointer schemes: bare names (strings of name characters following the ‘#’, as in HTML), element() (which provides abbreviated pointing by means of child numbers), and xmlns() (which is used to declare namespaces for user-extended pointer schemes). While there is a W3C working draft of a more complete pointer scheme, supporting many but not all of the features of the TEI Extended Pointer system, there is no current or scheduled activity towards revising this draft or issuing it as a recommendation. Given a TEI MIME-Type, the TEI can define any additional addressing schemes needed, while maintaining full standards conformance with W3C recommendations and Internet RFCs.
To match the features of TEI extended pointers, the TEI will define seven new XPointer schemes: xpath(), xpath2(), range(), string-range(), left(), right(), and match(). These schemes overlap in functionality with the W3C's xpointer() scheme draft, but are individually much simpler. Each new scheme is either a completely new facility, or a reference to an existing standard which is adopted without modification.
The new TEI pointer schemes extend the data model of link endpoints beyond the XPath concept of the node. Since XPath provides a very useful way to address nodes and node-sets, we preserve those abilities by incorporating XPath as a scheme. Since spans and character ranges are needed data types for linking, those data types will be returned by new pointer schemes. Since the new forms of addressing also require the selection of nodes in exactly the manner that XPath already allows, it does not make sense to duplicate the separate addressing functions already provided by the element() and proposed xpath() pointing schemes.
Therefore, the new schemes are defined simply, but with the ability to recursively use any other XPointer scheme as an argument. Not only does this separate support for spans from support for other types of addressing, but it enables applications to support pointer schemes like element() + range(), if full XPath addressing is not needed, even though spans are.
xpath(path)
The xpath scheme locates a node within an XML Information Set. The single argument path is an XPath path as defined in the W3C XPath 1 Recommendation. The node resulting from evaluating the XPath is the reference of an address using the xpath() scheme.
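A minimal Python sketch of resolving an xpath() pointer follows. It relies on the standard library's xml.etree.ElementTree, which implements only a small subset of XPath 1.0, so it stands in for a full XPath processor purely for illustration. The sample document and the function name resolve_xpath_pointer are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, for illustration only.
SAMPLE = (
    "<doc>"
    "<div><p>one</p><p>two</p></div>"
    "<div><p>three</p><p>four</p></div>"
    "</doc>"
)


def resolve_xpath_pointer(root, pointer):
    """Return the first element addressed by an xpath(path) pointer.

    Simplification: ElementTree supports only a subset of XPath 1.0 and
    cannot evaluate absolute paths on an element, so a leading '//' is
    rewritten to './/' and other absolute paths are rejected.
    """
    prefix, suffix = "xpath(", ")"
    if not (pointer.startswith(prefix) and pointer.endswith(suffix)):
        raise ValueError("not an xpath() pointer: %r" % pointer)
    path = pointer[len(prefix):-len(suffix)]
    if path.startswith("//"):
        path = "." + path
    elif path.startswith("/"):
        raise ValueError("absolute paths are not supported in this sketch")
    return root.find(path)


root = ET.fromstring(SAMPLE)
node = resolve_xpath_pointer(root, "xpath(//div[2]/p[2])")
```

A production implementation would hand the path to a complete XPath 1.0 engine instead of rewriting it.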
xpath2(path)
The xpath2() scheme locates a node within an XML Information Set. The single argument path is an XPath 2.0 path as defined in the W3C XPath 2 Recommendation. The node resulting from evaluating the XPath is the reference of an address using the xpath2() scheme.
left(pointer) and right(pointer)
- A Node: When pointer resolves to a node, the point designated is the point immediately preceding (left()) or following (right()) the node.
- A Range: When pointer resolves to a range, the point designated is the start (left()) or end (right()) of the range.
- A Point: When pointer resolves to a point, that point is the result. The pointer schemes left() and right() make no change when given a point as argument.
range(pointer1, pointer2)
- A Node: When pointer1 resolves to a node, the starting point of the range is the point immediately preceding the node. When pointer2 resolves to a node, the ending point of the range is the point immediately following the node. It is an error if the ending point precedes the starting point of a range.
- A Range: When pointer1 resolves to a range R, the starting point of the result range is the same as the starting point of R. When pointer2 resolves to a range R, the ending point of the result range is the ending point of R.
- A Point: When pointer1 resolves to a point, that point is the start of the range. When pointer2 resolves to a point, that point is the end of the range.
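The rules for left(), right(), and range() can be modelled compactly if every location is reduced to positions in a single linear document order. The Python sketch below makes that simplifying assumption: a point is an integer, and a node or range is a Span holding the points immediately before and after it. The names Span and make_range are invented for this illustration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A node or range, reduced to the points that bound it."""
    start: int
    end: int


def left(loc):
    """Point preceding a node, start of a range, or the point itself."""
    return loc.start if isinstance(loc, Span) else loc


def right(loc):
    """Point following a node, end of a range, or the point itself."""
    return loc.end if isinstance(loc, Span) else loc


def make_range(loc1, loc2):
    """Combine two locations as range(pointer1, pointer2) does."""
    start, end = left(loc1), right(loc2)
    if end < start:
        raise ValueError("ending point precedes starting point")
    return Span(start, end)
```

In this model range() is simply left() applied to its first argument and right() applied to its second, which is why the three cases for each argument mirror the three cases for left() and right().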
string-range(pointer, offset, [length])
The string-range() scheme locates a range based on character positions. While string-range endpoints are points adjacent to character positions, they must be designated by the characters to which they are adjacent, in the same way that the nodes corresponding to XML elements are. This avoids ambiguity about which point between two characters is indicated when characters are interrupted by markup.
The pointer argument to string-range() designates a node or a range within which a string is to be located. No string range, even an empty one, can be defined by a string-range() if pointer has the empty string as string value. Every string-range is defined based on an ‘origin character’. The origin is numbered 0, and designates the first character of the string-value of pointer. The offset is a character index relative to the origin; the start of the resulting range is the position designated by the sum of the origin and offset.
If length is specified, the end of the range is at a point adjacent to the character designated by the origin added to length. If the offset is negative, or length is sufficiently large, a string-range can designate characters outside the string-value of the initial pointer. In this case, characters are located using the string-value of the entire document. It is also legal for length plus the origin to exceed the length of the string-value of the document by one, in order to accommodate ranges that include the last character of a document.
If length is not specified, it defaults to the value 1, and the string range contains one character. If it is specified as 0, the zero-length range is interpreted as the point immediately preceding the origin character.
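One possible reading of these rules is sketched below in Python. It is deliberately narrow: it covers only ranges that stay inside the string-value of pointer, it represents a range as a pair of 0-origin positions with an exclusive end, and it treats a zero-length range as a point (start equal to end). The fallback to the string-value of the whole document is not implemented. The function name string_range and the sample text are assumptions of this sketch.

```python
def string_range(string_value, offset, length=1):
    """Return (start, end) positions for string-range(pointer, offset, length).

    Positions are 0-origin and `end` is exclusive, so the selected
    characters are string_value[start:end]. Only ranges that lie inside
    `string_value` are handled here.
    """
    if string_value == "":
        raise ValueError("pointer has an empty string-value")
    if length < 0:
        raise ValueError("length must not be negative")
    start = offset
    end = offset + length
    if start < 0 or end > len(string_value):
        # The text above locates such ranges in the string-value of the
        # entire document; that step is outside this sketch.
        raise NotImplementedError("range leaves the pointer's string-value")
    return start, end


TEXT = "What would you have me say?"
start, end = string_range(TEXT, 5, 5)
selected = TEXT[start:end]
```

Keeping the end exclusive lets a zero-length result denote the gap before the character at the given offset, which matches the document's preference for 0-origin addressing of gaps.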
match(pointer, string[, index])
The match scheme designates the result of a literal match of the argument string within the string-value of the pointer argument. The result is a range from the first matching character to the last. It is an error if there is no matching string. A match may not extend outside the range corresponding to the string value of pointer.
The index argument is an integer greater than or equal to 1, specifying which match should be chosen when there is more than one match within the string-value of pointer. If no index is provided, the default value is 0, indicating the first match found.
Question: should we use 1-origin addressing here? We use 0-origin addressing everywhere else, because it's easier to understand when addressing gaps in strings. It's less natural when counting matches. My user knowledge tells me to be inconsistent here, but my mathematical side rebels. This is not really a technical issue but rather a user-preference one, and so I'd love some guidance.
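For discussion purposes, the match() scheme might be sketched in Python as below. Since the origin of index is still an open question, this sketch simply assumes 1-origin counting with a default of 1; it also assumes that overlapping occurrences are counted separately and that the result is a pair of 0-origin positions with an exclusive end. None of these choices is settled by the text above, and the function name is illustrative.

```python
def match(string_value, needle, index=1):
    """Return (start, end) of the index-th literal match of `needle`.

    Assumptions of this sketch: `index` is 1-origin, overlapping
    occurrences count separately, and `end` is exclusive.
    """
    if index < 1:
        raise ValueError("index must be 1 or greater")
    if needle == "":
        raise ValueError("cannot match the empty string")
    pos = -1
    for _ in range(index):
        # Resume the search one character after the previous match start.
        pos = string_value.find(needle, pos + 1)
        if pos == -1:
            raise ValueError("no matching string")
    return pos, pos + len(needle)


first = match("abcabc", "bc")
second = match("abcabc", "bc", 2)
```

Switching to 0-origin counting would only change the validity check and the number of loop iterations, so the choice can be made on usability grounds alone.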
Examples & discussion
- A pointer to an XML document stored at the TEI consortium web server (currently the list of errors reported and fixed in P4):
http://www.tei-c.org/Drafts/edw77.xml
This points to the entire XML document.
- A pointer to (the HTML version of) this example:
http://www.tei-c.org/Activities/SO/sow09.html#sample-example
- A pointer to (the XML version of) the previous example, when the document containing the pointer shares the same URI path as the designated document:
./sow09.xml#sample-example
or
sow09.xml#sample-example
- A simple xpath match to an anchor that is a whole node:
http://foo.org/foo.xml#xpath(//p[5])
This selects the 5th paragraph of http://foo.org/foo.xml
- The following selects the 20th character of the fourth paragraph of the second div of the XML document found at http://foo.org/foo.xml:
http://foo.org/foo.xml#string-range(xpath(//div[2]/p[4]),20)
- This could also be used with the element() scheme:
http://foo.org/foo.xml#string-range(element(/2/2/4), 20)
The second pointer will pick the 20th character of the indicated paragraph.
http://foo.org/foo.xml#range(element(/1/2/3),element(/1/4))
alternatively:
http://foo.org/foo.xml#range(element(/1/2/3),xpath(/doc/chapter[4]))