Canonical References
Introduction
A canonical reference is a means, specific to a community or corpus, of pointing into documents. For example, biblical scholars might understand ‘Matt 5:7’ to mean ‘the book called Matt, chapter 5, verse 7.’ They might then wish to translate the string ‘Matt 5:7’ into a pointer into a TEI-encoded document, selecting the element which corresponds to the seventh <div> element within the fifth <div> element within the <div> element with the n attribute valued ‘Matt.’
<refsDecl id="biblical">
  <fragmentPattern re="(.+) (.+):(.+)" pat="xpath(//div[@n='$1']/div[$2]/div[$3])">
    <refDesc>This pointer pattern extracts and references the <q>book,</q> <q>chapter,</q> and <q>verse</q> parts of a biblical reference.</refDesc>
  </fragmentPattern>
  <fragmentPattern re="(.+) (.+)" pat="xpath(//div[@n='$1']/div[$2])">
    <refDesc>This pointer pattern extracts and references the <q>book</q> and <q>chapter</q> parts of a biblical reference.</refDesc>
  </fragmentPattern>
  <fragmentPattern re="(.+)" pat="xpath(//div[@n='$1'])">
    <refDesc>This pointer pattern extracts and references just the <q>book</q> part of a biblical reference.</refDesc>
  </fragmentPattern>
</refsDecl>
An explanation of the use of this element is given below.
Algorithm for extracting and referencing targets
Suppose an application encounters a canonical reference in a passage such as ‘This story is continued in <ptr cref="Matt 5:7" decls="#biblical"/>.’ and wants to be able to convert it to a standard URI Reference that corresponds to ‘Matt 5:7’.
The application first follows the URI in the decls attribute, which points to a <refsDecl> element in the local document or a remote document. Within that declaration (see above for the corresponding example declaration), it consults the list of <fragmentPattern> elements and, taking each pattern in turn, applies its regular expression to the reference ‘Matt 5:7’. If the first regular expression matches, it applies the matched substrings (in this case, ‘Matt’, ‘5’, and ‘7’) to the string in the pat attribute of that <fragmentPattern> element, substituting the first matched substring for $1, the second for $2, and so on, to produce a Fragment Identifier. It then takes that Fragment Identifier and appends it (with an intervening #) to each of the URIs specified by the <ptr> elements that precede the <fragmentPattern> elements to generate a URI Reference. If the regular expression in the first <fragmentPattern> element does not match, the regular expression in the second <fragmentPattern> element is tried, and so on.
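The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the function names resolve and expand are hypothetical, the declaration is assumed to have already been read into a list of (re, pat) pairs plus a list of base URIs, and the regular expression is assumed to have to match the whole reference string, which the text above does not state explicitly.

```python
import re

# (re, pat) pairs in document order, copied from the example declaration.
PATTERNS = [
    ("(.+) (.+):(.+)", "xpath(//div[@n='$1']/div[$2]/div[$3])"),
    ("(.+) (.+)", "xpath(//div[@n='$1']/div[$2])"),
    ("(.+)", "xpath(//div[@n='$1'])"),
]


def expand(pat, match):
    """Replace $1, $2, ... in pat with the corresponding matched substrings."""
    return re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), pat)


def resolve(cref, patterns, base_uris):
    """Return one URI Reference per base URI, using the first pattern that matches."""
    for regex, pat in patterns:
        # Assumption: the expression must match the entire reference.
        match = re.fullmatch(regex, cref)
        if match:
            fragment = expand(pat, match)
            return [f"{base}#{fragment}" for base in base_uris]
    return []  # no pattern matched the reference


if __name__ == "__main__":
    bases = ["http://www.jph.org/resources/books/Bible.xml"]
    print(resolve("Matt 5:7", PATTERNS, bases))
```

Returning an empty list when nothing matches is a choice made for this sketch; an application might equally well report an error in that case.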
Worked examples
Specifically, in this case, the application would first apply the regular expression (.+) (.+):(.+) to ‘Matt 5:7’. This regular expression would successfully match. The first matched substring would be ‘Matt’, the second ‘5’, and the third ‘7’. The application would then apply these substrings to the pattern xpath(//div[@n='$1']/div[$2]/div[$3]), producing xpath(//div[@n='Matt']/div[5]/div[7]). It would append this, with an intervening #, to the xml:base in force, thus generating the URI Reference http://www.jph.org/resources/books/Bible.xml#xpath(//div[@n='Matt']/div[5]/div[7]).
If, however, the input string had been ‘Matt 5’, the first regular expression would not have matched. The application would have then tried the second, (.+) (.+), producing a successful match, and the matched substrings ‘Matt’ and ‘5’. It would have then proceeded to produce the URI Reference http://www.jph.org/resources/books/Bible.xml#xpath(//div[@n='Matt']/div[5]).
If the input string had been ‘Matt’, neither the first nor the second regular expression would have matched. The application would have then tried the third, (.+), producing the matched substring ‘Matt’, and the URI Reference http://www.jph.org/resources/books/Bible.xml#xpath(//div[@n='Matt']).
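The three worked examples above can be checked mechanically. The following Python sketch is illustrative only: the name to_uri_reference is hypothetical, the base URI is the one used in the examples, and whole-string matching is assumed. For each input it reports which pattern (counting from 1) matched first, together with the resulting URI Reference.

```python
import re

BASE = "http://www.jph.org/resources/books/Bible.xml"

# (re, pat) pairs in the order they appear in the example declaration.
PATTERNS = [
    ("(.+) (.+):(.+)", "xpath(//div[@n='$1']/div[$2]/div[$3])"),
    ("(.+) (.+)", "xpath(//div[@n='$1']/div[$2])"),
    ("(.+)", "xpath(//div[@n='$1'])"),
]


def to_uri_reference(cref):
    """Return (pattern position, URI Reference) for the first matching pattern."""
    for position, (regex, pat) in enumerate(PATTERNS, start=1):
        match = re.fullmatch(regex, cref)  # assumption: whole-string match
        if match is None:
            continue  # fall through to the next pattern
        fragment = re.sub(
            r"\$(\d+)", lambda m: match.group(int(m.group(1))), pat
        )
        return position, f"{BASE}#{fragment}"
    return None


if __name__ == "__main__":
    for cref in ("Matt 5:7", "Matt 5", "Matt"):
        print(cref, "->", to_uri_reference(cref))
```

Because the patterns are tried in order, the most specific expression has to come first: the third pattern, (.+), would match every one of these inputs if it were listed at the top.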
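A pattern that refers to more substrings than its regular expression can capture is an error. For example, <fragmentPattern re="(.+) (.+):(.+)" pat="//div[@n='$1']/div[$2]/div[$3]/p[$4]"/> would produce an error, since only three matched substrings would have been produced, but a fourth ($4) was referenced.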
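A mismatch of this kind can be detected before any reference is resolved, by comparing the highest $n used in pat with the number of capturing groups in re. The Python sketch below shows one way to do so; check_pattern is a hypothetical helper, and raising ValueError is merely this sketch's way of signalling the error, since the text does not say how an application should report it.

```python
import re


def check_pattern(regex, pat):
    """Raise ValueError if pat references a group that regex cannot produce.

    Returns the highest group number referenced in pat (0 if none).
    """
    available = re.compile(regex).groups  # number of capturing groups
    referenced = [int(n) for n in re.findall(r"\$(\d+)", pat)]
    highest = max(referenced, default=0)
    if highest > available:
        raise ValueError(
            f"pat references ${highest}, but re has only {available} group(s)"
        )
    return highest


if __name__ == "__main__":
    try:
        check_pattern("(.+) (.+):(.+)", "//div[@n='$1']/div[$2]/div[$3]/p[$4]")
    except ValueError as err:
        print("invalid pattern:", err)
```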
Encoders may well prefer much more precise regular expressions than those used as examples above, for instance ^\s*([1-9]?[A-Z][a-z]+)\s+([1-9][0-9]?[0-9]?):([1-9][0-9]?)\s*$.
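The practical difference between the loose and the precise expression can be seen by running both against a handful of strings. The Python sketch below is illustrative; the sample strings and the function name parse_strict are invented for the purpose, and Python's re module is assumed to be an acceptable stand-in for whatever regular expression dialect an implementation actually uses.

```python
import re

LOOSE = re.compile(r"(.+) (.+):(.+)")
STRICT = re.compile(
    r"^\s*([1-9]?[A-Z][a-z]+)\s+([1-9][0-9]?[0-9]?):([1-9][0-9]?)\s*$"
)


def parse_strict(cref):
    """Return (book, chapter, verse) if cref fits the precise pattern, else None."""
    match = STRICT.match(cref)
    return match.groups() if match else None


if __name__ == "__main__":
    for cref in ("Matt 5:7", " 1Cor 13:4 ", "Matt 0:7", "see also 5:7"):
        loose = bool(LOOSE.fullmatch(cref))
        print(repr(cref), "loose:", loose, "strict:", parse_strict(cref))
```

The loose expression accepts all four strings, including ‘see also 5:7’, whereas the precise one rejects a chapter number of 0 and a ‘book’ that does not begin with a capital letter.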
Miscellaneous usages
Canonical reference pointers are intended for use by TEI encoders. However, this specification might be useful to the development of a process for recognizing canonical references in non-TEI documents (such as plain text documents), possibly as part of their conversion to TEI.
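As a rough illustration of that idea, the Python sketch below scans plain text for strings that look like book, chapter and verse references. Everything in it is an assumption made for the example: the function name find_references, the sample sentence, and the search expression, which is the precise expression given earlier with its anchors replaced by word boundaries so that it can match inside running text.

```python
import re

# Unanchored variant of the precise expression, for searching within text.
REFERENCE = re.compile(
    r"\b([1-9]?[A-Z][a-z]+)\s+([1-9][0-9]?[0-9]?):([1-9][0-9]?)\b"
)


def find_references(text):
    """Return a list of (book, chapter, verse) tuples found in text."""
    return REFERENCE.findall(text)


if __name__ == "__main__":
    sample = "Compare Matt 5:7 with Luke 6:36."
    print(find_references(sample))
    # [('Matt', '5', '7'), ('Luke', '6', '36')]
```

A pattern this simple will also pick up look-alikes such as a capitalized word followed by a time of day, so a real conversion process would probably check each candidate book name against a list of known abbreviations.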