21 Certainty, Precision, and Responsibility


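Encoders of text often find it useful to indicate that some aspects of the encoded text are problematic or uncertain, and to indicate who is responsible for various aspects of the markup of the electronic text. These Guidelines provide several methods of recording uncertainty about the text or its markup:
  • the note element, used to attach a free-text comment to the doubtful passage
  • the certainty element, used to make a structured statement of uncertainty about some aspect of the markup
  • the precision element, used to state how precise a given value is
  • the alt element, used to link alternative encodings of the same passage
There are three methods of indicating responsibility for different aspects of the electronic text:
  • the respStmt element in the header
  • the note element
  • the respons element, used for element-by-element attribution
No special steps are needed to use the note and respStmt elements, since they are defined in the core module and header respectively. The alt element is only available when the module for linking has been selected, as described in chapter 16 Linking, Segmentation, and Alignment. To use the certainty, precision or respons elements, the module for certainty and responsibility must be selected.

The certainty, precision, and respons elements are all members of an attribute class called att.scoping, from which they inherit the following attributes:
  • target points to the element or elements to which the statement applies
  • match supplies an XPath expression identifying the nodes concerned, evaluated in the context given by target or, if no target is supplied, in the context of the parent element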

These attributes enable statements about certainty, precision, or responsibility to be made with respect to the whole of a document, or any part or parts of it which can be identified using standard XML location methods. Several examples are given in the discussion of the certainty element below; the same mechanisms are available for all three elements discussed in this chapter.

21.1 Levels of Certainty

Many types of uncertainty may be distinguished. The certainty element is designed to encode the following sorts:
  • a given tag may or may not correctly apply (e.g. a given word may be a personal name, or perhaps not)
  • the precise point at which an element begins or ends is uncertain
  • the value given for an attribute is uncertain
  • the content given for an element is unreliable for any reason.
The following types of uncertainty are not indicated with the certainty element:

21.1.1 Using Notes to Record Uncertainty

The simplest way of recording uncertainty about markup is to attach a note to the element or location about which one is unsure. In the following (invented) paragraph, for example, an encoder might be uncertain whether to mark ‘Essex’ as a place name or a personal name, since both might be plausible in the given context:

Elizabeth went to Essex. She had always liked Essex.

Using note, the uncertainty here may be recorded quite simply:
<persName>Elizabeth</persName> went to <placeName>Essex</placeName>. She had always liked <placeName>Essex</placeName>.
<note type="certainty" resp="#MSM">It is not
clear here whether <mentioned>Essex</mentioned>
refers to the place or to the nobleman. -MSM</note>
Using the normal mechanisms, the note may be associated unambiguously with specific elements of the text, thus:
<persName>Elizabeth</persName> went to <placeName xml:id="CE-p1a">Essex</placeName>.
She had always liked <placeName xml:id="CE-p1b">Essex</placeName>.
<note type="certainty" resp="#MSM" target="#CE-p1a #CE-p1b">It
is not clear here whether <mentioned>Essex</mentioned>
refers to the place or to the nobleman. If the latter,
it should be tagged as a personal name. -<name xml:id="MSM">Michael</name>
</note>

The advantage of this technique is its relative simplicity. Its disadvantage is that the nature and degree of uncertainty are not conveyed in any systematic way and thus are not susceptible to any sort of automatic processing.

21.1.2 Structured Indications of Uncertainty

To record uncertainty in a more structured way, susceptible of at least simple automatic processing, the certainty element may be used:
  • certainty indicates the degree of certainty or uncertainty associated with some aspect of the text markup.
    locus indicates the precise aspect of the markup about which uncertainty is expressed: whether the element applies, the exact position of the start-tag or end-tag, the value of a specific attribute, etc.
    degree indicates the degree of confidence assigned to the aspect of the markup named by the locus attribute.
Returning to the example, the certainty element may be used to record doubts about the proper encoding of ‘Essex’ in several ways of varying precision. To record merely that we are not certain that ‘Essex’ is in fact a place name, as it is tagged, we use the target attribute to identify the element in question, and the locus attribute to indicate which aspect of the markup we are uncertain about (in this case, whether we have used the correct ‘name’ for the element used to mark it):
Elizabeth went to
<placeName xml:id="CE-pl1">Essex</placeName>.

<!-- ... elsewhere in the document ... -->
<certainty target="#CE-pl1" locus="name">
 <desc>possibly not a placename</desc>
</certainty>
There are no particular constraints as to where the certainty element is placed in a document; it may be placed adjacent to the target element, or elsewhere in the same or another document. Its position is, however, significant when the target attribute is not specified, as discussed further below.
We may wish to record the probability, assessed in some subjective way, that ‘Essex’ really is a place name here. The degree attribute is used to indicate the degree of confidence associated with the certainty element, expressed as a number between 0 and 1:

<!-- ... --><certainty target="#CE-pl1" locus="name" degree="0.6"/>
This expresses the point of view that there is a 60 percent chance of ‘Essex’ being a place name here, and hence a 40 percent chance of its being a personal name. We can use two certainty elements to indicate the two probabilities independently. Both elements indicate the same location in the text, but the second provides an alternative choice of name identifier (in this case persName), which is given as the value of the assertedValue attribute:

<!-- ... --><certainty target="#CE-pl1" locus="name" degree="0.6">
 <desc>probably a placename, but possibly not</desc>
</certainty>
<certainty
  target="#CE-pl1"
  locus="name"
  degree="0.4"
  assertedValue="persName">

 <desc>may refer to the Earl of Essex</desc>
</certainty>
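The pair of certainty elements above lends itself to simple automatic processing. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, a sample without the TEI namespace, and same-document pointers of the form #id only; the function name name_alternatives is hypothetical and not part of any TEI tool.

```python
import xml.etree.ElementTree as ET

# Expanded name of the xml:id attribute as reported by ElementTree.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Sample adapted from the example above (no TEI namespace, for brevity).
SAMPLE = """
<p>Elizabeth went to <placeName xml:id="CE-pl1">Essex</placeName>.
  <certainty target="#CE-pl1" locus="name" degree="0.6"/>
  <certainty target="#CE-pl1" locus="name" degree="0.4" assertedValue="persName"/>
</p>
"""

def name_alternatives(root, xml_id):
    """Map each candidate element name for the target to its stated degree."""
    target = next(el for el in root.iter() if el.get(XML_ID) == xml_id)
    alternatives = {}
    for cert in root.iter("certainty"):
        if cert.get("target") != "#" + xml_id or cert.get("locus") != "name":
            continue
        # Without assertedValue, the statement concerns the name actually used.
        name = cert.get("assertedValue", target.tag)
        alternatives[name] = float(cert.get("degree"))
    return alternatives

root = ET.fromstring(SAMPLE)
print(name_alternatives(root, "CE-pl1"))
# {'placeName': 0.6, 'persName': 0.4}
```

A real TEI document would normally place these elements in the TEI namespace, in which case the element names passed to iter() would need to be namespace-qualified.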
In the simplest case, it is also possible to place the certainty element within the element concerned:
Elizabeth went to
<placeName>Essex
<certainty locus="name" degree="0.6"/>
</placeName>.
When no target is specified, by default the proposed certainty applies to its parent element, in this case the placeName element. The match attribute discussed below may be used to further vary this behaviour.
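The defaulting rule just described can be sketched in a few lines. This is an illustration only, assuming Python's standard xml.etree.ElementTree module (which keeps no parent pointers, hence the explicit parent map), a sample without the TEI namespace, and target values made up of same-document #id pointers; the function name certainty_subjects is hypothetical.

```python
import xml.etree.ElementTree as ET

# Expanded name of the xml:id attribute as reported by ElementTree.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """
<p>Elizabeth went to <placeName>Essex<certainty locus="name" degree="0.6"/></placeName>.
  She had always liked <placeName xml:id="CE-pl2">Essex</placeName>.
  <certainty target="#CE-pl2" locus="name" degree="0.6"/>
</p>
"""

def certainty_subjects(root):
    """Pair every certainty element with the element(s) it makes a claim about."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    subjects = []
    for cert in root.iter("certainty"):
        target = cert.get("target")
        if target is None:
            # No target: the statement applies to the parent element.
            subjects.append((cert, [parent_of[cert]]))
        else:
            # target may hold several whitespace-separated pointers.
            subjects.append((cert, [by_id[p.lstrip("#")] for p in target.split()]))
    return subjects

root = ET.fromstring(SAMPLE)
for cert, elements in certainty_subjects(root):
    print(cert.get("degree"), [el.tag for el in elements])
```

The sketch ignores the match attribute, which further narrows or redirects the selection.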
21.1.2.1 Contingent conditions
Finally, we may wish to make our probability estimates contingent on some condition. In the passage ‘Elizabeth went to Essex; she had always liked Essex,’ for example, we may feel there is a 60 percent chance that the county is meant, and a 40 percent chance that the earl is meant. But the two occurrences of the word are not independent: there is (we may feel) no chance at all that the first occurrence refers to the county and the second to the earl. We can express this by using the given attribute to list the identifiers of certainty elements.
Elizabeth went to <placeName xml:id="CE-PL1">Essex</placeName>.
She had always liked <placeName xml:id="CE-PL2">Essex</placeName>.

<!-- ... -->
<!-- 60% chance that CE-PL1 is a placename, 40% chance a personal name. -->
<certainty
  xml:id="cert-1"
  target="#CE-PL1"
  locus="name"
  degree="0.6">

 <desc>probably a placename, but possibly not</desc>
</certainty>
<certainty
  xml:id="cert-2"
  target="#CE-PL1"
  locus="name"
  assertedValue="persName"
  degree="0.4">

 <desc>may refer to the Earl of Essex</desc>
</certainty>
<!-- 60% chance that CE-PL2 is a placename, 40% chance a personal name. 100% chance that it agrees with CE-PL1. -->
<certainty
  target="#CE-PL2"
  locus="name"
  given="#cert-1"
  degree="1.0">

 <desc>if CE-PL1 is a placename, CE-PL2 certainly is</desc>
</certainty>
<certainty
  target="#CE-PL2"
  locus="name"
  assertedValue="persName"
  degree="1.0"
  given="#cert-2">

 <desc>if CE-PL1 is a personal name, then so is CE-PL2</desc>
</certainty>
When given conditions are listed, the certainty element is interpreted as claiming a given degree of confidence in a particular markup given the assertional content of the certainty elements indicated. That is, a conjectural assertion is being made solely on the assumption that the interpretation indicated by the element named by the given attribute is actually correct.
Conditional confidence may be less than 100 percent: given the sentence ‘Ernest went to old Saybrook’, we may interpret ‘Saybrook’ as a personal name or a place name, assigning a 60 percent probability to the former. If it is a place name, there may be a 50 percent chance that the place name actually in question is ‘Old Saybrook’ rather than ‘Saybrook’, while if it is correctly tagged as a personal name, it is much more likely (say, 90 percent certain) that the name is ‘Saybrook’. Hence there is uncertainty about the correct location for the markup as well as about which markup to use. This state of affairs can be expressed using the certainty element thus:
Ernest went to <anchor xml:id="CE-a1"/> old <persName xml:id="CE-p2">Saybrook</persName>.

<certainty
  xml:id="cert1"
  target="#CE-p2"
  locus="name"
  degree="0.6"/>

<certainty
  target="#CE-p2"
  locus="start"
  given="#cert1"
  degree="0.9"/>

<certainty
  xml:id="cert2"
  target="#CE-p2"
  locus="name"
  assertedValue="placeName"
  degree="0.4"/>

<certainty
  target="#CE-p2"
  locus="start"
  given="#cert2"
  degree="0.5"/>

<certainty
  xml:id="cert3"
  target="#CE-p2"
  locus="start"
  assertedValue="#CE-a1"
  given="#cert1"
  degree="0.1"/>

<certainty
  xml:id="cert4"
  target="#CE-p2"
  locus="start"
  assertedValue="#CE-a1"
  given="#cert2"
  degree="0.5"/>
Note the use of the assertedValue attribute on certainty elements cert3 and cert4 to reference the anchor element placed at the alternative starting point for the element.
Multiplying the numeric values out, this markup may be interpreted as assigning specific probabilities to three different ways of marking up the sentence:
Ernest went to old <persName>Saybrook</persName>. (0.6 * 0.9, or 0.54)
Ernest went to old <placeName>Saybrook</placeName>. (0.4 * 0.5, or 0.20)
Ernest went to <placeName>old Saybrook</placeName>. (0.4 * 0.5, or 0.20)
The probabilities do not add up to 1.00 because the markup indicates that if ‘Saybrook’ is (part of) a personal name, there is a 10 percent likelihood that the element should start somewhere other than the place indicated, without however giving an alternative location; there is thus a 6 percent chance (0.1 × 0.6) that none of the alternatives given is correct.
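The arithmetic behind these figures can be checked mechanically. The sketch below is only an illustration: it copies the degree values from the certainty elements above into hypothetical variable names and uses Python's decimal module so that the products come out exactly.

```python
from decimal import Decimal

# Degrees copied from the certainty elements in the Saybrook example.
P_PERS = Decimal("0.6")                 # cert1: the name is persName
P_PLACE = Decimal("0.4")                # cert2: the name is placeName
P_START_GIVEN_PERS = Decimal("0.9")     # start as marked, given cert1
P_START_GIVEN_PLACE = Decimal("0.5")    # start as marked, given cert2
P_ANCHOR_GIVEN_PLACE = Decimal("0.5")   # cert4: start at the anchor, given cert2

# Joint probability of each complete reading: P(name) * P(start | name).
readings = {
    "<persName>Saybrook</persName>": P_PERS * P_START_GIVEN_PERS,
    "<placeName>Saybrook</placeName>": P_PLACE * P_START_GIVEN_PLACE,
    "<placeName>old Saybrook</placeName>": P_PLACE * P_ANCHOR_GIVEN_PLACE,
}

total = sum(readings.values())
unaccounted = 1 - total

for markup, probability in readings.items():
    print(markup, probability)
print("total:", total)              # 0.94
print("unaccounted:", unaccounted)  # 0.06
```

The unaccounted remainder equals 0.1 × 0.6, the case in which the element is a personal name but does not start at the place indicated.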
21.1.2.2 Pervasive conditions
We may also wish to indicate confidence in some aspect of the tagging throughout a document, rather than (as discussed so far) in one particular instance. The match attribute may be used to supply a pattern identifying the portion of a document concerning which certainty is being expressed. The value of the match attribute is an XPath expression using the syntax defined in Kay (ed.) (2007). In the following example, we wish to indicate a low degree of confidence that the persName elements used throughout the whole document have been correctly applied:
<certainty locus="name" degree="0.3" match="//persName"/>
No target has been supplied here, so by default the certainty expressed would apply to the parent element. However, in this case the XPath supplied as the value for match returns a set of all the persName elements in the document, independent of the current context. By contrast, in the following example
<div>
 <p>.....</p>
</div>
<div>
 <certainty locus="name" degree="0.3" match=".//persName"/>
</div>
only the persName elements within the second div element are in question. Similarly, we may indicate that we have more confidence in the persName tagging within those div elements which have a type value of checked:
<certainty locus="name" degree="0.7" match="//div[@type='checked']//persName"/>
If an element in a document is matched by more than one match expression, then the most specific pattern applies (see note 79). As a simple case, if both the preceding certainty elements were present in the same document, a persName occurring within a <div type="checked"> element would potentially match both pattern expressions. However, because the second pattern is more specific than the first, it is the only one that applies. If multiple patterns match and have the same priority, then the first one (in document order) is applied. Only those statements of certainty which have matched in this sense are available for conditional application using the given attribute mentioned above.
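The difference between a document-wide pattern such as //persName and a context-relative one such as .//persName can be sketched as follows. This is a rough illustration, assuming Python's standard xml.etree.ElementTree module, which implements only a small subset of XPath and so cannot handle every match value; it also assumes no target attribute and a sample without the TEI namespace. It does not attempt the specificity rules referred to above. The function name matched_elements is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<body>
  <div><p><persName>Elizabeth</persName></p></div>
  <div>
    <p><persName>Essex</persName></p>
    <certainty locus="name" degree="0.3" match=".//persName"/>
  </div>
</body>
"""

def matched_elements(root, cert):
    """Return the elements selected by the match pattern of a certainty element."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    pattern = cert.get("match")
    if pattern.startswith("//"):
        # Document-wide pattern: ElementTree wants it relative to the root.
        # (This does not select the root element itself.)
        return root.findall("." + pattern)
    # Relative pattern: evaluated in the context of the parent element.
    return parent_of[cert].findall(pattern)

root = ET.fromstring(SAMPLE)
cert = root.find(".//certainty")
print([el.text for el in matched_elements(root, cert)])  # ['Essex']
```

Changing the match value to //persName would select both names, since the pattern no longer depends on where the certainty element sits.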
When the match attribute is processed, the namespace bindings in force are those in effect at that point in the document. For example,
<div>
<!-- ... -->
 <certainty match=".//my:*" locus="value" degree="0.9"/>
</div>
might be used to indicate a high degree of certainty about the content of any elements taken from the namespace associated with the prefix my. This namespace prefix must be associated with an appropriate namespace definition, either on the certainty element itself, or on one of its ancestor elements.
21.1.2.3 Content uncertainty
Doubts about whether the content of an element is correct may also be expressed by assigning to locus the value value. For example, if the source is hard to read and so the transcription is uncertain:
I have a <emph xml:id="CE-p3">bun</emph>.

<certainty target="#CE-p3" locus="value" degree="0.5"/>
Degrees of confidence in the proper expansion of abbreviations may also be expressed, as in the following example:
You will want to use
<choice>
 <expan xml:id="CE-e1">Standard
   Generalized Markup Language</expan>
 <expan xml:id="CE-e40">Some Grandiose Methodology for Losers</expan>
 <abbr>SGML</abbr>
</choice> ...

<!-- ... -->
<certainty target="#CE-e1" locus="value" degree="0.9"/>
<certainty target="#CE-e40" locus="value" degree="0.5"/>
The assertedValue attribute should be used to provide an alternative value for whatever aspect of the markup is in doubt: an alternative name, or the identifier of an alternative starting or ending point, as already shown, an alternative attribute value, or alternative element content, as in this example:
I have a <emph xml:id="CE-P3">bun</emph>.

<certainty
  target="#CE-P3"
  locus="value"
  assertedValue="gun"
  degree="0.8">

 <desc>a gun makes more sense in a holdup</desc>
</certainty>
Since attribute values have no internal substructure, the assertedValue attribute is not generally useful for specifying alternative transcriptions; it cannot, for example, be used if the alternative reading contains markup of any kind. More robust methods of handling uncertainties of transcription are the unclear element and the app and rdg elements described in chapter 12 Critical Apparatus. The certainty element allows for indications of uncertainty to be structured with at least as much detail and clarity as appears to be currently required in most ongoing text projects.
21.1.2.4 Target or Match?

As noted in 16 Linking, Segmentation, and Alignment, the target attribute may take any general data.pointer as its value and may thus also contain an XPath expression of arbitrary complexity. Because full support for XPath is not provided by current processors, this is not generally recommended TEI practice. There are however some simple cases in which the simpler pointer syntax is to be preferred, notably those in which the xml:id attribute is used to identify a single element occurrence. The usage #A (to indicate the element whose xml:id attribute has the value A) is syntactically much simpler than the equivalent XPath 2.0 expression //*[@xml:id='A'] and is hence preferred throughout these Guidelines.

For similar reasons, the certainty element may specify both a target value (expressed as a URI) and a match value (expressed as an XPath). The former defines the context within which the latter is to be evaluated. As previously noted, if no value is supplied for target, the context within which the value of match should be evaluated is the parent element of the certainty element itself.

A typical case where it may be convenient to specify both target and match is that where we wish to indicate that the value of an attribute on some specific element is uncertain. In this case, the locus attribute takes the value value. For example, supposing there is only a 50 percent chance that the question was spoken by participant A:
<u xml:id="CE-u1" who="#A">Have you heard the election results?</u>
<certainty
  target="#CE-u1"
  match="@who"
  locus="value"
  degree="0.5"/>
or, equivalently and without the need to define a target,
<u who="#A">Have you heard the election results?<certainty match="@who" locus="value" degree="0.5"/>
</u>
The match and target attributes together provide a powerful mechanism which can be used to indicate precision for a large number of assertions throughout an encoded document in an economical way. Some further examples follow:
<certainty match="//p" locus="location" degree="0.2"/>
This encoding indicates that there is only a 0.2 certainty that the boundaries of all p elements in the document have been correctly identified.
<certainty
  target="#a101"
  match="p"
  locus="location"
  degree="0.2"/>
This encoding indicates that there is only a 0.2 certainty that the boundaries of the p elements contained by the element with xml:id value a101 have been correctly identified.
<persName resp="#LB">Essex
<certainty match="@resp" locus="value" degree="0.2"/>
</persName>
This encoding indicates that there is only a 0.2 certainty that the value for the resp attribute on the given persName element is correct.
<certainty match="//*/@resp" locus="value" degree="0.2"/>
This encoding indicates that there is only a 0.2 certainty that any value for the resp attribute is correct, wherever it appears in the document.
<certainty
  target="#dd001"
  match="@resp"
  locus="value"
  degree="0.2"/>
This encoding indicates that there is only a 0.2 certainty that the value for the resp attribute of the element indicated by the pointer #dd001 is correct.
<certainty match="//*[@resp='#LB']" locus="value" degree="0.2"/>
This encoding indicates that there is only a 0.2 certainty that the content of any element the resp attribute of which has the value #LB is correct, wherever it appears in the document.

The certainty element and the other TEI mechanisms for indicating uncertainty provide a range of methods of graduated complexity. Simple expressions of uncertainty may be made by using the note element. This is simple and convenient, and can accommodate either a discursive and unstructured indication of uncertainty, or a complex and structured but probably project-specific expression of uncertainty. In general, however, unless special steps are taken, the note element does not provide as much expressive power as the certainty element, and in cases where highly structured certainty information must be given, it is recommended that the certainty element be preferred.

21.2 Indications of Precision

As noted above, certainty about the accuracy of an encoding or its content is not the same thing as the precision with which a value is specified. In the case of a date or a quantity, for example, we might be certain that the value given is imprecise, or uncertain about whether or not the value given is correct. The latter possibility would be represented by the certainty element discussed in the previous section; the former by the precision element discussed in this section.

The elements concerning which statements of precision are to be made are identified using the same target and match attributes inherited from the att.scoping class discussed in the previous section and in the same way. Other aspects are provided by other attributes as further discussed below.
  • precision indicates the numerical accuracy or precision associated with some aspect of the text markup.
    degree indicates the degree of precision to be assigned as a value between 0 (none) and 1 (optimally precise)
    stdDeviation supplies a standard deviation associated with the value in question
In 3.5.3 Numbers and Measures several ways of indicating ranges of values were introduced. For example, if we know that a date falls between 1930 and 1935, without being certain exactly where, this fact may be encoded using attributes notBefore and notAfter, as in the following example:
<date notBefore="1930" notAfter="1935">Early in the 1930s</date>...
Equally, if we know that every page of a manuscript has a width of at least 10 cm but no more than 30, we can use the attributes atLeast and atMost, as in the following example:
<width
  atLeast="10"
  atMost="30"
  unit="cm"
  scope="all"/>

Suppose however that the precision with which the value of such an attribute can be specified is variable. For example, suppose an event is dated ‘about fifty years after the death of Augustus’. In this case, the precision of one end of the range (the death of Augustus) is higher than the other, assuming we know when Augustus died. We can say that the latest possible date is probably 50 years after that, but with less confidence than we can attach to the earliest possible date.

The precision element allows us to indicate the two attributes concerned and attach different degrees of precision to them, using the same mechanism as that provided for the certainty element:
<date xml:id="d001" notBefore="0014" notAfter="0064">About 50
years after the death of Augustus</date>
<precision target="#d001" match="@notAfter" degree="0.3"/>
<precision target="#d001" match="@notBefore" degree="0.9"/>
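A processor might gather such statements into a per-attribute table. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module, a sample without the TEI namespace, a single same-document #id pointer in target, and match values of the simple form @name only; the function name attribute_precision is hypothetical.

```python
import xml.etree.ElementTree as ET

# Expanded name of the xml:id attribute as reported by ElementTree.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """
<p><date xml:id="d001" notBefore="0014" notAfter="0064">About 50
years after the death of Augustus</date>
<precision target="#d001" match="@notAfter" degree="0.3"/>
<precision target="#d001" match="@notBefore" degree="0.9"/></p>
"""

def attribute_precision(root, xml_id):
    """Return {attribute name: (attribute value, degree)} for one element."""
    target = next(el for el in root.iter() if el.get(XML_ID) == xml_id)
    result = {}
    for prec in root.iter("precision"):
        match = prec.get("match", "")
        # Only the simple "@attribute" form of match is handled here.
        if prec.get("target") == "#" + xml_id and match.startswith("@"):
            attr = match[1:]
            result[attr] = (target.get(attr), float(prec.get("degree")))
    return result

root = ET.fromstring(SAMPLE)
print(attribute_precision(root, "d001"))
# {'notAfter': ('0064', 0.3), 'notBefore': ('0014', 0.9)}
```

Handling match by stripping the leading @ sidesteps the fact that ElementTree's XPath subset cannot select attribute nodes directly.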
In much the same way, we may wish to indicate different degrees of precision about the dating of either end of a historical period. For example, the elements defined for encoding personal data all bear a similar set of attributes to indicate normalized values for earliest or latest dates, etc. (see section 13.1.2 Dating Attributes); the precision of these attribute values may be indicated in exactly the same way. For example,
<residence from="1857-03-01" notAfter="1857-04-30">From the 1st of March to
some time in April of 1857.
<precision match="@notAfter" degree="0.5"/>
</residence>
It may also be useful to indicate that the precisions given for minimum and maximum quanta differ. For example, to indicate that all pages measure at least 10 cm wide, and at most about 30:
<width
  xml:id="w00t"
  atLeast="10"
  atMost="30"
  unit="cm"
  scope="all"/>

<precision target="#w00t" match="@atMost" degree="0.3"/>
The stdDeviation attribute may be used to indicate the standard deviation for a range of values. The generic dim element introduced in 10.3.4 Dimensions might be used to record the average number of characters per line in a typescript. If in addition we wish to record the standard deviation for the values summarised by that average, this would require an additional precision element, as in the following example:
<dim
  xml:id="dim1"
  type="avgLineLength"
  unit="chars"
  quantity="62.4"/>

<precision target="#dim1" stdDeviation="4"/>

21.3 Attribution of Responsibility

In general, attribution of responsibility for the transcription and markup of an electronic text is made by respStmt elements within the header: specifically, within the title statement, the edition statement(s), and the revision history.

In some cases, however, more detailed element-by-element information may be desired. For example, an encoder may wish to distinguish between the individuals responsible for transcribing the content and those responsible for determining that a given word or phrase constitutes a proper noun. Where such fine-grained attribution of responsibility is required, the respons element can be used.
  • respons (responsibility) identifies the individual(s) responsible for some aspect of the markup of one or more particular elements.
    locus indicates the specific aspect of the markup for which responsibility is being assigned.
    resp (responsible party) identifies the person or organization responsible for the aspect in question in the TEI document

This element allows one or more aspects of the markup to be attributed to a given individual. This element inherits the target and match attributes from the att.scoping class, in the same way as the certainty and precision elements. Its locus attribute functions in the same way as that on the certainty element (see 21.1 Levels of Certainty).

For example, the following encoding indicates that RC is responsible for transcribing an illegible word, and that PMWR is responsible for identifying that word as a proper noun, i.e. deciding to mark it with the persName element at the location indicated:
Ernest went to old
<persName xml:id="CE-p5" rend="it">Saybrook</persName>.

<!-- ... -->
<respons target="#CE-p5" locus="value" resp="#RC"/>
<respons target="#CE-p5" locus="name location" resp="#PMWR"/>
<list type="encoders">
 <item xml:id="PMWR"/>
 <item xml:id="RC"/>
</list>
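Because locus may carry several whitespace-separated values, a processor has to split it before tabulating who is responsible for what. The sketch below illustrates this, assuming Python's standard xml.etree.ElementTree module, a sample without the TEI namespace, and same-document #id pointers; it ignores any match attribute, and the function name responsibility is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<p>Ernest went to old <persName xml:id="CE-p5" rend="it">Saybrook</persName>.
  <respons target="#CE-p5" locus="value" resp="#RC"/>
  <respons target="#CE-p5" locus="name location" resp="#PMWR"/>
</p>
"""

def responsibility(root, xml_id):
    """Map each locus keyword to the resp pointers claimed for the target."""
    table = {}
    for resp in root.iter("respons"):
        # target may list several pointers; keep statements naming this one.
        if "#" + xml_id not in resp.get("target", "").split():
            continue
        for locus in resp.get("locus", "").split():
            table.setdefault(locus, []).extend(resp.get("resp", "").split())
    return table

root = ET.fromstring(SAMPLE)
print(responsibility(root, "CE-p5"))
# {'value': ['#RC'], 'name': ['#PMWR'], 'location': ['#PMWR']}
```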
Similarly, in the following example, we indicate that RC is responsible for proposing the value of the rend attribute:
<respons
  target="#CE-p5"
  match="@rend"
  locus="value"
  resp="#RC"/>

Some elements bear specialized resp or agent attributes, which have specific meanings that vary from element to element; the respons element should be reserved for the general aspects of responsibility common to all text transcription and markup, and should not be confused with the more specific attributes on individual elements.

21.4 The Certainty Module

The module described in this chapter makes available the following additional elements:
Module certainty: Degree of certainty and responsibility
The selection and combination of modules to form a TEI schema is described in 1.2 Defining a TEI Schema.


Notes
79. Specificity of pattern matching is defined further in the XSLT2 reference cited above (see http://www.w3.org/TR/xslt20/#conflict).




Copyright TEI Consortium 2007 Licensed under the GPL. Copying and redistribution is permitted and encouraged.
Version 1.7.0. Last updated on July 6th 2010. This page generated on 2010-07-07T22:07:13Z