11 Representation of Primary Sources

This chapter defines a module intended for use in the representation of primary sources, such as manuscripts or other written materials. Section 11.1 Digital Facsimiles provides elements for the encoding of digital facsimiles or images of such materials, while the remainder of the chapter discusses ways of encoding detailed transcriptions of such materials. This module may also be useful in the preparation of critical editions, but the module defined here is distinct from that defined in chapter 12 Critical Apparatus, and may be used independently of it. Detailed metadata relating to primary sources of any kind may be recorded using the elements defined by the manuscript description module discussed in chapter 10 Manuscript Description, but again the present module may be used independently if such data is not required.

Although this chapter discusses manuscript materials more frequently than other forms of written text, most of the recommendations presented are equally applicable mutatis mutandis to the encoding of printed matter or indeed any form of written source, including monumental inscriptions. Similarly, where in the following descriptions terms such as ‘scribe’, ‘author’, ‘editor’, ‘annotator’ or ‘corrector’ are used, these may be re-interpreted in terms more appropriate to the medium being transcribed. In printed material, for example, the ‘compositor’ plays a role analogous to the ‘scribe’, while in an authorial manuscript, the author and the scribe are the same person.

11.1 Digital Facsimiles

These Guidelines are mostly concerned with the preparation of digital texts in which pre-existing sources are transcribed or otherwise converted into character form, and marked up in XML. However, it is also very common practice to make a different form of ‘digital text’, which is instead composed of digital images of the original source, typically one per page or other written surface. We call such a resource a digital facsimile. A digital facsimile may, in the simplest case, just consist of a collection of images, with some metadata to identify them and the source materials portrayed. It may sometimes contain a variety of images of the same source pages, perhaps of different resolutions, or of different kinds. Such a collection may form part of any kind of document, for example a commentary of a codicological or palaeographic nature, where there is a need to align explanatory text with image data. It may also be complemented by a transcribed or encoded version of the original source, which may be linked to the page images. In this section we present elements designed to support these various possibilities and discuss the associated mechanisms provided by these Guidelines.

When this module is included in a schema, the class att.global is extended to include two new pointer attributes, facs and change:

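att.global.facs groups elements that relate to an image or written surface contained in a facsimile element.
facs (facsimile) points to an image, or part of an image, within a facsimile element that corresponds to this element.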
att.global.change supplies the change attribute, allowing its member elements to specify one or more states or revision campaigns with which they are associated.
change points to one or more change elements documenting a state or revision campaign to which the element bearing this attribute and its children have been assigned by the encoder.

The change attribute is discussed further below in section 11.7 Changes. The facs attribute is used to associate any element in a transcription with an image of the corresponding part of the source, by means of the usual URI pointing mechanism.

In the simple case where a digital text is composed of page images, the facs attribute on the pb element may be used to associate each image with an appropriate point in the text:
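A minimal sketch of this approach, assuming hypothetical image files named page1.png and page2.png and placeholder paragraph content:

```xml
<text>
 <body>
  <!-- page1.png is assumed to show the text that follows, up to the next pb -->
  <pb facs="page1.png"/>
  <p>Transcription of the first page goes here.</p>
  <!-- page2.png is assumed to show the text from here onwards -->
  <pb facs="page2.png"/>
  <p>Transcription of the second page goes here.</p>
 </body>
</text>
```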

By convention, this encoding indicates that the image indicated by the facs attribute represents the whole of the text following the pb (pagebreak) element, up to the next pb element. Any convenient milestone element (see further 3.10.3 Milestone Elements) could be used in the same way; for example if the images represent individual columns, the cb element might be used. Though simple, this method has some drawbacks. It does not scale well to more complex cases where, for example, the images do not correspond exactly with transcribed pages, or where the intention is to align specific marked up elements with detailed images, or parts of images. The management of information about the images may become more difficult if references to them are scattered through many files rather than being concentrated in a single identifiable location. Nevertheless, this solution may be adequate for many straightforward ‘digital library’ applications.

The recommended approach to encoding facsimiles is instead to use the facs attribute in conjunction with the elements facsimile or sourceDoc, and the elements surface, surfaceGrp, and zone, which are also provided by this module. These elements make it possible to accommodate multiple images of each page, as well as to record the position and relative size of elements identified on any kind of written surface and to link such elements with digital facsimile images of them. Typical applications include the provision of full text search in ‘digital facsimile editions’, and ways of annotating graphics, for example so as to identify individuals appearing in group portraits and link them to data about the people represented.

The following elements are available to represent components of a digital facsimile:

facsimile contains a representation of a written source in the form of image data rather than as transcribed or encoded text.
sourceDoc contains a transcription or other representation of a single source document potentially forming part of a dossier génétique or collection of sources.
surface defines a written surface in terms of rectangular coordinates, optionally grouping one or more graphic representations of that space or of rectangular zones within it.
surfaceGrp defines any kind of useful grouping of written surfaces, for example the recto and verso of a single leaf, which the encoder wishes to treat as a single unit.
zone defines a rectangular area on the surface defined by a surface element.
points identifies a two dimensional area within the bounding box specified by the other attributes by means of a series of pairs of numbers, each of which gives the x,y coordinates of a point on a line enclosing the area.

Either of the facsimile or sourceDoc elements may be used to represent a digital facsimile. Either may appear within a TEI document along with, or instead of, the text element introduced in section 4 Default Text Structure. The facsimile element is designed for the case where the digital facsimile contains only images, whereas the sourceDoc element is for use in the case where such images are complemented by a documentary transcription. In this section, we first discuss the simpler case, returning to the use of the sourceDoc element in section 11.2 Combining Transcription with Facsimile below. When this module is selected, therefore, a legal TEI document may comprise any of the following:

a TEI Header and a text element
a TEI Header and a facsimile element
a TEI Header and a sourceDoc element
a TEI Header, a facsimile element, and a text element
a TEI Header, one or more sourceDoc or facsimile elements, and a text element
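As an illustration of the fourth combination, the following skeleton sketches a document containing a header, a facsimile, and a text. The header content is omitted, and the image file name and paragraph content are placeholders rather than part of any real edition:

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
  <!-- metadata describing the source and its images (omitted in this sketch) -->
 </teiHeader>
 <facsimile>
  <!-- hypothetical page image -->
  <graphic url="page1.png"/>
 </facsimile>
 <text>
  <body>
   <p>Transcription goes here.</p>
  </body>
 </text>
</TEI>
```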

Like the text element, a facsimile element may also contain an optional front or back element, used in the same way as described in sections 4.5 Front Matter and 4.7 Back Matter.

In the simplest case, a facsimile just contains a series of graphic elements, each of which identifies an image file:
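A minimal sketch of such a facsimile, assuming four hypothetical page images:

```xml
<facsimile>
 <!-- one graphic per page, listed in reading order -->
 <graphic url="page1.png"/>
 <graphic url="page2.png"/>
 <graphic url="page3.png"/>
 <graphic url="page4.png"/>
</facsimile>
```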

If desired, the binaryObject element described in 3.9 Graphics and other non-textual components (or any other element from the model.graphicLike class) can be used instead of a graphic.

In this simple case, the four page images are understood to represent the complete facsimile, and are to be read in the sequence given. Suppose, however, that the second page of this particular work is available both as an ordinary photograph and as an infra-red image, or in two different resolutions. The surface element may be used to group the two image files, since these correspond with the same area of the work:
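Such a grouping might be sketched as follows, assuming hypothetical file names for the two versions of the second page:

```xml
<facsimile>
 <graphic url="page1.png"/>
 <!-- both images show the same surface, the second page -->
 <surface>
  <graphic url="page2-highRes.png"/>
  <graphic url="page2-lowRes.png"/>
 </surface>
 <graphic url="page3.png"/>
 <graphic url="page4.png"/>
</facsimile>
```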

The surface element provides a way of indicating that the two images of page2 represent the same surface within the source material. A surface might be one side of a piece of paper or parchment, an opening in a codex treated as a single surface by the writer, a face of a monument, a billboard, a membrane of a scroll, or indeed any two-dimensional surface, of any size.

The surfaceGrp element may be used to indicate that two (or more) surfaces are associated in some way, for example because they represent the recto and verso of the same leaf, as in this example:
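A sketch of one possible encoding, in which the file names and the values of the n attribute are illustrative assumptions:

```xml
<facsimile>
 <!-- recto and verso of the same leaf treated as a single unit -->
 <surfaceGrp n="leaf1">
  <surface n="recto">
   <graphic url="page1.png"/>
  </surface>
  <surface n="verso">
   <graphic url="page2-highRes.png"/>
   <graphic url="page2-lowRes.png"/>
  </surface>
 </surfaceGrp>
 <graphic url="page3.png"/>
 <graphic url="page4.png"/>
</facsimile>
```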

The surfaceGrp element may also be useful as a means of identifying other groups of written surfaces, such as adjacent faces of a monument, or gatherings of leaves.

Simply grouping related graphics is not however the main purpose of the surface element: rather it is to help identify the location and size of the various two-dimensional spaces constituting the digital facsimile. Note that the actual dimensions of the object represented are not provided by the surface element; rather, the surface element defines an abstract coordinate space which may be used to address parts of the image. Four attributes supplied by the att.coordinated class are used to define this space.

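att.coordinated elements which can be positioned within a two dimensional coordinate system.

ulx	gives the x coordinate value for the upper left corner of a rectangular space.
uly	gives the y coordinate value for the upper left corner of a rectangular space.
lrx	gives the x coordinate value for the lower right corner of a rectangular space.
lry	gives the y coordinate value for the lower right corner of a rectangular space.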

By default, the same coordinate space is used for a surface and for all of its child elements.³⁵ It may be most convenient to derive a coordinate space from a digital image of the surface in question such that each pixel in the image corresponds with a whole number of units (typically 1) in the coordinate space. In other cases it may be more convenient to use units such as millimetres. Neither practice implies any specific mapping between the coordinate system used and the actual dimensions of the physical object represented.

A surface element can contain one or more zone elements, each of which represents a region or bounding box defined in terms of the same coordinate space as that of its parent surface element. A zone may be rectangular or non-rectangular: a rectangular zone is defined by a sequence of four coordinates in the same way as a surface; a non-rectangular zone is defined using the attribute points, which provides a sequence of coordinates, each of which specifies a point on the perimeter of the zone.³⁶

A zone may be used to define any region of interest, such as a detail or illustration, or some part of the surface which is to be aligned with a particular text element, or otherwise distinguished from the rest of the surface. A surface establishes a coordinate system which may be used to address parts or the whole of some digital representation of a written surface. A zone, by contrast, defines any arbitrary area of interest relative to that surface, using the same coordinate system. It might be bigger or smaller than its parent surface, or might overlap its boundaries. The only constraint is that it must be defined using the same coordinate system.

When an image of some kind is supplied within either a zone or a surface, the implication is that the image represents the whole of the zone or surface concerned. In the simple case therefore, we might imagine a surface defining a page, within which there is a graphic representing the whole of that page, and a number of zones defining parts of the page, each with its own graphic, each representing a part of the page. If however one of those graphics actually represents an area larger than the page (for example to include a binding or the surface of a desk on which the page rests), then it will be enclosed by a zone with coordinates larger than those of the parent surface.
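The situation just described might be sketched as follows. All file names and coordinates here are illustrative assumptions: the page occupies a space of 200 by 300 units, one zone holds a detail image, and another zone extends beyond the page on every side to accommodate an image that also shows the binding.

```xml
<surface ulx="0" uly="0" lrx="200" lry="300">
 <!-- image of the whole page -->
 <graphic url="page.png"/>
 <!-- zone within the page, with an image of that part only -->
 <zone ulx="50" uly="100" lrx="150" lry="200">
  <graphic url="page-detail.png"/>
 </zone>
 <!-- zone larger than the page: its image also shows the binding -->
 <zone ulx="-20" uly="-20" lrx="220" lry="320">
  <graphic url="page-with-binding.png"/>
 </zone>
</surface>
```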

For example, consider the following figure:

Figure 11.1. Badische Landesbibliothek, Manuscript Durlach 1, Fols. 95v-96r

This is an image of a two page spread from a manuscript in the Badische Landesbibliothek, Karlsruhe. We have no information as to the dimensions of the original object, but the low resolution image displayed here contains 500 pixels horizontally and 321 pixels vertically. For convenience, we might map each pixel to one cell of the coordinate space.³⁷

We therefore define a surface element corresponding with the area of the image which represents the whole of the two page spread and embed the graphic within it:
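A minimal sketch of this surface, using the pixel dimensions given above; the image file name is a hypothetical placeholder:

```xml
<facsimile>
 <!-- one unit of the coordinate space per pixel: 500 wide, 321 high -->
 <surface ulx="0" uly="0" lrx="500" lry="321">
  <graphic url="durlach1-95v-96r.jpg"/>
 </surface>
</facsimile>
```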

If desired, the binaryObject element described in 3.9 Graphics and other non-textual components (or any other element from the model.graphicLike class) may be used instead of a graphic element.

Since the image in this example is of a two page opening, we will probably wish to define at least two nested zones, one for each page:
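One way of doing so is sketched below. The zone coordinates are rough illustrative values, not measurements taken from the image, and the file name is a hypothetical placeholder:

```xml
<facsimile>
 <surface ulx="0" uly="0" lrx="500" lry="321">
  <graphic url="durlach1-95v-96r.jpg"/>
  <!-- left-hand page of the opening (approximate) -->
  <zone ulx="25" uly="25" lrx="245" lry="300"/>
  <!-- right-hand page of the opening (approximate) -->
  <zone ulx="255" uly="25" lrx="475" lry="300"/>
 </surface>
</facsimile>
```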

As this example shows, in addition to acting as a container for graphic elements, zone elements may be used to identify parts of a surface for analytical purposes.

The relationship between zone and surface can be quite complex: for example, it may be appropriate to treat the whole of a two page spread as a single written surface, perhaps because particular written zones span both pages. A zone may contain a nested surface, if for example a page has an additional scrap of paper attached to it. A zone may be of any shape, not simply rectangular. Discussion of these and other cases is provided in section 11.4 Advanced uses of Surface and Zone below.

In the following extended example, we discuss a hypothetical digital edition of an early 16th century French work, Charles de Bovelles' Géometrie Pratique.³⁸ In this edition, each page has been digitized as a separate file: for example, recto page 49 is stored in a file called Bovelles-49r.png. In the facsimile element used to contain the whole set of pages, we define a surface element for this page, which we situate within a coordinate scale running from 0 to 200 in the x (horizontal) axis, and 0 to 300 in the y (vertical) axis. The surface element contains a graphic element which represents the whole of this surface:
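A sketch of this surface, using the file name and coordinate ranges just described:

```xml
<facsimile>
 <!-- x runs from 0 to 200, y from 0 to 300 -->
 <surface ulx="0" uly="0" lrx="200" lry="300">
  <graphic url="Bovelles-49r.png"/>
 </surface>
</facsimile>
```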

We can now identify distinct zones within the page image using the coordinate scale defined for the surface. Figure 11.2, Detail of p 49r from Bovelles Géometrie Pratique, shows the upper part of the page, with boxes indicating four such zones. Each of these will be represented by a zone element, given within the surface element already defined, and specified in terms of the same coordinate system. The zones of interest are outlined in red in the image.

Figure 11.2. Detail of p 49r from Bovelles Géometrie Pratique

The following encoding defines each of the four zones identified in the figure above.
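A sketch of such an encoding follows. The coordinates of the paragraph, figure, and word zones match those used in the embedded transcription in section 11.2.2; the coordinates of the heading zone are an illustrative assumption.

```xml
<surface ulx="0" uly="0" lrx="200" lry="300">
 <graphic url="Bovelles-49r.png"/>
 <!-- the chapter heading (coordinates assumed for illustration) -->
 <zone ulx="25" uly="25" lrx="180" lry="60"/>
 <!-- the paragraph in italics -->
 <zone ulx="28" uly="75" lrx="175" lry="178"/>
 <!-- the figure -->
 <zone ulx="105" uly="76" lrx="175" lry="160"/>
 <!-- the word "pendans" -->
 <zone ulx="45" uly="125" lrx="60" lry="130"/>
</surface>
```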

Note that the location of each zone is defined independently but using the same coordinate system.

A non-rectangular zone, for example that containing the word cloche. at bottom left of the page, could also be defined, using the points attribute:
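A sketch of such a zone, in which each pair of numbers gives the x,y coordinates of one point on the outline. The values are illustrative assumptions expressed in the coordinate space of the surface, not measurements taken from the page:

```xml
<surface ulx="0" uly="0" lrx="200" lry="300">
 <graphic url="Bovelles-49r.png"/>
 <!-- four-sided, slightly skewed outline around the word (assumed values) -->
 <zone points="18,260 45,258 47,268 20,270"/>
</surface>
```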

In this example a single graphic element has been associated directly with the surface of the page rather than nesting it within a zone. However, it is also possible to include multiple zone elements which contain a graphic element, if for example a detailed image is available. Since all zone elements use the same coordinate system (that defined by their parent surface), there is no need to demonstrate enclosure of one zone within another by means of nesting. To continue the current example, supposing that we have an additional image called Bovelles49r-detail.png containing an additional image of the figure in the third zone above, we might encode that zone as follows:
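A sketch of this zone, using the coordinates given for the figure in the embedded transcription in section 11.2.2:

```xml
<surface ulx="0" uly="0" lrx="200" lry="300">
 <graphic url="Bovelles-49r.png"/>
 <!-- the detail image represents the whole of this zone -->
 <zone ulx="105" uly="76" lrx="175" lry="160">
  <graphic url="Bovelles49r-detail.png"/>
 </zone>
</surface>
```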

11.2 Combining Transcription with Facsimile

A digitized source document may contain nothing more than page images and a small amount of metadata. It may also contain an encoded transcription of the pages represented, which may either be ‘embedded’ within a sourceDoc element, or supplied in parallel with a facsimile as defined above.

If the transcription is regarded as a text in its own right, organized and structured independently of its physical realization in the document or documents represented by the facsimile, then the recommended practice is to use the text element to contain such a structured representation, and to present it in parallel. The text element is a sibling of the facsimile and sourceDoc elements. This approach is illustrated in section 11.2.1 Parallel Transcription below. Alternatively, if the transcription is intended not to prioritize representation of the final text so much as the process by which the document came to take its present form, or the physical disposition of its component parts, it may be preferable to present it as an embedding transcription, as further described in section 11.2.2 Embedded Transcription below.

11.2.1 Parallel Transcription

Suppose now that we wish to align a transcription of the page discussed in the preceding section with particular zones. We begin by giving each relevant part of the facsimile an identifier:
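A sketch of the facsimile with identifiers added. The identifiers are those referenced by the facs attributes in the transcription in this section; the coordinates of the heading zone are an illustrative assumption, and the others follow the embedded transcription in section 11.2.2.

```xml
<facsimile>
 <surface ulx="0" uly="0" lrx="200" lry="300">
  <!-- the whole page -->
  <zone xml:id="B49r" ulx="0" uly="0" lrx="200" lry="300">
   <graphic url="Bovelles-49r.png"/>
  </zone>
  <!-- the chapter heading (coordinates assumed for illustration) -->
  <zone xml:id="B49rHead" ulx="25" uly="25" lrx="180" lry="60"/>
  <!-- the paragraph in italics -->
  <zone xml:id="B49rPara2" ulx="28" uly="75" lrx="175" lry="178"/>
  <!-- the figure -->
  <zone xml:id="B49rFig1" ulx="105" uly="76" lrx="175" lry="160"/>
  <!-- the word "pendans" -->
  <zone xml:id="B49rW457" ulx="45" uly="125" lrx="60" lry="130"/>
 </surface>
</facsimile>
```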

The alignment between transcription and image is made, as usual, by means of the facs attribute:

<pb facs="#B49r"/>
<fw>De Geometrie 49</fw>
<head facs="#B49rHead"> DU SON ET ACCORD DES CLOCHES ET <lb/> des alleures des
chevaulx, chariotz &amp; charges, des fontaines:&amp; <lb/> encyclie du monde,
&amp; de la dimension du corps humain.<lb/> Chapitre septiesme</head>
<div n="1">
<p>Le son &amp; accord des cloches pendans en ung mesme <lb/> axe, est faict en
   contraires parties.</p>
<p rend="it" facs="#B49rPara2">LEs cloches ont quasi fi<lb/>gures de rondes
   pyra<lb/>mides imperfaictes &amp; <lb/> irregulieres: &amp; leur accord se
<lb/> fait par reigle geometrique. Com<lb/>me si les deux cloches C &amp; D
<lb/> sont <w facs="#B49rW457">pendans</w> à ung mesme axe <lb/> ou essieu A B:
   je dis que leur ac<lb/>cord se fera en co<ex>n</ex>traires parties<lb/>
   co<ex>m</ex>me voyez icy figuré. Car qua<ex>n</ex>d <lb/> lune sera en
   hault, laultre declinera embas. Aultrement si elles decli<lb/>nent toutes deux
   ensembles en une mesme partie, elles seront discord, <lb/> &amp; sera leur
   sonnerie mal plaisante à oyr.<figure facs="#B49rFig1">
   <graphic url="Bovelles49r-detail.png"/>
  </figure>
</p>
</div>

It is also possible to point in the other direction, from a surface or zone to the corresponding text. This is the function of the start attribute, which supplies the identifier of the element containing at least the start of the transcribed text found within the surface or zone concerned. Thus, another way of linking this page with its transcription would be simply:

<facsimile>
<surface start="#PB49R">
  <graphic url="Bovelles-49r.png"/>
</surface>
</facsimile>
<text>
<body>
  <div>

   <pb xml:id="PB49R"/>
   <fw>De Geometrie 49</fw>

  </div>
</body>
</text>

11.2.2 Embedded Transcription

An embedded transcription is one in which words and other written traces are encoded as subcomponents of elements representing the physical surfaces carrying them rather than independently of them.

The following elements are available for this purpose:

sourceDoc contains a transcription or other representation of a single source document potentially forming part of a dossier génétique or collection of sources.
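surface defines a written surface in terms of rectangular coordinates, optionally grouping one or more graphic representations of that space or of rectangular zones within it.
zone defines a rectangular area on the surface defined by a surface element.
line contains the transcription of a topographic line in the source document.
seg marks any arbitrary phrase-level unit of text (including other seg elements).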

The elements surface, surfaceGrp, and zone were introduced above in section 11.1 Digital Facsimiles. When supplied within a sourceDoc element, these elements may contain transcriptions of the written content of a source in addition to or as an alternative to digital images of them. Such transcription may be placed directly within the zone element, or within one or more line elements, for cases where the writing is linear, in the sense that it is composed of discrete tokens organized physically into groups, typically organized in a sequence corresponding with the way they are intended to be read. Depending on the directionality of the writing system used, this might be any combination of top-down and left to right, or vice versa. The element line may be used to hold a complete group of such tokens. Where, however, the lineation is not considered significant, any group of tokens may be indicated using the zone element. The seg element described in section 16.3 Blocks, Segments, and Anchors may also be used to indicate smaller sequences of tokens within zone or line, as appropriate.

Returning to the preceding example, we might transcribe the content of the zone to which we gave the identifier B49rPara2 within a sourceDoc element as follows:

<sourceDoc>
<surface
   ulx="0"
   uly="0"
   lrx="200"
   lry="300">
  <zone
    ulx="0"
    uly="0"
    lrx="200"
    lry="300">
   <graphic url="Bovelles-49r.png"/>
  </zone>

  <zone
    ulx="28"
    uly="75"
    lrx="175"
    lry="178">
   <line>LEs cloches ont quasi
       fi</line>
   <line>gures de rondes pyra</line>
   <line>mides imperfaictes &amp;
   </line>
   <line> irregulieres: &amp; leur accord se</line>
   <line> fait par reigle geometrique. Com</line>
   <line>me si les deux cloches C
       &amp; D </line>
   <line> sont <zone
      ulx="45"
      uly="125"
      lrx="60"
      lry="130">pendans</zone> à ung mesme axe</line>
   <line> ou essieu A B: je dis que
       leur ac</line>
   <line>cord se fera en cõtraires parties</line>
   <line> cõme
       voyez icy figuré. Car quãd </line>
   <line> lune sera en hault, laultre
       declinera embas. Aultrement si elles declinent toutes deux ensembles en une
       mesme partie, elles seront discord,</line>
   <line> &amp; sera leur sonnerie
       mal plaisante à oyr.</line>
  </zone>
  <zone
    ulx="105"
    uly="76"
    lrx="175"
    lry="160">
   <graphic url="Bovelles49r-detail.png"/>
  </zone>
</surface>
</sourceDoc>

As mentioned above, some or all of the written surfaces being transcribed may be composed of physically distinct scraps. In the following example, taken from the Walt Whitman Archive, two pieces of newsprint have been glued to a piece of blue paper on which a poem is being drafted:

Figure 11.3. Single leaf of notes possibly related to the poem eventually titled Sleepers. From the Walt Whitman Archive (Duke 258).

The two pieces of newsprint might simply be regarded as special kinds of zone, but they are also new surfaces, since they might contain additional written zones themselves (such as the numbers in this case).

Using these elements, the Whitman draft above might be encoded as follows:

<surface>
<zone>
  <line>Poem</line>
  <line>As in Visions of — at</line>
  <line>night —</line>
  <line>All sorts of fancies running through</line>
  <line>the head</line>
</zone>
<zone>
  <surface type="newsprint" attachment="glue" flipping="false">
   <zone>Spring has just set in here, and the weather.... a steamer </zone>
   <metamark function="sequence">2</metamark>
  </surface>
</zone>
<zone>
  <surface type="newsprint" attachment="glue" flipping="false">
   <zone>"The shores on either side of the Sound are... The In- </zone>
   <metamark function="sequence">3</metamark>
  </surface>
</zone>
</surface>

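evidence	indicates the nature of the evidence supporting the reliability or accuracy of the intervention or interpretation.
source	contains one or more pointers indicating the sources which support the given reading.

cert	(certainty) signifies the degree of certainty associated with the intervention or interpretation.
resp	(responsible party) indicates the agency responsible for the intervention or interpretation, for example an editor or translator.

type	characterizes the element by assigning it to a classification.
subtype	provides a sub-categorization of the element, if needed.

seq	(sequence) assigns a sequence number related to the order in which the encoded features carrying this attribute are believed to have occurred.
status	indicates the effect of the intervention, for example in the case of a deletion, strikeouts which include too much or too little text, or in the case of an addition, an insertion which duplicates some of the text already present.
hand	identifies the hand of the agent which made the intervention.

reason	gives the reason for omission, for example sampling, inaudible, irrelevant, cancelled, or cancelled and illegible.
hand	in the case of text omitted from the transcription because of deliberate deletion by an identifiable hand, indicates the hand which made the deletion.
agent	in the case of text omitted because of damage, categorizes the cause of the damage, if it can be identified.

scribe	gives a name or other identifier for the scribe believed to be responsible for this hand.
script	characterizes the particular script or writing style used by this hand, for example secretary, copperplate, Chancery, Italian, etc.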
scribeRef	points to a full description of the scribe concerned, typically supplied by a person element elsewhere in the description.
scriptRef	points to a full description of the script or writing style used by this hand, typically supplied by a scriptNote element elsewhere in the description.
medium	describes the tint or type of ink, for example brown, or other writing medium, for example pencil.
scope	specifies how widely this hand is used in the manuscript.

hand	identifies the hand responsible for the damage, where it can be identified.
agent	categorizes the cause of the damage.
degree	indicates the degree of damage. The degree attribute should be used on the damage element only where the damaged text can still be read; text supplied from other sources should be marked with the supplied element.
group	assigns an arbitrary number to each stretch of damage regarded as forming part of the same physical phenomenon.

extent	indicates the size of the object concerned using a project-specific vocabulary combining quantity and units in a single string of words.
unit	names the unit used for the measurement.
quantity	specifies the size of the object in the units specified.

min	where the measurement summarizes more than one observation or a range, supplies the minimum value observed.
max	where the measurement summarizes more than one observation or a range, supplies the maximum value observed.
atLeast	gives a minimum estimated value for the approximate measurement.
atMost	gives a maximum estimated value for the approximate measurement.

function	describes the function (for example status, insertion, deletion, transposition) of the mark.
target	identifies one or more elements to which the function indicated by the metamark applies.
