15 Language Corpora
The term language corpus is used to mean a number of rather different things. It may refer simply to any collection of linguistic data (for example, written, spoken, signed, or multimodal), although many practitioners prefer to reserve it for collections which have been organized or collected with a particular end in view, generally to characterize a particular state or variety of one or more languages. Because opinions as to the best method of achieving this goal differ, various subcategories of corpora have also been identified. For our purposes however, the distinguishing characteristic of a corpus is that its components have been selected or structured according to some conscious set of design criteria.
These design criteria may be very simple and undemanding, or very sophisticated. A corpus may be intended to represent (in the statistical sense) a particular linguistic variety or sublanguage, or it may be intended to represent all aspects of some assumed ‘core’ language. A corpus may be made up of whole texts or of fragments or text samples. It may be a ‘closed’ corpus, or an ‘open’ or ‘monitor’ corpus, the composition of which may change over time. However, since an open corpus is of necessity finite at any particular point in time, the only likely effect of its expansibility from the encoding point of view may be some increased difficulty in maintaining consistent encoding practices (see further section 15.5 Recommendations for the Encoding of Large Corpora). For simplicity, therefore, our discussion largely concerns ways of encoding closed corpora, regarded as single but composite texts.
Language corpora are regarded by these Guidelines as composite texts rather than unitary texts (on this distinction, see chapter 4 Default Text Structure). This is because although each discrete sample of language in a corpus clearly has a claim to be considered as a text in its own right, it is also regarded as a subdivision of some larger object, if only for convenience of analysis. Corpora share a number of characteristics with other types of composite texts, including anthologies and collections. Most notably, different components of composite texts may exhibit different structural properties (for example, some may be composed of verse, and others of prose), thus potentially requiring elements from different TEI modules.
Aside from these high-level structural differences, and possibly differences of scale, the encoding of language corpora and the encoding of individual texts present identical sets of problems. Any of the encoding techniques and elements presented in other chapters of these Guidelines may therefore prove relevant to some aspect of corpus encoding and may be used in corpora. Therefore, we do not repeat here the discussion of such fundamental matters as the representation of multiple character sets (see chapter vi. Languages and Character Sets); nor do we attempt to summarize the variety of elements provided for encoding basic structural features such as quoted or highlighted phrases, cross-references, lists, notes, editorial changes and reference systems (see chapter 3 Elements Available in All TEI Documents). In addition to these general purpose elements, these Guidelines offer a range of more specialized sets of tags which may be of use in certain specialized corpora, for example those consisting primarily of verse (chapter 6 Verse), drama (chapter 7 Performance Texts), transcriptions of spoken text (chapter 8 Transcriptions of Speech), etc. Chapter 1 The TEI Infrastructure should be reviewed for details of how these and other components of the Guidelines should be tailored to create a document type definition appropriate to a given application. In sum, it should not be assumed that only the matters specifically addressed in this chapter are of importance for corpus creators.
This chapter does however include some other material relevant to corpora and corpus-building, for which no other location appeared suitable. It begins with a review of the distinction between unitary and composite texts, and of the different methods provided by these Guidelines for representing composite texts of different kinds (section 15.1 Varieties of Composite Text). Section 15.2 Contextual Information describes a set of additional header elements provided for the documentation of contextual information, of importance largely though not exclusively to language corpora. This is the additional module for language corpora proper. Section 15.3 Associating Contextual Information with a Text discusses a mechanism by which individual parts of the TEI Header may be associated with different parts of a TEI-conformant text. Section 15.4 Linguistic Annotation of Corpora reviews various methods of providing linguistic annotation in corpora, with some specific examples of relevance to current practice in corpus linguistics. Finally, section 15.5 Recommendations for the Encoding of Large Corpora provides some general recommendations about the use of these Guidelines in the building of large corpora.
15.1 Varieties of Composite Text
- teiCorpus contains the whole of a TEI-encoded corpus, comprising a single corpus header and one or more TEI elements, each containing a single text header and a text.
- TEI (TEI document) contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a teiCorpus element.
- teiHeader (TEI header) supplies the descriptive and declarative information making up an electronic title page prefixed to every TEI-conformant text.
  type specifies the kind of document to which the TEI header relates.
- text (text) contains a single text of any kind, whether unitary or composite, for example a poem or drama, a collection of essays, a novel, a dictionary, or a corpus sample.
- group (group) contains the body of a composite text, grouping together a sequence of distinct texts (or groups of such texts) which are regarded as a unit for some purpose, for example the collected works of an author, a sequence of prose essays, etc.
Composite texts include:
- language corpora
- collections or anthologies
- poem cycles and epistolary works (novels or essays written in the form of collections or series of letters)
- otherwise unitary texts, within which one or more subordinate texts are embedded
In corpora, the component samples are clearly distinct texts, but the systematic collection, standardized preparation, and common markup of the corpus often make it useful to treat the entire corpus as a unit, too. Some corpora may become so well established as to be regarded as texts in their own right; the Brown and LOB corpora are now close to achieving this status.
A corpus encoded using the teiCorpus element has the following overall structure:
<teiCorpus>
  <teiHeader type="corpus"/>
  <TEI>
    <teiHeader type="text"/>
    <text/>
  </TEI>
  <TEI>
    <teiHeader type="text"/>
    <text/>
  </TEI>
</teiCorpus>
Header information which relates to the whole corpus rather than to individual components of it should be factored out and included in the teiHeader element prefixed to the whole. This two-level structure allows for contextual information to be specified at the corpus level, at the individual text level, or at both. Discussion of the kinds of information which may thus be specified is provided below, in section 15.2 Contextual Information, as well as in chapter 2 The TEI Header. Information of this type should in general be specified only once: a variety of methods are provided for associating it with individual components of a corpus, as further described in section 15.3 Associating Contextual Information with a Text.
In some cases, the design of a corpus is reflected in its internal structure. For example, a corpus of newspaper extracts might be arranged to combine all stories of one type (reportage, editorial, reviews, etc.) into some higher-level grouping, possibly with sub-groups for date, region, etc. The teiCorpus element provides no direct support for reflecting such internal corpus structure in the markup: it treats the corpus as an undifferentiated series of components, each tagged TEI.
If it is essential to reflect a single permanent organization of a corpus into sub- and sub-sub-corpora, then the corpus or the high-level subcorpora may be encoded as composite texts, using the group element described below and in section 4.3.1 Grouped Texts. The mechanisms for corpus characterization described in this chapter, however, are designed to reduce the need to do this. Useful groupings of components may easily be expressed using the text classification and identification elements described in section 15.2.1 The Text Description, and those for associating declarations with corpus components described in section 15.3 Associating Contextual Information with a Text. These methods also allow several different methods of text grouping to co-exist, each to be used as needed at different times. This helps minimize the danger of cross-classification and mis-classification of samples, and helps improve the flexibility with which parts of a corpus may be characterized for different applications.
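As a minimal sketch of the first option, a newspaper corpus with a single fixed division by story type might nest group elements inside one composite text. The story types, the n values, and the placeholder content below are illustrative assumptions, not taken from any actual corpus.

```xml
<text>
  <group>
    <!-- hypothetical sub-corpus: all reportage samples -->
    <group n="reportage">
      <text>
        <body>
          <p><!-- first reportage sample --></p>
        </body>
      </text>
      <text>
        <body>
          <p><!-- second reportage sample --></p>
        </body>
      </text>
    </group>
    <!-- hypothetical sub-corpus: all editorial samples -->
    <group n="editorial">
      <text>
        <body>
          <p><!-- first editorial sample --></p>
        </body>
      </text>
    </group>
  </group>
</text>
```

Because this hierarchy is fixed in the markup, each sample belongs to exactly one group; the classification-based methods mentioned above avoid that restriction.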
Anthologies and collections are often treated as texts in their own right, if only for historical reasons. In conventional publishing, at least, anthologies are published as units, with single editorial responsibility and common front and back matter which may need to be included in their electronic encodings. The texts collected in the anthology, of course, may also need to be identifiable as distinct individual objects for study.
Poem cycles, epistolary novels, and epistolary essays differ from anthologies in that they are often written as single works, by single authors, for single occasions; nevertheless, it can be useful to treat their constituent parts as individual texts, as well as the cycle itself. Structurally, therefore, they may be treated in the same way as anthologies: in both cases, the body of the text is composed largely of other texts.
The group element is provided to simplify the encoding of collections, anthologies, and cyclic works; as noted above, the group element can also be used to record the potentially complex internal structure of language corpora. For a full description, see chapter 4 Default Text Structure.
Some composite texts, finally, are neither corpora, nor anthologies, nor cyclic works: they are otherwise unitary texts within which other texts are embedded. In general, they may be treated in the same way as unitary texts, using the normal TEI and body elements. The embedded text itself may be encoded using the text element, which may occur within quotations or between paragraphs or other chunk-level elements inside the sections of a larger text. For further discussion, see chapter 4 Default Text Structure.
All composite texts share the characteristic that their different component texts may be of structurally similar or dissimilar types. If all component texts can be encoded using the same module, then no problem arises. If, however, they require different modules, then all of these must be included in the schema. This process is described in more detail in section 1.1 TEI Modules.
15.2 Contextual Information
Contextual information is of particular importance for collections or corpora composed of samples from a variety of different kinds of text. Examples of such contextual information include: the age, sex, and geographical origins of participants in a language interaction, or their socio-economic status; the cost and publication data of a newspaper; the topic, register or factuality of an extract from a textbook. Such information may be of the first importance, whether as an organizing principle in creating a corpus (for example, to ensure that the range of values in such a parameter is evenly represented throughout the corpus, or represented proportionately to the population being sampled), or as a selection criterion in analysing the corpus (for example, to investigate the language usage of some particular vector of social characteristics).
Such contextual information is potentially of equal importance for unitary texts, and these Guidelines accordingly make no particular distinction between the kinds of information which should be gathered for unitary and for composite texts. In either case, the information should be recorded in the appropriate section of a TEI Header, as described in chapter 2 The TEI Header. In the case of language corpora, such information may be gathered together in the overall corpus header, or split across all the component texts of a corpus, in their individual headers, or divided between the two. The association between an individual corpus text and the contextual information applicable to it may be made in a number of ways, as further discussed in section 15.3 Associating Contextual Information with a Text below.
Chapter 2 The TEI Header, which should be read in conjunction with the present section, describes in full the range of elements available for the encoding of information relating to the electronic file itself, for example its bibliographic description and those of the source or sources from which it was derived (see section 2.2 The File Description); information about the encoding practices followed with the corpus, for example its design principles, editorial practices, reference system, etc. (see section 2.3 The Encoding Description); more detailed descriptive information about the creation and content of the corpus, such as the languages used within it and any descriptive classification system used (see section 2.4 The Profile Description); and version information documenting any changes made in the electronic text (see section 2.5 The Revision Description).
In addition to the elements defined by chapter 2 The TEI Header, several other elements can be used in the TEI header if the additional module defined by this chapter is invoked. These additional tags make it possible to characterize the social or other situation within which a language interaction takes place or is experienced, the physical setting of a language interaction, and the participants in it. Though this information may be relevant to, and provided for, unitary texts as well as for collections or corpora, it is more often recorded for the components of systematically developed corpora than for isolated texts, and thus this module is referred to as being ‘for language corpora’.
- textDesc (text description) provides a description of a text in terms of its situational context.
- particDesc (participation description) describes the identifiable speakers, voices, or other participants in a linguistic interaction.
- settingDesc (setting description) describes the setting or settings within which a language interaction takes place, either as a prose description or as a series of setting elements.
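As a rough sketch, these three elements sit alongside one another within the profileDesc of a header. The prose content below is invented for illustration, and the attribute values on the children of textDesc are placeholders only; the individual elements are described in the following sections.

```xml
<profileDesc>
  <!-- situational parameters of the text (illustrative values) -->
  <textDesc>
    <channel mode="s">telephone</channel>
    <constitution type="single"/>
    <derivation type="original"/>
    <domain type="domestic"/>
    <factuality type="fact"/>
    <interaction type="complete" active="plural" passive="single"/>
    <preparedness type="spontaneous"/>
    <purpose type="inform" degree="high"/>
  </textDesc>
  <!-- who takes part (invented description) -->
  <particDesc>
    <p>Two adult speakers, members of the same family.</p>
  </particDesc>
  <!-- where and when the interaction takes place (invented description) -->
  <settingDesc>
    <p>A telephone call between two private homes, early evening.</p>
  </settingDesc>
</profileDesc>
```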
15.2.1 The Text Description
- channel (primary channel) describes the medium or channel by which a text is delivered or experienced. For a written text, this might be print, manuscript, email, etc.; for a spoken one, radio, telephone, face-to-face, etc.
  mode specifies the mode of this channel with respect to speech and writing.
- constitution (constitution) describes the internal composition of a text or text sample, for example as fragmentary, complete, etc.
  type specifies how the text was constituted.
- derivation (derivation) describes the nature and extent of originality of this text.
  type categorizes the derivation of the text.
- domain (domain of use) describes the most important social context in which the text was realized or for which it is intended, for example private vs. public, education, religion, etc.
  type categorizes the domain of use.
- factuality (factuality) describes the extent to which the text may be regarded as imaginative or non-imaginative, that is, as describing a fictional or a real world.
  type categorizes the factuality of the text.
- interaction (interaction) describes the extent, cardinality and nature of any interaction among those producing and experiencing the text, for example in the form of response or interjection, commentary, etc.
  type specifies the degree of interaction between active and passive participants in the text.
  active specifies the number of active participants (or addressors) producing parts of the text.
  passive specifies the number of passive participants (or addressees) to whom a text is directed or in whose presence it is created or performed.
- preparedness (preparedness) describes the extent to which a text may be regarded as prepared or spontaneous.
  type a keyword characterizing the type of preparedness.
- purpose (purpose) characterizes a single purpose or communicative function of the text.
  type specifies a particular kind of purpose.
  degree specifies the extent to which this purpose predominates.
These elements constitute a model class called model.textDescPart; new parameters may be defined by defining new elements and adding them to that class, as further described in 23.2 Personalization and Customization.
By default, a text description will contain each of the above elements, supplied in the order specified. Except for the purpose element, which may be repeated to indicate multiple purposes, no element should appear more than once within a single text description. Each element may be empty, or may contain a brief qualification or more detailed description of the value expressed by its attributes. It should be noted that some texts, in particular literary ones, may resist unambiguous classification in some of these dimensions; in such cases, the situational parameter in question should be given the content ‘not applicable’ or an equivalent phrase.
Describing texts in terms of these situational parameters has the following advantages:
- it enables a relatively continuous characterization of texts (in contrast to discrete categories based on type or topic)
- it enables meaningful comparisons across corpora
- it allows analysts to build and compare their own text-types based on the particular parameters of interest to them
- it is equally applicable to spoken, written, or signed texts
Two alternative approaches to the use of these parameters are supported by these Guidelines. One is to use pre-existing taxonomies such as those used in subject classification or other types of text categorization. Such taxonomies may also be appropriate for the description of the topics addressed by particular texts. Elements for this purpose are described in section 2.4.3 The Text Classification, and elements for defining or declaring such classification schemes in section 2.3.6 The Classification Declaration. A second approach is to develop an application-specific set of feature structures and an associated feature system declaration, as described in chapter 18 Feature Structures and section 18.11 Feature System Declaration.
Where the organizing principles of a corpus or collection so permit, it may be convenient to regard a particular set of values for the situational parameters listed in this section as forming a text-type in its own right; this may also be useful where the same set of values applies to several texts within a corpus. In such a case, the set of text-types so defined should be regarded as a taxonomy. The mechanisms described in section 2.3.6 The Classification Declaration may be used to define hierarchic taxonomies of such text-types, provided that the catDesc component of the category element contains a textDesc element rather than a prose description. Particular texts may then be associated with such definitions using the mechanisms described in section 2.4.3 The Text Classification.
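A minimal sketch of this arrangement follows, assuming a single text-type defined in the corpus header and one text that refers to it. The identifiers textTypes and tt.conv are hypothetical, as are the parameter values.

```xml
<!-- in the encodingDesc of the corpus header -->
<classDecl>
  <taxonomy xml:id="textTypes">
    <category xml:id="tt.conv">
      <catDesc>
        <!-- the text-type is defined by a textDesc rather than by prose -->
        <textDesc>
          <channel mode="s">face-to-face</channel>
          <constitution type="single"/>
          <derivation type="original"/>
          <domain type="domestic"/>
          <factuality type="mixed"/>
          <interaction type="complete" active="plural" passive="many"/>
          <preparedness type="spontaneous"/>
          <purpose type="entertain" degree="high"/>
        </textDesc>
      </catDesc>
    </category>
  </taxonomy>
</classDecl>

<!-- in the profileDesc of an individual text header -->
<textClass>
  <catRef target="#tt.conv"/>
</textClass>
```

Every text whose header carries this catRef is thereby assigned the same set of situational values, without repeating the textDesc in each header.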
An informal domestic conversation might be described as follows:
<textDesc>
  <channel mode="s">informal face-to-face conversation</channel>
  <constitution type="single">each text represents a continuously
    recorded interaction among the specified participants
  </constitution>
  <derivation type="original"/>
  <domain type="domestic">plans for coming week, local affairs</domain>
  <factuality type="mixed">mostly factual, some jokes</factuality>
  <interaction type="complete" active="plural" passive="many"/>
  <preparedness type="spontaneous"/>
  <purpose type="entertain" degree="high"/>
  <purpose type="inform" degree="medium"/>
</textDesc>
A work of printed fiction might be described as follows:
<textDesc>
  <channel mode="w">print; part issues</channel>
  <constitution type="single"/>
  <derivation type="original"/>
  <domain type="art"/>
  <factuality type="fiction"/>
  <interaction type="none"/>
  <preparedness type="prepared"/>
  <purpose type="entertain" degree="high"/>
  <purpose type="inform" degree="medium"/>
</textDesc>
15.2.2 The Participant Description
The particDesc element in the profileDesc element provides additional information about the participants in a spoken text or, where this is judged appropriate, the persons named or depicted in a written text. When the detailed elements provided by the namesdates module described in 13 Names, Dates, People, and Places are included in a schema, this element can contain detailed demographic or descriptive information about individual speakers or groups of speakers, such as their names or other personal characteristics. Individually identified persons may also be identified by a code which can then be used elsewhere within the encoded text, for example as the value of a who attribute.
It should be noted that although the terms speaker or participant are used throughout this section, it is intended that the same mechanisms may be used to characterize fictional personæ or ‘voices’ within a written text, except where otherwise stated. For the purposes of analysis of language usage, the information specified here should be equally applicable to written, spoken, or signed texts.
The element particDesc contains a description of the participants in an interaction, which may be supplied as straightforward prose, possibly containing a list of names, encoded using the usual list and name elements, or alternatively using the more specific and detailed listPerson element provided by the namesdates module described in 13 Names, Dates, People, and Places.
For example, a participant may be described in prose:
<particDesc>
  <p>Female informant, well-educated, born in Shropshire UK, 12 Jan
    1950, of unknown occupation. Speaks French fluently.
    Socio-Economic status B2 in the PEP classification scheme.</p>
</particDesc>
The same information may be given in a more structured form, using the elements of the namesdates module:
<person>
  <birth when="1950-01-12">
    <date>12 Jan 1950</date>
    <name type="place">Shropshire, UK</name>
  </birth>
  <langKnowledge tags="en fr">
    <langKnown level="first" tag="en">English</langKnown>
    <langKnown tag="fr">French</langKnown>
  </langKnowledge>
  <residence>Long term resident of Hull</residence>
  <education>University postgraduate</education>
  <occupation>Unknown</occupation>
  <socecStatus scheme="#pep" code="#b2"/>
</person>
For a written text, the participant description may simply list the principal characters:
<particDesc>
  <p>The chief speaking characters in this novel are
    <list>
      <item xml:id="EMWOO">
        <name>Emma Woodhouse</name>
      </item>
      <item xml:id="DARCY">
        <name>Mr Darcy</name>
      </item>
      <!-- ... -->
    </list>
  </p>
</particDesc>
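The link between a participant's identifying code and the encoded text can be sketched as follows. This assumes that the namesdates module (for listPerson and person) and the module for transcribed speech (for u) are both included in the schema; the code P1 and the utterance itself are made up for illustration.

```xml
<!-- in the header: the participant is declared with an identifier -->
<particDesc>
  <listPerson>
    <person xml:id="P1">
      <p>Female informant, born 1950.</p>
    </person>
  </listPerson>
</particDesc>

<!-- in the transcription: the who attribute points back to that participant -->
<u who="#P1">see you on Tuesday then</u>
```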
15.2.3 The Setting Description
The settingDesc element is used to describe the setting or settings in which language interaction takes place. It may contain a prose description, analogous to a stage description at the start of a play, stating in broad terms the locale, or a more detailed description of a series of such settings.
- setting (setting) describes one particular setting in which a language interaction takes place.
- name (name, proper noun) contains a proper noun or noun phrase.
- date (date) contains a date in any format.
- time (time) contains a phrase defining a time of day in any format.
- locale (locale) contains a brief informal description of the kind of place concerned, for example a room, a restaurant, a park bench, etc.
- activity (activity) contains a brief informal description of what a participant in a language interaction is doing other than speaking, if anything.
For example, a setting may be described in prose:
<settingDesc>
  <p>The time is early spring, 1989. P1 and P2 are playing on the rug
    of a suburban home in Bedford. P3 is doing the washing up at the
    sink. P4 (a radio announcer) is in a broadcasting studio in
    London.</p>
</settingDesc>
The same information may be given in a more structured form, with one setting element for each participant or group of participants:
<settingDesc>
  <setting who="#p1 #p2">
    <name type="city">Bedford</name>
    <name type="region">UK: South East</name>
    <date>early spring, 1989</date>
    <locale>rug of a suburban home</locale>
    <activity>playing</activity>
  </setting>
  <setting who="#p3">
    <name type="city">Bedford</name>
    <name type="region">UK: South East</name>
    <date>early spring, 1989</date>
    <locale>at the sink</locale>
    <activity>washing-up</activity>
  </setting>
  <setting who="#p4">
    <name type="place">London, UK</name>
    <time>unknown</time>
    <locale>broadcasting studio</locale>
    <activity>radio performance</activity>
  </setting>
</settingDesc>
15.3 Associating Contextual Information with a Text
This section discusses the association of the contextual information held in the header with the individual elements making up a TEI text or corpus. Contextual information is held in elements of various kinds within the TEI header, as discussed elsewhere in this section and in chapter 2 The TEI Header. Here we consider what happens when different parts of a document need to be associated with different contextual information of the same type, for example when one part of a document uses a different encoding practice from another, or where one part relates to a different setting from another. In such situations, there will be more than one instance of a header element of the relevant type.
This may come about in two ways:
- A given element may appear in the corpus header only, in the header of one or more texts only, or in both places.
- There may be multiple occurrences of certain elements in either corpus or text header.
To simplify the exposition, we deal with these two possibilities separately in what follows; however, they may be combined as desired.
15.3.1 Combining Corpus and Text Headers
A TEI-conformant document may have more than one header only in the case of a TEI corpus, which must have a header in its own right, as well as the obligatory header for each text. Every element specified in a corpus-header is understood as if it appeared within every text header in the corpus. An element specified in a text header but not in the corpus header supplements the specification for that text alone. If any element is specified in both corpus and text headers, the corpus header element is over-ridden for that text alone.
The titleStmt for a corpus text is understood to be prefixed by the titleStmt given in the corpus header. All other optional elements of the fileDesc should be omitted from an individual corpus text header unless they differ from those specified in the corpus header. All other header elements behave identically, in the manner documented below. This facility makes it possible to state once and for all in the corpus header each piece of contextual information which is common to the whole of the corpus, while still allowing for individual texts to vary from this common denominator.
For example, the following outline shows the structure of a corpus comprising three texts, the first and last of which share the default encoding description given in the corpus header, while the second has an encoding description of its own:
<teiCorpus>
  <teiHeader>
    <fileDesc>
      <!-- corpus file description -->
    </fileDesc>
    <encodingDesc>
      <!-- default encoding description -->
    </encodingDesc>
    <revisionDesc>
      <!-- corpus revision description -->
    </revisionDesc>
  </teiHeader>
  <TEI>
    <teiHeader>
      <fileDesc>
        <!-- file description for this corpus text -->
      </fileDesc>
    </teiHeader>
    <text>
      <!-- first corpus text -->
    </text>
  </TEI>
  <TEI>
    <teiHeader>
      <fileDesc>
        <!-- file description for this corpus text -->
      </fileDesc>
      <encodingDesc>
        <!-- encoding description for this corpus text, over-riding the default -->
      </encodingDesc>
    </teiHeader>
    <text>
      <!-- second corpus text -->
    </text>
  </TEI>
  <TEI>
    <teiHeader>
      <fileDesc>
        <!-- file description for third corpus text -->
      </fileDesc>
    </teiHeader>
    <text>
      <!-- third corpus text -->
    </text>
  </TEI>
</teiCorpus>
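A more concrete sketch of the same inheritance rule is given below, using langUsage as the over-ridden element. The titles, the prose statements, and the choice of languages are all invented for illustration; only the pattern of inheritance and over-riding is the point.

```xml
<teiCorpus>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A hypothetical newspaper corpus</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished illustration.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Samples drawn from printed newspapers.</p>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <!-- corpus-level default: applies to every text that does not say otherwise -->
      <langUsage>
        <language ident="en">English</language>
      </langUsage>
    </profileDesc>
  </teiHeader>
  <TEI>
    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title>Sample 1</title>
        </titleStmt>
        <publicationStmt>
          <p>Unpublished illustration.</p>
        </publicationStmt>
        <sourceDesc>
          <p>Invented source.</p>
        </sourceDesc>
      </fileDesc>
      <!-- no langUsage here: the corpus-level declaration is inherited -->
    </teiHeader>
    <text>
      <body>
        <p><!-- English sample --></p>
      </body>
    </text>
  </TEI>
  <TEI>
    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title>Sample 2</title>
        </titleStmt>
        <publicationStmt>
          <p>Unpublished illustration.</p>
        </publicationStmt>
        <sourceDesc>
          <p>Invented source.</p>
        </sourceDesc>
      </fileDesc>
      <profileDesc>
        <!-- over-rides the corpus-level langUsage for this text alone -->
        <langUsage>
          <language ident="fr">French</language>
        </langUsage>
      </profileDesc>
    </teiHeader>
    <text>
      <body>
        <p><!-- French sample --></p>
      </body>
    </text>
  </TEI>
</teiCorpus>
```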
15.3.2 Declarable Elements
Certain of the elements which can appear within a TEI Header are known as declarable elements. These elements have in common the fact that they may be linked explicitly with a particular part of a text or corpus by means of a decls attribute on that element. This linkage is used to over-ride the default association between declarations in the header and a corpus or corpus text. The only header elements which may be associated in this way are those which would not otherwise be meaningfully repeatable.
- att.declarable provides attributes for those elements in the TEI header which may be independently selected by means of the special-purpose decls attribute.
  default indicates whether or not this element is selected by default when its parent element has been selected.
- att.declaring provides attributes for elements which may be independently associated with a particular declarable element within the TEI header, thus over-riding the value which would otherwise be inherited by default.
  decls identifies one or more declarable elements within the TEI header, which are understood to apply to the element bearing this attribute and its content.
The declarable elements are:
- availability (availability) supplies information about the availability of a text, for example any restrictions on its use or distribution, its copyright status, etc.
- bibl (bibliographic citation) contains a loosely structured bibliographic citation of which the sub-components may or may not be explicitly tagged.
- biblFull (fully structured bibliographic citation) contains a fully structured bibliographic citation, in which all components of the TEI file description are present.
- biblStruct (structured bibliographic citation) contains a structured bibliographic citation, in which only bibliographic sub-elements appear and in a specified order.
- broadcast (broadcast) describes a broadcast used as the source of a spoken text.
- correction (correction principles) states how and under what circumstances corrections have been made in the text.
- editorialDecl (editorial practice declaration) provides details of editorial principles and practices applied during the encoding of a text.
- equipment (equipment) provides technical details of the equipment and media used for an audio or video recording used as the source for a spoken text.
- hyphenation (hyphenation) summarizes the way in which end-of-line hyphenation in a source text has been treated in an encoded version of it.
- interpretation (interpretation) describes the scope of any analytic or interpretive information added to the transcription of the text.
- langUsage (language usage) describes the languages, sublanguages, registers, dialects, etc. represented within a text.
- listBibl (citation list) contains a list of bibliographic citations of any kind.
- normalization (normalization) indicates the extent of normalization or regularization of the original source carried out in converting it to electronic form.
- particDesc (participation description) describes the identifiable speakers, voices, or other participants in a linguistic interaction.
- projectDesc (project description) describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
- quotation (quotation) specifies the editorial practice adopted with respect to quotation marks in the original.
- recording (recording event) describes in detail the audio or video recording event used as the source of a spoken text, whether a direct recording or a public broadcast.
- samplingDecl (sampling declaration) contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection.
- scriptStmt (script statement) contains a citation giving details of the script used for a spoken text. The term ‘script’ is understood broadly here as any text prepared in advance of speech (political speech, sermon, interview, address, lecture, broadcast, etc.).
- segmentation (segmentation) describes the principles according to which the text has been segmented, for example into sentences, tone-units, graphemic strata, etc.
- sourceDesc (source description) describes the source from which an electronic text was derived or generated, typically a bibliographic description in the case of a digitized text, or a phrase such as "born digital" for a text which has no previous existence.
- stdVals (standard values) specifies the format used when standardized date or number values are supplied.
- textClass (text classification) groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.
- textDesc (text description) provides a description of a text in terms of its situational context.
- every declarable element must bear a unique identifier
- for each different type of declarable element which occurs more than once within the same parent element, exactly one element must be specified as the default, by means of the default attribute
<editorialDecl>
  <correction xml:id="CorPol1" default="true">
    <p> ... </p>
  </correction>
  <correction xml:id="CorPol2">
    <p> ... </p>
  </correction>
  <normalization xml:id="n1">
    <p> ... </p>
    <p> ... </p>
  </normalization>
</editorialDecl>
<text>
  <body>
    <div1 n="d1"/>
    <div1 n="d2" decls="#CorPol2"/>
    <div1 n="d3"/>
  </body>
</text>
The decls attribute is defined for any element which is a member of the class declaring. This includes the major structural elements text, group, and div, as well as smaller structural units, down to the level of paragraphs in prose, individual utterances in spoken texts, and entries in dictionaries. However, TEI recommended practice is to limit the number of multiple declarable elements used by a document as far as possible, for simplicity and ease of processing.
- An identifier specifying an element which contains multiple instances of one or more other elements should be interpreted as if it explicitly identified the elements identified as the default in each such set of repeated elements
- Each element specified, explicitly or implicitly, by the list of identifiers must be of a different kind.
<encodingDesc>
  <editorialDecl xml:id="ED1" default="true">
    <correction xml:id="C1A" default="true">
      <p> ... </p>
    </correction>
    <correction xml:id="C1B">
      <p> ... </p>
    </correction>
    <normalization xml:id="N1">
      <p> ... </p>
      <p> ... </p>
    </normalization>
  </editorialDecl>
  <editorialDecl xml:id="ED2">
    <correction xml:id="C2A" default="true">
      <p> ... </p>
    </correction>
    <correction xml:id="C2B">
      <p> ... </p>
    </correction>
    <normalization xml:id="N2A">
      <p> ... </p>
    </normalization>
    <normalization xml:id="N2B" default="true">
      <p> ... </p>
    </normalization>
  </editorialDecl>
</encodingDesc>
This encoding description now has two editorial declarations, identified as ED1 (the default) and ED2. For texts not specifying otherwise, ED1 will apply. If ED1 applies, correction method C1A and normalization method N1 apply, since C1A is the specified default correction and N1 is the only normalization within ED1. In the same way, for a text specifying decls as ‘#ED2’, correction C2A and normalization N2B will apply.
A finer-grained approach is also possible. A text might specify <text decls="#C2B #N2A"> to ‘mix and match’ declarations as required. A tag such as <text decls="#ED1 #ED2"> would be illegal, since it identifies two elements of the same type; a tag such as <text decls="#ED2 #C1A"> is also illegal, since in this context ED2 is synonymous with the defaults for that editorial declaration, namely C2A and N2B, resulting in a list that identifies two correction elements (C1A and C2A).
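As a hedged illustration of these selection rules, the following sketch shows three texts drawing on the ED1 and ED2 declarations above. The grouping into a single composite text and the n values are assumptions made for the example; the pointer values use the same # form as the earlier decls="#CorPol2" example.

```xml
<!-- Sketch only: assumes the encodingDesc shown above (ED1, ED2 and their children) -->
<text>
  <group>
    <!-- No decls: the default ED1 applies, hence C1A and N1 -->
    <text n="t1">
      <body>
        <p> ... </p>
      </body>
    </text>
    <!-- Selects ED2 as a whole, hence its defaults C2A and N2B -->
    <text n="t2" decls="#ED2">
      <body>
        <p> ... </p>
      </body>
    </text>
    <!-- Mix and match: one correction and one normalization, each of a different kind -->
    <text n="t3" decls="#C2B #N2A">
      <body>
        <p> ... </p>
      </body>
    </text>
  </group>
</text>
```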
15.3.3 Summary
- If there is a single occurrence of a given declarable element in a corpus header, then it applies by default to all elements within the corpus.
- If there is a single occurrence of a given declarable element in the text header, then it applies by default to all elements of that text irrespective of the contents of the corpus header.
- Where there are multiple occurrences of declarable elements within either corpus or text header:
  - each must have a unique value specified as the value of its xml:id attribute;
  - exactly one must bear a default attribute with the value true.
- It is a semantic error for an element to be associated with more than one occurrence of any declarable element.
- Selecting an element which contains multiple occurrences of a given declarable element is semantically equivalent to selecting only those contained elements which are specified as defaults.
- An association made by one element applies by default to all of its descendants.
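The inheritance rules in this summary might be sketched as follows. This is an illustrative outline only: the identifiers (corpusED1, corpusED2) are hypothetical, and elided header content is shown as comments.

```xml
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><!-- ... --></fileDesc>
    <encodingDesc>
      <!-- Two occurrences: each has a unique xml:id, exactly one is the default -->
      <editorialDecl xml:id="corpusED1" default="true">
        <p> ... </p>
      </editorialDecl>
      <editorialDecl xml:id="corpusED2">
        <p> ... </p>
      </editorialDecl>
    </encodingDesc>
  </teiHeader>
  <TEI>
    <teiHeader>
      <fileDesc><!-- ... --></fileDesc>
    </teiHeader>
    <!-- No decls and no editorialDecl in the text header: corpusED1 applies -->
    <text>
      <body>
        <p> ... </p>
      </body>
    </text>
  </TEI>
  <TEI>
    <teiHeader>
      <fileDesc><!-- ... --></fileDesc>
    </teiHeader>
    <!-- The association made here applies by default to all descendants -->
    <text decls="#corpusED2">
      <body>
        <div>
          <p> ... </p>
        </div>
      </body>
    </text>
  </TEI>
</teiCorpus>
```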
15.4 Linguistic Annotation of Corpora
Language corpora often include analytic encodings or annotations, designed to support a variety of different views of language. The present Guidelines do not advocate any particular approach to linguistic annotation (or ‘tagging’); instead a number of general analytic facilities are provided which support the representation of most forms of annotation in a standard and self-documenting manner. Analytic annotation is of importance in many fields, not only in corpus linguistics, and is therefore discussed in general terms elsewhere in the Guidelines. The present section presents informally some particular applications of these general mechanisms to the specific practice of corpus linguistics.
15.4.1 Levels of Analysis
By linguistic annotation we mean here any annotation determined by an analysis of linguistic features of the text, excluding as borderline cases both the formal structural properties of the text (e.g. its division into chapters or paragraphs) and descriptive information about its context (the circumstances of its production, its genre, or medium). The structural properties of any TEI-conformant text should be represented using the structural elements discussed elsewhere in these Guidelines, for example in chapters 3 Elements Available in All TEI Documents and 4 Default Text Structure. The contextual properties of a TEI text are fully documented in the TEI Header, which is discussed in chapter 2 The TEI Header, and in section 15.2 Contextual Information of the present chapter.
Other forms of linguistic annotation may be applied at a number of levels in a text. A code (such as a word-class or part-of-speech code) may be associated with each word or token, or with groups of such tokens, which may be continuous, discontinuous, or nested. A code may also be associated with relationships (such as cohesion) perceived as existing between distinct parts of a text. The codes themselves may stand for discrete non-decomposable categories, or they may represent highly articulated bundles of textual features. Their function may be to place the annotated part of the text somewhere within a narrowly linguistic or discoursal domain of analysis, or within a more general semantic field, or any combination drawn from these and other domains.
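One possible way of attaching such codes at different levels is sketched below. It is not a recommended scheme: the part-of-speech values, the phrase type NP, and the link type cohesion are hypothetical labels, and the example assumes a schema that includes the modules providing the w, pc, s, phr, and link elements and the pos and lemma attributes.

```xml
<p>
  <s>
    <!-- A code attached to a group of tokens -->
    <phr type="NP">
      <!-- Codes attached to individual tokens -->
      <w xml:id="w1" pos="DET" lemma="the">The</w>
      <w xml:id="w2" pos="NOUN" lemma="corpus">corpus</w>
    </phr>
    <w xml:id="w3" pos="VERB" lemma="grow">grows</w>
    <pc>.</pc>
  </s>
  <s>
    <w xml:id="w4" pos="PRON" lemma="it">It</w>
    <w xml:id="w5" pos="VERB" lemma="change">changes</w>
    <pc>.</pc>
  </s>
  <!-- A code attached to a relationship between distinct parts of the text -->
  <link type="cohesion" target="#w2 #w4"/>
</p>
```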
The manner by which such annotations are generated and attached to the text may be entirely automatic, entirely manual, or a mixture. The ease and accuracy with which analysis may be automated may vary with the level at which the annotation is attached. The method employed should be documented in the interpretation element within the encoding description of the TEI Header, as described in section 2.3.3 The Editorial Practices Declaration. Where different parts of a corpus have used different annotation methods, the decls attribute may be used to indicate the fact, as further discussed in section 15.3 Associating Contextual Information with a Text.
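A minimal sketch of this documentation practice follows, assuming a corpus in which some texts were tagged automatically and others were corrected by hand. The identifiers posAuto and posManual and the wording of the prose descriptions are invented for illustration.

```xml
<!-- In the header: two annotation methods, one marked as the default -->
<encodingDesc>
  <editorialDecl>
    <interpretation xml:id="posAuto" default="true">
      <p>Part-of-speech codes were assigned automatically and were not checked by hand.</p>
    </interpretation>
    <interpretation xml:id="posManual">
      <p>Part-of-speech codes were assigned automatically and then corrected manually.</p>
    </interpretation>
  </editorialDecl>
</encodingDesc>

<!-- In the corpus: a text that departs from the default names the method it uses -->
<text decls="#posManual">
  <body>
    <p> ... </p>
  </body>
</text>
```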
An extended example of one form of linguistic analysis commonly practised in corpus linguistics is given in section 17.4 Linguistic Annotation.
15.5 Recommendations for the Encoding of Large Corpora
These Guidelines include proposals for the identification and encoding of a far greater variety of textual features and characteristics than is likely to be either feasible or desirable in any one language corpus, however large and ambitious. The reasoning behind this catholic approach is further discussed in chapter iv. About These Guidelines. For most large-scale corpus projects, it will therefore be necessary to determine a subset of TEI recommended elements appropriate to the anticipated needs of the project, as further discussed in chapter 23.2 Personalization and Customization; these mechanisms include the ability to exclude selected element types, add new element types, and change the names of existing elements. A discussion of the implications of such changes for TEI conformance is provided in chapter 23.3 Conformance.
- required: texts included within the corpus will always encode textual features in this category, should they exist in the text.
- recommended: textual features in this category will be encoded wherever economically and practically feasible; where present but not encoded, a note should be made in the header.
- optional: textual features in this category may or may not be encoded; no conclusion about the absence of such features can be inferred from the absence of the corresponding element in a given text.
- proscribed: textual features in this category are deliberately not encoded; they may be transcribed as unmarked-up text, represented as gap elements, or silently omitted, as appropriate.
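A project-specific subset of this kind is typically expressed as a schema customization. The fragment below is a rough sketch of what one might look like, not a recommended profile: the identifier corpusSubset, the choice of modules, the excluded elements, and the replacement name omitted are all assumptions made for illustration.

```xml
<schemaSpec ident="corpusSubset" start="teiCorpus TEI">
  <!-- Modules assumed to be needed by this hypothetical project -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="corpus"/>
  <!-- Exclude selected element types, for example those for proscribed features -->
  <moduleRef key="core" except="binaryObject graphic media"/>
  <!-- Change the name of an existing element -->
  <elementSpec ident="gap" module="core" mode="change">
    <altIdent>omitted</altIdent>
  </elementSpec>
</schemaSpec>
```

Excluding an element from the schema makes a proscribed feature impossible to encode by accident, whereas recommended and optional features are better left in the schema and documented in the header.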
15.6 Module for Language Corpora
- Module corpus: Language corpora