8 Transcriptions of Speech

The module described in this chapter is intended for use with a wide variety of transcribed spoken material. It should be stressed, however, that the present proposals are not intended to support, without modification, every variety of research undertaken upon spoken material now or in the future; some discourse analysts, some phonologists, and doubtless others may wish to extend the scheme presented here to express more precisely the set of distinctions they wish to draw in their transcriptions. Speech regarded as a purely acoustic phenomenon may well require different methods from those outlined here, as may speech regarded solely as a process of social interaction.

This chapter begins with a discussion of some of the problems commonly encountered in transcribing spoken language (section 8.1 General Considerations and Overview). Section 8.2 Documenting the Source of Transcribed Speech presents some additional TEI header elements which may be used to document the recording or other source from which transcribed text is taken. Section 8.3 Elements Unique to Spoken Texts describes the basic structural elements provided by this module. Finally, section 8.4 Elements Defined Elsewhere reviews further problems specific to the encoding of spoken language, demonstrating how mechanisms and elements discussed elsewhere in these Guidelines may be applied to them.

8.1 General Considerations and Overview

There is great variation in the ways different researchers have chosen to represent speech using the written medium.³² This reflects the special difficulties which apply to the encoding or transcription of speech. Speech varies according to a large number of dimensions, many of which have no counterpart in writing (for example, tempo, loudness, and pitch). The audibility of speech recorded in natural communication situations is often less than perfect, affecting the accuracy of the transcription. Spoken material may be transcribed in the course of linguistic, acoustic, anthropological, psychological, ethnographic, journalistic, or many other types of research. Even in the same field, the interests and theoretical perspectives of different transcribers may lead them to prefer different levels of detail in the transcript and different styles of visual display. The production and comprehension of speech are intimately bound up with the situation in which speech occurs, far more so than is the case for written texts. A speech transcript must therefore include some contextual features; determining which are relevant is not always simple. Moreover, the ethical problems in recording and making public what was produced in a private setting and intended for a limited audience are more frequently encountered in dealing with spoken texts than with written ones.

Speech also poses difficult structural problems. Unlike a written text, a speech event takes place in time. Its beginning and end may be hard to determine and its internal composition difficult to define. Most researchers agree that the utterances or turns of individual speakers form an important structural component in most kinds of speech, but these are rarely as well-behaved (in the structural sense) as paragraphs or other analogous units in written texts: speakers frequently interrupt each other, use gestures as well as words, leave remarks unfinished and so on. Speech itself, though it may be represented as words, frequently contains items such as vocalized pauses which, although only semi-lexical, have immense importance in the analysis of spoken text. Even non-vocal elements such as gestures may be regarded as forming a component of spoken text for some analytic purposes. Below the level of the individual utterance, speech may be segmented into units defined by phonological, prosodic, or syntactic phenomena; no clear agreement exists, however, even as to appropriate names for such segments.

Spoken texts transcribed according to the guidelines presented here are organized as follows. The overall structure of a TEI spoken text is identical to that of any other TEI text: the TEI element for a spoken text contains a teiHeader element, followed by a text element. Even texts primarily composed of transcribed speech may also include conventional front and back matter, and may even be organized into divisions like printed texts.
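
The overall structure just described might be sketched as follows. This is a minimal illustration only: the title, the front matter, the utterance content, and the speaker identifiers spk1 and spk2 are all invented, and the header is reduced to its required parts.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>A hypothetical recorded conversation: a transcription</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished illustrative sample.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Transcribed from an audio recording.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <!-- conventional front matter is optional in a spoken text -->
    <front>
      <div>
        <head>Transcriber's note</head>
        <p>Invented note describing the transcription practice.</p>
      </div>
    </front>
    <body>
      <!-- spk1 and spk2 are hypothetical identifiers; in a full
           document they would be declared in the header -->
      <u who="#spk1">good morning</u>
      <u who="#spk2">morning</u>
    </body>
  </text>
</TEI>
```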

We may say, therefore, that these Guidelines regard transcribed speech as being composed of arbitrary high-level units called texts. A spoken text might typically be a conversation between a small number of people, a lecture, a broadcast TV item, or a similar event. Each such unit has associated with it a teiHeader providing detailed contextual information such as the source of the transcript, the identity of the participants, whether the speech is scripted or spontaneous, the physical and social setting in which the discourse takes place and a range of other aspects. Details of the header in general are provided in chapter 2 The TEI Header; the particular elements it provides for use with spoken texts are described below (8.2 Documenting the Source of Transcribed Speech). Details concerning additional elements which may be used for the documentation of participant and contextual information are given in 15.2 Contextual Information.
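
As a rough sketch of the kind of contextual information such a header might carry, the following fragment records the source recording, the participants, and the setting. All names, values, and descriptions are invented for illustration; the elements used are those for recording statements and for participant and setting descriptions, and a real header would normally say considerably more.

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>A hypothetical seminar discussion: a transcription</title>
    </titleStmt>
    <publicationStmt>
      <p>Unpublished illustrative sample.</p>
    </publicationStmt>
    <sourceDesc>
      <!-- the recording from which the transcript was made -->
      <recordingStmt>
        <recording type="audio" dur="PT30M">
          <p>Recorded on a portable digital recorder.</p>
        </recording>
      </recordingStmt>
    </sourceDesc>
  </fileDesc>
  <profileDesc>
    <!-- identity of the participants (invented) -->
    <particDesc>
      <listPerson>
        <person xml:id="spk1">
          <p>Seminar leader.</p>
        </person>
        <person xml:id="spk2">
          <p>Graduate student.</p>
        </person>
      </listPerson>
    </particDesc>
    <!-- physical and social setting of the discourse (invented) -->
    <settingDesc>
      <setting>
        <locale>a university seminar room</locale>
        <activity>weekly reading-group discussion</activity>
      </setting>
    </settingDesc>
  </profileDesc>
</teiHeader>
```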

Defining the bounds of a spoken text is frequently a matter of arbitrary convention or convenience. In public or semi-public contexts, a text may be regarded as synonymous with, for example, a lecture, a broadcast item, a meeting, etc. In informal or private contexts, a text may be simply a conversation involving a specific group of participants. Alternatively, researchers may elect to define spoken texts solely in terms of their duration in time or length in words. By default, these Guidelines assume of a text only that:

it is internally cohesive,
it is describable by a single header, and
it represents a single stretch of time with no significant discontinuities.

Within a text it may be necessary to identify subdivisions of various kinds, if only for convenience of handling. The neutral div element discussed in section 4.1 Divisions of the Body is recommended for this purpose. It may be found useful also for representing subdivisions relating to discourse structure, speech act theory, transactional analysis, etc., provided only that these divisions are hierarchically well-behaved. Where they are not, as is often the case, the mechanisms discussed in chapters 16 Linking, Segmentation, and Alignment and 20 Non-hierarchical Structures may be used.

A spoken text may contain any of the following components:

utterances
pauses
vocalized but non-lexical phenomena such as coughs
kinesic (non-verbal, non-lexical) phenomena such as gestures
entirely non-linguistic incidents occurring during and possibly influencing the course of speech
writing, regarded as a special class of incident in that it can be transcribed, for example captions or overheads displayed during a lecture
shifts or changes in vocal quality

Elements to represent all of these features of spoken language are discussed in section 8.3 Elements Unique to Spoken Texts below.

An utterance (tagged u) may contain lexical items interspersed with pauses and non-lexical vocal sounds; during an utterance, non-linguistic incidents may occur and written materials may be presented. The u element can thus contain any of the other elements listed, interspersed with a transcription of the lexical items of the utterance; the other elements may all appear between utterances or next to each other, but except for writing they do not contain any other elements nor any data.
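
The following fragment is a minimal sketch of how these components might combine, with invented content and hypothetical speaker identifiers. Descriptions of the cough, the door slam, and the nod are supplied in desc children, an assumption drawn from common TEI P5 practice rather than from the paragraph above; the value p for the new attribute is likewise only an assumed convention for quieter speech.

```xml
<body>
  <!-- a pause and a non-lexical vocal sound inside an utterance -->
  <u who="#spk1">so <pause/> what I wanted to say
    <vocal><desc>coughs</desc></vocal> was this</u>

  <!-- a non-linguistic incident and a gesture between utterances -->
  <incident><desc>door slams</desc></incident>
  <kinesic><desc>second speaker nods</desc></kinesic>

  <!-- a change in vocal quality at the start of an utterance -->
  <u who="#spk2"><shift feature="loud" new="p"/>go on</u>

  <!-- written material displayed during the talk, transcribed as text -->
  <writing>Agenda: item one</writing>
</body>
```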

A spoken text itself may be without substructure, that is, it may consist simply of units such as utterances or pauses, not grouped together in any way, or it may be subdivided. If the notion of what constitutes a ‘text’ in spoken discourse is inevitably rather an arbitrary one, the notion of formal subdivisions within such a ‘text’ may appear even more debatable. Nevertheless, such divisions may be useful for such types of discourse as debates, broadcasts, etc., where structural subdivisions can easily be identified, or more generally wherever it is desired to aggregate utterances or other parts of a transcript into units smaller than a complete ‘text’. Examples might include ‘conversations’ or ‘discourse fragments’, or more narrowly, ‘that part of the conversation where topic x was discussed’, provided only that the set of all such divisions is coextensive with the text.

Each such division of a spoken text should be represented by the numbered or unnumbered div elements defined in chapter 4 Default Text Structure. For some detailed kinds of analysis a hierarchy of such divisions may be found useful; nested div elements may be used for this purpose, as in the following example showing how a collection made up of transcribed ‘sound bites’ taken from speeches given by a politician on different occasions might be encoded. Each extract is regarded as a distinct div, nested within a single composite div as follows:
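
A minimal sketch of such an encoding is given here, assuming the org and sample attributes available on div and an illustrative value for type. The speaker identifier pol and the utterance content are invented placeholders.

```xml
<div type="soundbites" org="composite">
  <!-- each extract is a separate div; sample records which part
       of the original speech it was taken from -->
  <div sample="medial">
    <u who="#pol">extract from the middle of a first speech</u>
  </div>
  <div sample="medial">
    <u who="#pol">extract from the middle of a second speech</u>
  </div>
  <div sample="initial">
    <u who="#pol">opening words of a third speech</u>
  </div>
</div>
```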

ident	supplies an identifier for the encoding convention, independent of any version number.
version	supplies a version number for the encoding conventions used, if any.

type	characterizes the element in some sense, using any convenient classification scheme or typology.
subtype	provides a sub-categorization of the element, if needed.

start	indicates the location within a temporal alignment at which this element begins.
end	indicates the location within a temporal alignment at which this element ends.
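
The start and end attributes point into a temporal alignment. A minimal sketch, assuming the timeline and when elements described in the chapter on linking and alignment, with invented identifiers and timings:

```xml
<!-- three points in time, measured in seconds from the origin t0 -->
<timeline unit="s" origin="#t0">
  <when xml:id="t0" absolute="12:00:00"/>
  <when xml:id="t1" interval="2.5" since="#t0"/>
  <when xml:id="t2" interval="4.0" since="#t0"/>
</timeline>

<!-- the utterance runs from t0 to t1, the pause from t1 to t2 -->
<u who="#spk1" start="#t0" end="#t1">shall we begin</u>
<pause start="#t1" end="#t2"/>
```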

	eventive	communicative	anthropophonic	lexical
incident	+	-	-	-
kinesic	+	+	-	-
vocal	+	+	+	-
utterance	+	+	+	+

feature	a paralinguistic feature. Suggested values include: tempo, loud, pitch, tension, rhythm, voice.
new	specifies the new state of the paralinguistic feature specified.

P5: Guidelines for Electronic Text Encoding and Interchange

8 Transcriptions of Speech
  8.1 General Considerations and Overview
  8.2 Documenting the Source of Transcribed Speech
  8.3 Elements Unique to Spoken Texts
    8.3.1 Utterances
    8.3.2 Pausing
    8.3.3 Vocal, Kinesic, Incident
    8.3.4 Writing
    8.3.5 Temporal Information
    8.3.6 Shifts
  8.4 Elements Defined Elsewhere
    8.4.1 Segmentation
    8.4.2 Synchronization and Overlap
    8.4.3 Regularization of Word Forms
    8.4.4 Prosody
    8.4.5 Speech Management
    8.4.6 Analytic Coding
  8.5 Module for Transcribed Speech