8 Transcriptions of Speech
The module described in this chapter is intended for use with a wide variety of transcribed spoken material. It should be stressed, however, that the present proposals are not intended to support unmodified every variety of research undertaken upon spoken material now or in the future; some discourse analysts, some phonologists, and doubtless others may wish to extend the scheme presented here to express more precisely the set of distinctions they wish to draw in their transcriptions. Speech regarded as a purely acoustic phenomenon may well require different methods from those outlined here, as may speech regarded solely as a process of social interaction.
This chapter begins with a discussion of some of the problems commonly encountered in transcribing spoken language (section 8.1 General Considerations and Overview). Section 8.2 Documenting the Source of Transcribed Speech documents some additional TEI header elements which may be used to document the recording or other source from which transcribed text is taken. Section 8.3 Elements Unique to Spoken Texts describes the basic structural elements provided by this module. Finally, section 8.4 Elements Defined Elsewhere of this chapter reviews further problems specific to the encoding of spoken language, demonstrating how mechanisms and elements discussed elsewhere in these Guidelines may be applied to them.
8.1 General Considerations and Overview
There is great variation in the ways different researchers have chosen to represent speech using the written medium [33]. This reflects the special difficulties which apply to the encoding or transcription of speech. Speech varies according to a large number of dimensions, many of which have no counterpart in writing (for example, tempo, loudness, pitch, etc.). The audibility of speech recorded in natural communication situations is often less than perfect, affecting the accuracy of the transcription. Spoken material may be transcribed in the course of linguistic, acoustic, anthropological, psychological, ethnographic, journalistic, or many other types of research. Even in the same field, the interests and theoretical perspectives of different transcribers may lead them to prefer different levels of detail in the transcript and different styles of visual display. The production and comprehension of speech are intimately bound up with the situation in which speech occurs, far more so than is the case for written texts. A speech transcript must therefore include some contextual features; determining which are relevant is not always simple. Moreover, the ethical problems in recording and making public what was produced in a private setting and intended for a limited audience are more frequently encountered in dealing with spoken texts than with written ones.
Speech also poses difficult structural problems. Unlike a written text, a speech event takes place in time. Its beginning and end may be hard to determine and its internal composition difficult to define. Most researchers agree that the utterances or turns of individual speakers form an important structural component in most kinds of speech, but these are rarely as well-behaved (in the structural sense) as paragraphs or other analogous units in written texts: speakers frequently interrupt each other, use gestures as well as words, leave remarks unfinished and so on. Speech itself, though it may be represented as words, frequently contains items such as vocalized pauses which, although only semi-lexical, have immense importance in the analysis of spoken text. Even non-vocal elements such as gestures may be regarded as forming a component of spoken text for some analytic purposes. Below the level of the individual utterance, speech may be segmented into units defined by phonological, prosodic, or syntactic phenomena; no clear agreement exists, however, even as to appropriate names for such segments.
Spoken texts transcribed according to the guidelines presented here are organized as follows. The overall structure of a TEI spoken text is identical to that of any other TEI text: the TEI element for a spoken text contains a teiHeader element, followed by a text element. Even texts primarily composed of transcribed speech may also include conventional front and back matter, and may even be organized into divisions like printed texts.
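As a rough illustration of this organization, a minimal spoken text might be sketched as follows. The title, participant identifiers, and wording are all invented for the example, and a real header would normally carry far more detail.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <!-- hypothetical title -->
        <title>Sample conversation transcript</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished illustrative sample.</p>
      </publicationStmt>
      <sourceDesc>
        <recordingStmt>
          <recording type="audio">
            <p>Hypothetical field recording.</p>
          </recording>
        </recordingStmt>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <particDesc>
        <!-- hypothetical participants -->
        <listPerson>
          <person xml:id="a"/>
          <person xml:id="b"/>
        </listPerson>
      </particDesc>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <u who="#a">good morning</u>
        <pause/>
        <u who="#b">morning</u>
      </div>
    </body>
  </text>
</TEI>
```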
We may say, therefore, that these Guidelines regard transcribed speech as being composed of arbitrary high-level units called texts. A spoken text might typically be a conversation between a small number of people, a lecture, a broadcast TV item, or a similar event. Each such unit has associated with it a teiHeader providing detailed contextual information such as the source of the transcript, the identity of the participants, whether the speech is scripted or spontaneous, the physical and social setting in which the discourse takes place and a range of other aspects. Details of the header in general are provided in chapter 2 The TEI Header; the particular elements it provides for use with spoken texts are described below (8.2 Documenting the Source of Transcribed Speech). Details concerning additional elements which may be used for the documentation of participant and contextual information are given in 15.2 Contextual Information.
Defining the bounds of a spoken text is frequently a matter of arbitrary convention or convenience. In public or semi-public contexts, a text may be regarded as synonymous with, for example, a lecture, a broadcast item, a meeting, etc. In informal or private contexts, a text may be simply a conversation involving a specific group of participants. Alternatively, researchers may elect to define spoken texts solely in terms of their duration in time or length in words. By default, these Guidelines assume of a text only that:
- it is internally cohesive,
- it is describable by a single header, and
- it represents a single stretch of time with no significant discontinuities.
Within a text it may be necessary to identify subdivisions of various kinds, if only for convenience of handling. The neutral div element discussed in section 4.1 Divisions of the Body is recommended for this purpose. It may be found useful also for representing subdivisions relating to discourse structure, speech act theory, transactional analysis, etc., provided only that these divisions are hierarchically well-behaved. Where they are not, as is often the case, the mechanisms discussed in chapters 16 Linking, Segmentation, and Alignment and 20 Non-hierarchical Structures may be used.
A spoken text may contain any of the following components:
- utterances
- pauses
- vocalized but non-lexical phenomena such as coughs
- kinesic (non-verbal, non-lexical) phenomena such as gestures
- entirely non-linguistic incidents occurring during and possibly influencing the course of speech
- writing, regarded as a special class of incident in that it can be transcribed, for example captions or overheads displayed during a lecture
- shifts or changes in vocal quality
Elements to represent all of these features of spoken language are discussed in section 8.3 Elements Unique to Spoken Texts below.
An utterance (tagged u) may contain lexical items interspersed with pauses and non-lexical vocal sounds; during an utterance, non-linguistic incidents may occur and written materials may be presented. The u element can thus contain any of the other elements listed, interspersed with a transcription of the lexical items of the utterance; the other elements may all appear between utterances or next to each other, but except for writing they do not contain any other elements nor any data.
A spoken text itself may be without substructure, that is, it may consist simply of units such as utterances or pauses, not grouped together in any way, or it may be subdivided. If the notion of what constitutes a ‘text’ in spoken discourse is inevitably rather an arbitrary one, the notion of formal subdivisions within such a ‘text’ may appear even more debatable. Nevertheless, such divisions may be useful for such types of discourse as debates, broadcasts, etc., where structural subdivisions can easily be identified, or more generally wherever it is desired to aggregate utterances or other parts of a transcript into units smaller than a complete ‘text’. Examples might include ‘conversations’ or ‘discourse fragments’, or more narrowly, ‘that part of the conversation where topic x was discussed’, provided only that the set of all such divisions is coextensive with the text.
<div subtype="conservative" org="composite">
<div sample="medial"/>
<div sample="medial"/>
<div sample="initial"/>
</div>
As a member of the class att.declaring, the div element may also carry a decls attribute, for use where the divisions of a text do not all share the same set of the contextual declarations specified in the TEI header. (See further section 15.3 Associating Contextual Information with a Text).
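A minimal sketch of this use of decls is given here, assuming a header that declares two settings. Both xml:id values, the speaker identifier, and the wording are hypothetical.

```xml
<!-- In the <profileDesc> of the header; both xml:id values are hypothetical. -->
<settingDesc xml:id="setHome" default="true">
  <p>Recorded in a family kitchen.</p>
</settingDesc>
<settingDesc xml:id="setSchool">
  <p>Recorded in a school classroom.</p>
</settingDesc>

<!-- In the <body>: the first division takes the default setting,
     the second selects the other one explicitly. -->
<div>
  <u who="#mar">shall we have tea</u>
</div>
<div decls="#setSchool">
  <u who="#mar">please sit down</u>
</div>
```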
8.2 Documenting the Source of Transcribed Speech
Where a computer file is derived from a spoken text rather than a written one, it will usually be desirable to record additional information about the recording or broadcast which constitutes its source. Several additional elements are provided for this purpose within the source description component of the TEI header:
- scriptStmt (script statement) contains a citation giving details of the script used for a spoken text.
- recordingStmt (recording statement) describes a set of recordings used as the basis for transcription of a spoken text.
- recording (recording event) provides details of an audio or video recording event used as the source of a spoken text, either directly or from a public broadcast.
  - type: the kind of recording.
- transcriptionDesc describes the set of transcription conventions used, particularly for spoken material.
  - ident: supplies an identifier for the encoding convention, independent of any version number.
  - version: supplies a version number for the encoding conventions used, if any.
As a member of the att.duration class, the recording element inherits the following attribute:
- att.duration.w3c provides attributes for recording normalized temporal durations.
  - dur (duration): indicates the length of this element in time.
Note that detailed information about the participants or setting of an interview or other transcript of spoken language should be recorded in the appropriate division of the profile description, discussed in chapter 15 Language Corpora, rather than as part of the source description. The source description is used to hold information only about the source from which the transcribed speech was taken, for example, any script being read and any technical details of how the recording was produced. If the source was a previously-created transcript, it should be treated in the same way as any other source text.
<sourceDesc>
<scriptStmt xml:id="CNN12">
<bibl>
<author>CNN Network News</author>
<title>News headlines</title>
<date when="1991-06-12">12 Jun 91</date>
</bibl>
</scriptStmt>
</sourceDesc>
The recordingStmt is used to group together information relating to the recordings from which the spoken text was transcribed. The element may contain either a prose description or, more helpfully, one or more recording elements, each corresponding with a particular recording. The linkage between utterances or groups of utterances and the relevant recording statement is made by means of the decls attribute, described in section 15.3 Associating Contextual Information with a Text.
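The linkage described above might be sketched as follows. This is an illustrative fragment only: the identifiers recA and recB, the dates, and the transcribed words are assumptions made for the example.

```xml
<!-- In the <sourceDesc> of the header; identifiers are hypothetical. -->
<recordingStmt>
  <recording type="audio" xml:id="recA">
    <date when="2001-02-14">14 Feb 2001</date>
  </recording>
  <recording type="audio" xml:id="recB">
    <date when="2001-02-17">17 Feb 2001</date>
  </recording>
</recordingStmt>

<!-- In the <body>: decls may be given on a division or on a single utterance. -->
<div decls="#recA">
  <u>first session begins</u>
</div>
<div>
  <u decls="#recB">second session begins</u>
</div>
```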
The recording element should be used to provide a description of how and by whom a recording was made. This information may be provided in the form of a prose description, within which such items as statements of responsibility, names, places, and dates may be identified using the appropriate phrase-level tags. Alternatively, a selection of elements from the model.recordingPart class may be provided. This element class makes available the following elements:
- date (date) contains a date in any format.
- time (time) contains a phrase defining a time of day in any format.
- respStmt (statement of responsibility) supplies a statement of responsibility for the intellectual content of a text, edition, recording, or series, where the specialized elements for authors, editors, etc. do not suffice or do not apply. May also be used to encode information about individuals or organizations which have played a role in the production or distribution of a bibliographic work.
- equipment (equipment) provides technical details of the equipment and media used for an audio or video recording used as the source for a spoken text.
- broadcast (broadcast) describes a broadcast used as the source of a spoken text.
<recordingStmt>
<recording type="video">
<p>U-matic recording made by college audio-visual department staff,
available as PAL-standard VHS transfer or sound-only cassette</p>
</recording>
</recordingStmt>
<recordingStmt>
<recording type="audio" dur="PT30M">
<respStmt>
<resp>Location recording by</resp>
<name>Sound Services Ltd.</name>
</respStmt>
<equipment>
<p>Multiple close microphones mixed down to stereo Digital
Audio Tape, standard play, 44.1 KHz sampling frequency</p>
</equipment>
<date>12 Jan 1987</date>
</recording>
</recordingStmt>
<recordingStmt>
<recording type="audio" dur="PT15M" xml:id="rec-3001">
<date>14 Feb 2001</date>
</recording>
<recording type="audio" dur="PT15M" xml:id="rec-3002">
<date>17 Feb 2001</date>
</recording>
<recording type="audio" dur="PT15M" xml:id="rec-3003">
<date>22 Feb 2001</date>
</recording>
</recordingStmt>
<recording>
<equipment>
<p>Recorded from FM Radio to digital tape</p>
</equipment>
<broadcast>
<bibl>
<title>Interview on foreign policy</title>
<author>BBC Radio 5</author>
<respStmt>
<resp>interviewer</resp>
<name>Robin Day</name>
</respStmt>
<respStmt>
<resp>interviewee</resp>
<name>Margaret Thatcher</name>
</respStmt>
<series>
<title>The World Tonight</title>
</series>
<note>First broadcast on <date when="1989-11-27">27 Nov 1989</date>
</note>
</bibl>
</broadcast>
</recording>
version="2004"/>
8.3 Elements Unique to Spoken Texts
The following elements characterize spoken texts, transcribed according to these Guidelines:
- u (utterance) contains a stretch of speech usually preceded and followed by silence or by a change of speaker.
- pause (pause) marks a pause either between or within utterances.
- vocal (vocal) marks any vocalized but not necessarily lexical phenomenon, for example voiced pauses, non-lexical backchannels, etc.
- kinesic (kinesic) marks any communicative phenomenon, not necessarily vocalized, for example a gesture, frown, etc.
- incident (incident) marks any phenomenon or occurrence, not necessarily vocalized or communicative, for example incidental noises or other events affecting communication.
- writing (writing) contains a passage of written text revealed to participants in the course of a spoken text.
- shift (shift) marks the point at which some paralinguistic feature of a series of utterances by any one speaker changes.
The u element may appear directly within a spoken text, and may contain any of the others; the others may also appear directly (for example, a vocal may appear between two utterances) but cannot contain a u element. In terms of the basic TEI model, therefore, we regard the u element as analogous to a paragraph, and the others as analogous to ‘phrase’ elements, but with the important difference that they can exist either as siblings or as children of utterances. The class model.divPart.spoken provides the u element; the class model.global.spoken provides the six other elements listed above.
As members of the att.ascribed class, all of these elements share the following attributes:
- att.ascribed provides attributes for elements representing speech or action that can be ascribed to a specific individual.
  - who: indicates the person, or group of people, to whom the element content is ascribed.
- att.ascribed.directed provides attributes for elements representing speech or action that can be directed at a group or individual.
  - toWhom: indicates the person, or group of people, to whom a speech act or action is directed.
As members of the att.typed, att.timed and att.duration classes, all of these elements except shift share the following attributes:
- att.typed provides attributes which can be used to classify or subclassify elements in any way.
  - type: characterizes the element in some sense, using any convenient classification scheme or typology.
  - subtype (subtype): provides a sub-categorization of the element, if needed.
- att.timed provides attributes common to those elements which have a duration in time, expressed either absolutely or by reference to an alignment map.
  - start: indicates the location within a temporal alignment at which this element begins.
  - end: indicates the location within a temporal alignment at which this element ends.
- att.duration.w3c provides attributes for recording normalized temporal durations.
  - dur (duration): indicates the length of this element in time.
Each of these elements is further discussed and specified in sections 8.3.1 Utterances to 8.3.4 Writing.
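The attributes listed above might be combined as in the following sketch. The speaker identifiers, the type value, and the durations are assumptions chosen for illustration; they are not prescribed values.

```xml
<!-- who and toWhom ascribe and direct the speech; dur gives a length in time -->
<u who="#ann" toWhom="#bob">are you coming
  <pause type="hesitation" dur="PT2S"/> or not</u>
<kinesic who="#bob" toWhom="#ann" dur="PT1S">
  <desc>shrugs</desc>
</kinesic>
```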
We can show the relationship between four of these constituents of speech using the features eventive, communicative, anthropophonic (for sounds produced by the human vocal apparatus), and lexical:
          | eventive | communicative | anthropophonic | lexical
incident  | +        | -             | -              | -
kinesic   | +        | +             | -              | -
vocal     | +        | +             | +              | -
utterance | +        | +             | +              | +
The differences are not always clear-cut. Among incidents might be included actions like slamming the door, which can certainly be communicative. Vocals include coughing and sneezing, which are usually involuntary noises. Equally, the distinction between utterances and vocals is not always clear, although for many analytic purposes it will be convenient to regard them as distinct. Individual scholars may differ in the way borderlines are drawn and should declare their definitions in the editorialDecl element of the header (see 2.3.3 The Editorial Practices Declaration).
<!-- ... in the <particDesc>: -->
<listPerson>
<person xml:id="mar">
<!-- ... -->
</person>
<person xml:id="ros">
<!-- ... -->
</person>
<person xml:id="fat">
<!-- ... -->
</person>
</listPerson>
<!-- ... in the <text>: -->
<u who="#mar">you
never <pause/> take this cat for show and tell
<pause/> meow meow</u>
<u who="#ros">yeah well I dont want to</u>
<incident>
<desc>toy cat has bell in tail which continues to make a tinkling sound</desc>
</incident>
<vocal who="#mar">
<desc>meows</desc>
</vocal>
<u who="#ros">because it is so old</u>
<u who="#mar">how <choice>
<orig>bout</orig>
<reg>about</reg>
</choice>
<emph>your</emph> cat <pause/>yours is <emph>new</emph>
<kinesic>
<desc>shows Father the cat</desc>
</kinesic>
</u>
<u trans="pause" who="#fat">thats <pause/> darling</u>
<u who="#mar">
<seg>no <emph>mine</emph> isnt old</seg>
<seg>mine is just um a little dirty</seg>
</u>
This example also uses some elements common to all TEI texts, notably the reg tag for editorial regularization. Unusually stressed syllables have been encoded with the emph element. The seg element has also been used to segment the last utterance. Further discussion of all such options is provided in section 8.4 Elements Defined Elsewhere.
Contextual information is of particular importance in spoken texts, and should be provided by the TEI header of a text. In general, all of the information in a header is understood to be relevant to the whole of the associated text. The element u, as a member of the att.declaring class, may however specify a different context by means of the decls attribute (see further section 15.3 Associating Contextual Information with a Text).
8.3.1 Utterances
Each distinct utterance in a spoken text is represented by a u element, described as follows:
- u (utterance) contains a stretch of speech usually preceded and followed by silence or by a change of speaker.
  - trans (transition): indicates the nature of the transition between this utterance and the previous one.
Use of the who attribute to associate the utterance with a particular speaker is recommended but not required. Its use implies as a further requirement that all speakers be identified by a person or personGrp element, typically in the TEI header (see section 15.2.2 The Participant Description), but it may also point to another external source of information about the speaker. Where utterances or other parts of the transcription cannot be attributed with confidence to any particular participant or group of participants, the encoder may choose to create personGrp elements with xml:id attributes such as various or unknown, and perhaps give the root listPerson element an xml:id value of all, then point to those as appropriate using who.
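The convention suggested above for unattributable speech could be sketched like this. The identifiers a and b and the transcribed words are invented; various, unknown, and all follow the naming suggested in the text.

```xml
<!-- In the <particDesc>; every xml:id value here is illustrative. -->
<listPerson xml:id="all">
  <person xml:id="a">
    <!-- ... -->
  </person>
  <person xml:id="b">
    <!-- ... -->
  </person>
  <personGrp xml:id="various"/>
  <personGrp xml:id="unknown"/>
</listPerson>

<!-- In the <text>: who points at a person, a group, or the whole list. -->
<u who="#a">shall we start</u>
<u who="#unknown">not yet</u>
<u who="#all">good morning</u>
```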
<u xml:id="ts_b1" trans="latching" who="#b">the election results? yes</u>
<u xml:id="ts_a2" trans="pause" who="#a">it's a disaster</u>
<u xml:id="ts_b2" trans="overlap" who="#b">it's a miracle</u>
An utterance may contain either running text, or text within which other basic structural elements are nested. Where such nesting occurs, the who attribute is considered to be inherited for the elements pause, vocal, shift and kinesic; that is, a pause or shift (etc.) within an utterance is regarded as being produced by that speaker only, while a pause between utterances applies to all speakers.
<u>[...] confident, he said, that the current economic problems will be
completely overcome by June<shift new="normal"/> what nonsense</u>
<u>[...] <incident>
<desc>reads aloud from newspaper</desc>
</incident> what
nonsense</u>
<u>[...] <vocal>
<desc>tut-tutting</desc>
</vocal> about it anyway?</u>
8.3.2 Pausing
- pause (pause) marks a pause either between or within utterances.
<u>[...] <pause dur="PT50S"/> with <pause dur="PT20S"/> um <pause dur="PT145S"/> you see
a tree okay?</u>
8.3.3 Vocal, Kinesic, Incident
The presence of non-transcribed semi-lexical or non-lexical phenomena either between or within utterances may be indicated with the following three elements.
- vocal (vocal) marks any vocalized but not necessarily lexical phenomenon, for example voiced pauses, non-lexical backchannels, etc.
- kinesic (kinesic) marks any communicative phenomenon, not necessarily vocalized, for example a gesture, frown, etc.
- incident (incident) marks any phenomenon or occurrence, not necessarily vocalized or communicative, for example incidental noises or other events affecting communication.
The who attribute should be used to specify the person or group responsible for a vocal, kinesic, or incident which is contained within an utterance, if this differs from that of the enclosing utterance. The attribute must be supplied for a vocal, kinesic, or incident which is not contained within an utterance.
The iterated attribute may be used to indicate that the vocal, kinesic, or incident is repeated, for example laughter as opposed to laugh. These should both be distinguished from laughing, where what is being encoded is a shift in voice quality. For this last case, the shift element discussed in section 8.3.6 Shifts should be used.
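The three cases distinguished above might be encoded as in this sketch. The speakers and wording are hypothetical; the value laugh for the voice feature is taken from the list of suggested values in section 8.3.6 Shifts.

```xml
<!-- a single laugh within an utterance -->
<u who="#ann">that was <vocal>
    <desc>laugh</desc>
  </vocal> unexpected</u>

<!-- repeated laughter between utterances -->
<vocal who="#bob" iterated="true">
  <desc>laughter</desc>
</vocal>

<!-- speech produced while laughing: a shift in voice quality -->
<u who="#ann">
  <shift feature="voice" new="laugh"/>you should have seen your face
  <shift feature="voice" new="normal"/>
</u>
```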
A child desc element may be used to supply a conventional representation for the phenomenon, for example:
- non-lexical: burp, click, cough, exhale, giggle, gulp, inhale, laugh, sneeze, sniff, snort, sob, swallow, throat, yawn
- semi-lexical: ah, aha, aw, eh, ehm, er, erm, hmm, huh, mm, mmhm, oh, ooh, oops, phew, tsk, uh, uh-huh, uh-uh, um, urgh, yup
Researchers may prefer to regard some semi-lexical phenomena as ‘words’ within the bounds of the u element. See further the discussion at section 8.4.3 Regularization of Word Forms below. As for all basic categories, the definition should be made clear in the encodingDesc element of the TEI header.
<incident>
<desc>telephone rings</desc>
</incident>
<u who="#ann">I'll get it</u>
<u who="#tom">I used to <vocal>
<desc>cough</desc>
</vocal> smoke a lot</u>
<u who="#bob">
<vocal>
<desc>sniffs</desc>
</vocal>He thinks he's tough
</u>
<vocal who="#ann">
<desc>snorts</desc>
</vocal>
<!-- ... elsewhere, e.g., in the <particDesc>: -->
<listPerson>
<person xml:id="ann">
<!-- ... -->
</person>
<person xml:id="bob">
<!-- ... -->
</person>
<person xml:id="jan">
<!-- ... -->
</person>
<person xml:id="kim">
<!-- ... -->
</person>
<person xml:id="tom">
<!-- ... -->
</person>
</listPerson>
The extent to which incidents or kinesics are encoded in a transcription will depend on the purpose for which the transcription was made: that is, on the particular research agenda and on how significant such phenomena are felt to be for the interpretation of spoken interactions.
8.3.4 Writing
- writing (writing) contains a passage of written text revealed to participants in the course of a spoken text.
  - gradual: indicates whether the writing is revealed all at once or gradually.
- att.global.source provides an attribute used by elements to point to an external source.
<writing who="#a" type="newspaper"
gradual="false">Government claims economic problems
<soCalled>over by June</soCalled>
</writing>
<u who="#a">what nonsense!</u>
<sourceDesc>
<!-- ...-->
<bibl xml:id="FOL1">Shakespeare First Folio text</bibl>
<bibl xml:id="FOL2">Shakespeare Second Folio text</bibl>
<!-- ...-->
</sourceDesc>
<!-- ...-->
<u>[...] now compare the punctuation of lines 12 and 14 in these two
versions of page 42...
<writing source="#FOL1">[...]</writing>
<writing source="#FOL2">[...]</writing>
</u>
8.3.5 Temporal Information
As noted above, utterances, vocals, pauses, kinesics, incidents, and writing elements all inherit attributes providing information about their position in time from the classes att.timed and att.duration. These attributes can be used to link parts of the transcription very exactly with points on a timeline, or simply to indicate their duration. Note that if start and end point to when elements whose temporal distance from each other is specified in a timeline, then dur is ignored.
The anchor element (see 16.5 Correspondence and Alignment) may be used as an alternative means of aligning the start and end of timed elements, and is required when the temporal alignment involves points within an element.
For further discussion of temporal alignment and synchronization see 8.4.2 Synchronization and Overlap below.
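As a hedged sketch of these options, the fragment below links two elements to a timeline with start and end, places an anchor at a point inside an utterance, and gives a bare duration for a third element. All identifiers, times, and words are invented for the example.

```xml
<!-- a timeline whose points are measured in seconds from the origin t0 -->
<timeline xml:id="tl1" origin="#t0" unit="s">
  <when xml:id="t0" absolute="10:15:00"/>
  <when xml:id="t1" interval="2.5" since="#t0"/>
  <when xml:id="t2" interval="4.0" since="#t0"/>
  <when xml:id="t3" interval="5.5" since="#t0"/>
</timeline>

<!-- start and end align whole elements; the anchor aligns a point within one -->
<u who="#ann" start="#t0" end="#t2">I think we should
  <anchor synch="#t1"/>leave now</u>
<pause start="#t2" end="#t3"/>

<!-- no timeline reference: only the duration is recorded -->
<u who="#bob" dur="PT3S">fine by me</u>
```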
8.3.6 Shifts
- shift (shift) marks the point at which some paralinguistic feature of a series of utterances by any one speaker changes.
  - feature: a paralinguistic feature. Suggested values include: tempo, loud, pitch, tension, rhythm, voice.
  - new: specifies the new state of the paralinguistic feature specified.
<u>
<shift feature="loud" new="f"/>Elizabeth
</u>
<u>Yes</u>
<u>
<shift feature="loud" new="normal"/>Come and try this <pause/>
<shift feature="loud" new="ff"/>come on
</u>
The values proposed here for the feature attribute are based on those used by the Survey of English Usage (see further Boase 1990); this list may be revised or supplemented using the methods outlined in section 23.3 Customization.
The new attribute specifies the new state of the feature following the shift. If this attribute has the special value normal, the implication is that the feature concerned ceases to be remarkable at this point.
A list of suggested values for each of the features proposed follows:
- tempo:
  - a: allegro (fast)
  - aa: very fast
  - acc: accelerando (getting faster)
  - l: lento (slow)
  - ll: very slow
  - rall: rallentando (getting slower)
- loud (for loudness):
  - f: forte (loud)
  - ff: very loud
  - cresc: crescendo (getting louder)
  - p: piano (soft)
  - pp: very soft
  - dimin: diminuendo (getting softer)
- pitch (for pitch range):
  - high: high pitch-range
  - low: low pitch-range
  - wide: wide pitch-range
  - narrow: narrow pitch-range
  - asc: ascending
  - desc: descending
  - monot: monotonous
  - scand: scandent, each succeeding syllable higher than the last, generally ending in a falling tone
- tension:
  - sl: slurred
  - lax: lax, a little slurred
  - ten: tense
  - pr: very precise
  - st: staccato, every stressed syllable being doubly stressed
  - leg: legato, every syllable receiving more or less equal stress
- rhythm:
  - rh: beatable rhythm
  - arrh: arrhythmic, particularly halting
  - spr: spiky rising, with markedly higher unstressed syllables
  - spf: spiky falling, with markedly lower unstressed syllables
  - glr: glissando rising, like spiky rising but the unstressed syllables, usually several, also rise in pitch relative to each other
  - glf: glissando falling, like spiky falling but with the unstressed syllables also falling in pitch relative to each other
- voice (for voice quality):
  - whisp: whisper
  - breath: breathy
  - husk: husky
  - creak: creaky
  - fals: falsetto
  - reson: resonant
  - giggle: unvoiced laugh or giggle
  - laugh: voiced laugh
  - trem: tremulous
  - sob: sobbing
  - yawn: yawning
  - sigh: sighing
A full definition of the sense of the values provided for each feature may be provided either in the encoding description section of the text header (see section 2.3 The Encoding Description) or as part of a TEI customization, as described in section 23.3 Customization.
8.4 Elements Defined Elsewhere
This section describes the following features characteristic of spoken texts for which elements are defined elsewhere in these Guidelines:
- segmentation below the utterance level
- synchronization and overlap
- regularization of orthography
The elements discussed here are not provided by the module for spoken texts. Some of them are included in the core module and others are contained in the modules for linking and for analysis respectively. The selection of modules and their combination to define a TEI schema is discussed in section 1.2 Defining a TEI Schema.
8.4.1 Segmentation
For some analytic purposes it may be desirable to subdivide the divisions of a spoken text into units smaller than the individual utterance or turn. Segmentation may be performed for a number of different purposes and in terms of a variety of speech phenomena. Common examples include units defined both prosodically (by intonation, pausing, etc.) and syntactically (clauses, phrases, etc.). The term macrosyntagm has been used by a number of researchers to define units peculiar to speech transcripts [36].
These Guidelines propose that such analyses be performed in terms of neutrally-named segments, represented by the seg element, which is discussed more fully in section 16.3 Blocks, Segments, and Anchors. This element may take a type attribute to specify the kind of segmentation applicable to a particular segment, if more than one is possible in a text. A full definition of the segmentation scheme or schemes used should be provided in the segmentation element of the editorialDecl element in the TEI header (see 2.3.3 The Editorial Practices Declaration).
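A declaration of the segmentation scheme might look something like the following sketch. The two type values (tone and clause) and the wording of the description are assumptions made for illustration.

```xml
<!-- In the <encodingDesc> of the header; the scheme described is hypothetical. -->
<editorialDecl>
  <segmentation>
    <p>Each utterance is divided into seg elements. Segments with
      type="tone" correspond to tone units; segments with type="clause"
      correspond to syntactic clauses.</p>
  </segmentation>
</editorialDecl>
```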
<u>
<seg>we went to the pub yesterday</seg>
<pause/>
<seg>there was no one there</seg>
</u>
<u>
<seg>although its an old ide´a</seg>
<seg>it hasnt been on the mar´ket very long</seg>
</u>
When utterances are segmented end-to-end in the same way as the s-units in written texts, the s element discussed in chapter 17 Simple Analytic Mechanisms may be used, either as an alternative or in addition to the more general purpose seg element. The s element is available without formality in all texts, but does not allow segments to nest within each other.
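For comparison, an utterance segmented end-to-end with s might be encoded as in this sketch, in which one s-unit is further divided with seg. The speaker identifier and the type value are hypothetical.

```xml
<u who="#a">
  <!-- s elements follow one another and may not contain other s elements -->
  <s>we went to the pub yesterday</s>
  <pause/>
  <!-- an s element may still contain the more general seg element -->
  <s>
    <seg type="clause">there was no one there</seg>
    <seg type="clause">so we left</seg>
  </s>
</u>
```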
<u>
<seg type="C">I think </seg>
<seg type="C">this chap was writing </seg>
<seg type="C">and he <del type="repeated">said hello</del> said </seg>
<seg type="M">hello </seg>
<seg type="C">and he said </seg>
<seg type="C">I'm going to a gate
at twenty past seven </seg>
<seg type="C">he said </seg>
<seg type="M">ok </seg>
<seg type="M">right away </seg>
<seg type="C">and so <gap extent="1 syll"/> on they went </seg>
<seg type="C">and they were <gap extent="3 sylls"/>
writing there </seg>
</u>
In this example the type attribute distinguishes two kinds of segment: those marked type="C" and minor units (type="M").
<u>
<!-- ... -->
<seg type="C">and he said </seg>
<seg type="C">I'm going to a
<ext:paraphasia>gate</ext:paraphasia>
at twenty past seven </seg>
<!-- ... -->
</u>
This example also uses the core elements gap and del to mark editorial decisions concerning matter completely omitted from the transcript (because of inaudibility), and words which have been transcribed but which the transcriber wishes to exclude from the segment because they are repeated, respectively. See section 3.5 Simple Editorial Changes for a discussion of these and related elements.
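A typed segmentation of this kind lends itself to simple processing. The following Python sketch is illustrative only and forms no part of these recommendations: it groups the seg children of an utterance by the value of their type attribute. The fragment in SAMPLE and the function name segments_by_type are hypothetical, and the fragment omits the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of the example above (no TEI namespace declared).
SAMPLE = """<u>
<seg type="C">I think </seg>
<seg type="C">this chap was writing </seg>
<seg type="M">hello </seg>
<seg type="M">ok </seg>
</u>"""


def segments_by_type(u):
    """Group the direct seg children of an utterance by their type attribute."""
    groups = {}
    for seg in u.findall("seg"):
        # Collapse line breaks and runs of spaces left over from the markup layout.
        text = " ".join("".join(seg.itertext()).split())
        groups.setdefault(seg.get("type", "untyped"), []).append(text)
    return groups


groups = segments_by_type(ET.fromstring(SAMPLE))
print(groups)
```

Only direct children are examined, so nested seg elements would need separate treatment; a real document would also require namespace-qualified element names.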
It is often the case that the desired segmentation does not respect utterance boundaries; for example, syntactic units may cross utterance boundaries. For a detailed discussion of this problem, and the various methods proposed by these Guidelines for handling it, see chapter 20 Non-hierarchical Structures. Methods discussed there include these:
- ‘milestone’ tags may be used; the special-purpose shift tag discussed in section 8.3.6 Shifts is an extension of this method
- where several discontinuous segments are to be grouped together to form a syntactic unit (e.g. a phrasal verb with interposed complement), the join element may be used
8.4.2 Synchronization and Overlap
<listPerson>
<person xml:id="stig">
<!-- ... -->
</person>
<person xml:id="lou">
<!-- ... -->
</person>
<person xml:id="jane">
<!-- ... -->
</person>
</listPerson>
<u xml:id="utt2" who="#stig">yes</u>
<kinesic xml:id="k1" who="#lou"
iterated="true">
<desc>nods head vertically</desc>
</kinesic>
For a full discussion of this and related mechanisms, section 16.4.2 Placing Synchronous Events in Time should be consulted. The rest of the present section, which should be read in conjunction with that more detailed discussion, presents a number of ways in which these mechanisms may be applied to the specific problem of representing temporal alignment, synchrony, or overlap in transcribing spoken texts.
In the simple example above, the first utterance (that with identifier utt1) contains an anchor element, the function of which is simply to mark a point within it. The synch attribute associated with this anchor point specifies the identifiers of the other two elements which are to be synchronized with it: specifically, the second utterance (utt2) and the kinesic (k1). Note that one of these elements has content and the other is empty.
This example demonstrates only a way of indicating a point within one utterance at which it can be synchronized with another utterance and a kinesic. For more complex kinds of alignment, involving possibly multiple synchronization points, an additional element is provided, known as a timeline. This consists of a series of when elements, each representing a point in time, and bearing attributes which indicate its exact temporal position relative to other elements in the same timeline, in addition to the sequencing implied by its position within it.
<timeline>
<when xml:id="TS-P1"
absolute="12:20:01+01:00"/>
<when xml:id="TS-P2" interval="4.5"
since="#TS-P1"/>
<when xml:id="TS-P6"/>
<when xml:id="TS-P3" interval="1.5"
since="#TS-P6"/>
</timeline>
One or more such timelines may be specified within a spoken text, to suit the encoder's convenience. If more than one is supplied, the origin attribute may be used on each to specify which other timeline element it follows. The unit attribute indicates the units used for timings given on the when elements contained by the timeline. Alternatively, to avoid the need to specify times explicitly, the interval attribute may be used to indicate that all the when elements in a timeline are a fixed distance apart.
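The way in which interval and since chain together can be made concrete with a short Python sketch. It is a minimal illustration under stated assumptions, not a definitive implementation: it assumes that origin points at a when element (as in the examples of this section), that interval values are plain numbers expressed in the timeline's unit, and that the fragment carries no TEI namespace. The identifiers w1 to w4 and the function name offsets_from_origin are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical timeline; identifiers are invented for illustration.
SAMPLE = """<timeline unit="s" origin="#w1">
<when xml:id="w1" absolute="12:20:01+01:00"/>
<when xml:id="w2" interval="4.5" since="#w1"/>
<when xml:id="w3"/>
<when xml:id="w4" interval="1.5" since="#w3"/>
</timeline>"""


def offsets_from_origin(timeline):
    """Map each when's xml:id to its offset from the origin, or None if unknown."""
    whens = {w.get(XML_ID): w for w in timeline.findall("when")}
    origin = timeline.get("origin", "").lstrip("#")
    offsets = {}

    def resolve(ident, trail):
        if ident in offsets:
            return offsets[ident]
        when = whens.get(ident)
        value = None
        if ident == origin:
            value = 0.0
        elif when is not None and ident not in trail:
            since = when.get("since", "").lstrip("#")
            try:
                step = float(when.get("interval", ""))
            except ValueError:
                step = None  # attribute missing, or a non-numeric value
            base = resolve(since, trail | {ident}) if since else None
            if step is not None and base is not None:
                value = base + step
        offsets[ident] = value
        return value

    for ident in whens:
        resolve(ident, frozenset())
    return offsets


offsets = offsets_from_origin(ET.fromstring(SAMPLE))
print(offsets)
```

A when element that carries no timing information, such as w3 here, is ordered only by its position in the timeline, so neither it nor any point measured from it receives a numeric offset.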
Three methods are available for aligning points or elements within a spoken text with the points in time defined by the timeline:
- The elements to be synchronized may specify the identifier of a when element as the value of one of the start, end, or synch attributes
- The when element may specify the identifiers of all the elements to be synchronized with it using the synch attribute
- A free-standing link element may be used to associate the when element and the elements synchronized with it by specifying their identifiers as values for its target attribute.
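The three methods listed above record the same information in different places, so a processor may wish to gather them into a single view. The Python sketch below is one possible approach, offered tentatively: for each when element it collects the identifiers of the elements aligned with it, whichever method was used. It does not distinguish start from end, treats every link as a synchronization link regardless of the type of its linkGrp, and handles only local pointers of the form #id. The fragment, its identifiers, and the function name synchronized_with are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment mixing all three methods (no TEI namespace declared).
SAMPLE = """<body>
<timeline><when xml:id="t1"/><when xml:id="t2" synch="#u2"/></timeline>
<u xml:id="u1" start="#t1">this is <anchor xml:id="a1" synch="#t2"/> my turn</u>
<u xml:id="u2">balderdash</u>
<u xml:id="u3">no</u>
<linkGrp type="synchronous"><link target="#t2 #u3"/></linkGrp>
</body>"""


def synchronized_with(root):
    """Map each when's xml:id to the set of element ids aligned with it."""
    when_ids = {w.get(XML_ID) for w in root.iter("when")}
    aligned = defaultdict(set)
    for el in root.iter():
        ident = el.get(XML_ID)
        if el.tag == "when":
            # The when element lists its partners in its own synch attribute.
            for ref in el.get("synch", "").split():
                aligned[ident].add(ref.lstrip("#"))
        elif el.tag == "link":
            # A free-standing link names a when together with its partners.
            refs = [r.lstrip("#") for r in el.get("target", "").split()]
            for w in (r for r in refs if r in when_ids):
                aligned[w].update(r for r in refs if r not in when_ids)
        else:
            # The element itself points at a when via start, end, or synch.
            for attr in ("start", "end", "synch"):
                for ref in el.get(attr, "").split():
                    if ref.lstrip("#") in when_ids and ident:
                        aligned[ref.lstrip("#")].add(ident)
    return dict(aligned)


alignment = synchronized_with(ET.fromstring(SAMPLE))
print(alignment)
```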
end="#TS-P3">This is my <anchor synch="#TS-P6" xml:id="TS-P6A"/> turn</u>
<timeline>
<when xml:id="ts-p1"
absolute="12:20:01+01:00"/>
<when synch="#ts-u1" xml:id="ts-p2"
interval="4.5" since="#ts-p1"/>
<when synch="#ts-x1" xml:id="ts-p6"/>
<when synch="#ts-u1" xml:id="ts-p3"
interval="1.5" since="#ts-p6"/>
</timeline>
<u xml:id="ts-u1">This is my <anchor xml:id="ts-x1"/> turn</u>
<timeline>
<when xml:id="TS-p1" absolute="12:20:01"/>
<when xml:id="TS-p2" interval="4.5"
since="#TS-p1"/>
<when xml:id="TS-p6"/>
<when xml:id="TS-p3" interval="1.5"
since="#TS-p6"/>
</timeline>
<u xml:id="TS-u1">
<anchor xml:id="TS-u1start"/>
This is my <anchor xml:id="TS-x1"/> turn
<anchor xml:id="TS-u1end"/>
</u>
<linkGrp type="synchronous">
<link target="#TS-u1start #TS-p1"/>
<link target="#TS-u1end #TS-p2"/>
<link target="#TS-x1 #TS-p6"/>
</linkGrp>
<anchor xml:id="TS-p20"/>but I never inhaled the smoke</u>
<u start="#TS-p10" end="#TS-p20" who="#bob">You used to smoke</u>
<anchor synch="#TS-p10"/>You used to smoke<anchor synch="#TS-p20"/>
</u>
<timeline>
<when xml:id="TS-t01" absolute="15:33:01Z"/>
<when xml:id="TS-t02" interval="2.5"
since="#TS-t01"/>
</timeline>
<u who="#tom">I used to smoke
<anchor synch="#TS-t01"/>a lot more than this
<anchor synch="#TS-t02"/>but I never inhaled the smoke</u>
<u who="#bob">
<anchor synch="#TS-t01"/>You used to smoke<anchor synch="#TS-t02"/>
</u>
<timeline>
<when synch="#TS-nm1 #bob-u2"
xml:id="TS-T01"/>
<when synch="#TS-nm2 #bob-u2"
xml:id="TS-T02"/>
</timeline>
<u who="#tom">I used to smoke
<anchor xml:id="TS-nm1"/>a lot more than this
<anchor xml:id="TS-nm2"/>but I never inhaled the smoke</u>
<u xml:id="bob-u2" who="#bob">You used to smoke</u>
<body>
<timeline origin="#T001">
<when xml:id="T001"/>
<when xml:id="T002"/>
</timeline>
<u who="#tom">I used to smoke
<anchor xml:id="NM01"/>a lot more than this
<anchor xml:id="NM02"/>but I never inhaled the smoke</u>
<u xml:id="bob-U2" who="#bob">You used to smoke</u>
<linkGrp type="synchronize">
<link target="#T001 #NM01 #bob-U2"/>
<link target="#T002 #NM02 #bob-U2"/>
</linkGrp>
</body>
Note that in each case, although Bob's utterance follows Tom's sequentially in the text, it is aligned temporally with its middle, without any need to disrupt the normal syntax of the text.
<timeline>
<when synch="#TSa1 #TSb1 #TSc1"
xml:id="TSp1"/>
<when synch="#TSa2 #TSc2" xml:id="TSp2"/>
</timeline>
<!-- ... -->
<u who="#stig">this is <anchor xml:id="TSa1"/> my <anchor xml:id="TSa2"/> turn</u>
<u who="#jane" xml:id="TSb1">balderdash</u>
<u who="#lou" xml:id="TSc1"> no <anchor xml:id="TSc2"/> it's mine</u>
8.4.3 Regularization of Word Forms
When speech is transcribed using ordinary orthographic notation, as is customary, some compromise must be made between the sounds produced and conventional orthography. Particularly when dealing with informal, dialectal, or other varieties of language, the transcriber will frequently have to decide whether a particular sound is to be treated as a distinct vocabulary item or not. For example, while in a given project kinda may not be worth distinguishing as a vocabulary item from kind of, isn't may clearly be worth distinguishing from is not; for some purposes, the regional variant isnae might also be worth distinguishing in the same way.
One rule of thumb might be to allow such variation only where a generally accepted orthographic form exists, for example, in published dictionaries of the language register being encoded; this has the disadvantage that such dictionaries may not exist. Another is to maintain a controlled (but extensible) set of normalized forms for all such words; this has the advantage of enforcing some degree of consistency among different transcribers. Occasionally, as for example when transcribing abbreviations or acronyms, it may be felt necessary to depart from conventional spelling to distinguish between cases where the abbreviation is spelled out letter by letter (e.g. B B C or V A T) and where it is pronounced as a single word (VAT or RADA). Similar considerations might apply to pronunciation of foreign words (e.g. Monsewer vs. Monsieur).
In general, use of punctuation, capitalization, etc., in spoken transcripts should be carefully controlled. It is important to distinguish the transcriber's intuition as to what the punctuation should be from the marking of prosodic features such as pausing, intonation, etc.
Whatever practice is adopted, it is essential that it be clearly and fully documented in the editorial declarations section of the header. It may also be found helpful to include normalized forms of non-conventional spellings within the text, using the elements for simple editorial changes described in section 3.5 Simple Editorial Changes (see further section 8.4.5 Speech Management).
8.4.4 Prosody
In the absence of conventional punctuation, the marking of prosodic features assumes paramount importance, since these structure and organize the spoken message. Indeed, such prosodic features as points of primary or secondary stress may be represented by specialized punctuation marks, or other characters such as those provided by the Unicode Spacing Modifier Letters block. Pauses have already been dealt with in section 8.3.2 Pausing, while tone units (or intonational phrases) can be indicated by the segmentation tag discussed in section 8.4.1 Segmentation. The shift element discussed in section 8.3.6 Shifts may also be used to encode some prosodic features, for example where all that is required is the ability to record shifts in voice quality.
In a more detailed phonological transcript, it is common practice to include a number of conventional signs to mark prosodic features of the surrounding or (more usually) preceding speech. Such signs may be used to record, for example, particular intonation patterns, truncation, vowel quality (long or short) etc. These signs may be preserved in a transcript either by using conventional punctuation or by marking their presence by g elements. Where a transcript includes many phonetic or phonemic aspects, it will generally be more convenient to use the appropriate Unicode characters (see further chapters vi. Languages and Character Sets and 5 Characters, Glyphs, and Writing Modes). For representation of phonemic information, the use of the International Phonetic Alphabet, which can be represented in Unicode characters, is recommended.
<charDecl>
<char xml:id="lf">
<desc>low fall intonation</desc>
</char>
<char xml:id="lr">
<desc>low rise intonation</desc>
</char>
<char xml:id="fr">
<desc>fall rise intonation</desc>
</char>
<char xml:id="rf">
<desc>rise fall intonation</desc>
</char>
<char xml:id="long">
<desc>lengthened syllable</desc>
</char>
<char xml:id="short">
<desc>shortened syllable</desc>
</char>
</charDecl>
<!-- ... in the <particDesc>: -->
<listPerson>
<person xml:id="cwn">
<p>Customer WN</p>
</person>
<person xml:id="aj">
<p>Assistant K</p>
</person>
</listPerson>
<!-- ... within the <text>: -->
<div n="Lod E-03" type="exchange">
<note>C is with a friend</note>
<u who="#cwn">
<unclear>Excuse me<g ref="#lf"/>
</unclear>
<pause/> You dont have some
aesthetic<g ref="#short"/>
<pause/>
<unclear>specially on early</unclear>
aesthetics terminology <g ref="#lr"/>
</u>
<u who="#aj"> No<g ref="#lf"/>
<pause/>No<g ref="#lf"/>
<gap extent="2 beats"/> I'm
afraid<g ref="#lf"/>
</u>
<u trans="latching" who="#cwn"> No<g ref="#lr"/>
<unclear>Well</unclear> thanks<g ref="#lr"/>
<pause/> Oh<g ref="#short"/>
<unclear>you couldnt<g ref="#short"/> can we</unclear> kind of<g ref="#long"/>
<pause/>I mean ask you to order it for us<g ref="#long"/>
<g ref="#fr"/>
</u>
<u trans="latching" who="#aj"> Yes<g ref="#fr"/> if you know the title<g ref="#lf"/> Yeah<g ref="#lf"/>
</u>
<u who="#cwn">
<gap extent="4 beats"/>
</u>
<u who="#aj"> Yes thats fine. <unclear>just as soon as it comes in we'll send
you a postcard<g ref="#lf"/>
</unclear>
</u>
</div>
This example, which is taken from a corpus of bookshop service encounters, also demonstrates the use of the unclear and gap elements discussed in section 3.5 Simple Editorial Changes. Where words are so unclear that only their extent can be recorded, the empty gap element may be used; where the encoder can identify the words but wishes to record a degree of uncertainty about their accuracy, the unclear element may be used. More flexible and detailed methods of indicating uncertainty are discussed in chapter 21 Certainty, Precision, and Responsibility.
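Because each g element merely points at a char declaration, a processor must follow the pointer to recover what a given sign means. The following Python sketch suggests one way of doing so; it is an illustration under simplifying assumptions rather than a prescribed procedure. The wrapper element, the fragment in SAMPLE, and the function name prosodic_marks are hypothetical; the TEI namespace is omitted, and only local pointers of the form #id are resolved.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, much reduced document: declarations and one utterance side by side.
SAMPLE = """<TEI>
<charDecl>
<char xml:id="lf"><desc>low fall intonation</desc></char>
<char xml:id="short"><desc>shortened syllable</desc></char>
</charDecl>
<u who="#aj">No<g ref="#lf"/><pause/>No<g ref="#lf"/> I'm afraid<g ref="#short"/></u>
</TEI>"""


def prosodic_marks(root):
    """List (speaker, description) pairs for every g element, in document order."""
    descs = {c.get(XML_ID): c.findtext("desc", default="") for c in root.iter("char")}
    marks = []
    for u in root.iter("u"):
        for g in u.iter("g"):
            key = g.get("ref", "").lstrip("#")
            # Flag pointers that have no matching char declaration.
            marks.append((u.get("who"), descs.get(key, "undeclared: " + key)))
    return marks


marks = prosodic_marks(ET.fromstring(SAMPLE))
print(marks)
```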
For more detailed work, involving a full phonological transcript with representation of stress and pitch patterns, it is probably best to maintain the prosodic description in parallel with the conventional written transcript, rather than attempt to embed detailed prosodic information within it. The two parallel streams may be aligned with each other and with other streams, for example an acoustic encoding, using the general alignment mechanisms discussed in section 8.4.2 Synchronization and Overlap.
8.4.5 Speech Management
Phenomena of speech management include disfluencies such as filled and unfilled pauses, interrupted or repeated words, corrections, and reformulations as well as interactional devices asking for or providing feedback. Depending on the importance attached to such features, transcribers may choose to adopt conventionalized representations for them (as discussed in section 8.4.3 Regularization of Word Forms above), or to transcribe them using IPA or some other transcription system. To simplify analysis of the lexical features of a speech transcript, it may be felt useful to ‘tidy away’ many of these disfluencies. Where this policy has been adopted, these Guidelines recommend the use of the tags for simple editorial intervention discussed in section 3.5 Simple Editorial Changes, to make explicit the extent of regularization or normalization performed by the transcriber.
<u>
<del type="truncation">s</del>see
<del type="repetition">you you</del> you know
<del type="falseStart">it's</del> he's crazy
</u>
unit="s"/>
<pause dur="PT1S"/> vielleicht </foreign> go to warsaw
and <emph>vienna</emph>
</u>
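Marking disfluencies with del, rather than silently omitting them, means that both the tidied and the untidied reading of an utterance remain recoverable. The Python sketch below illustrates this under simple assumptions: it returns the text of an utterance with or without the content of its del elements. The function name transcript_text and its keep_deleted parameter are hypothetical, and the fragment omits the TEI namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment in the style of the del example above.
SAMPLE = """<u><del type="truncation">s</del>see
<del type="repetition">you you</del> you know
<del type="falseStart">it's</del> he's crazy</u>"""


def transcript_text(el, keep_deleted=False):
    """Return the text of an element, optionally including the content of del."""
    parts = []

    def walk(node):
        # Skip the content of del unless the caller asks for it.
        if node.tag != "del" or keep_deleted:
            parts.append(node.text or "")
            for child in node:
                walk(child)
                # Text following a child belongs to the parent, so always keep it.
                parts.append(child.tail or "")

    walk(el)
    return " ".join("".join(parts).split())


utterance = ET.fromstring(SAMPLE)
tidy = transcript_text(utterance)
verbatim = transcript_text(utterance, keep_deleted=True)
print(tidy)
print(verbatim)
```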
8.4.6 Analytic Coding
The recommendations made here only concern the establishment of a basic text. Where a more sophisticated analysis is needed, more sophisticated methods of markup will also be appropriate, for example, using stand-off markup to indicate multiple segmentation of the stream of discourse, or complex alignment of several segments within it. Where additional annotations (sometimes called ‘codes’ or ‘tags’) are used to represent such features as linguistic word class (noun, verb, etc.), type of speech act (imperative, concessive, etc.), or information status (theme/rheme, given/new, active/semi-active/new), etc., a selection from the general purpose analytic tools discussed in chapters 16 Linking, Segmentation, and Alignment, 17 Simple Analytic Mechanisms, and 18 Feature Structures may be used to advantage.
The general-purpose annotationBlock element may be used to group together a transcription and multiple layers of annotation. It also serves to divide a transcribed text up into meaningful analytic sections.
- annotationBlock groups together various annotations, e.g. for parallel interpretations of a spoken segment.
8.5 Module for Transcribed Speech
The module described in this chapter makes available the following components:
- Module spoken: Transcribed Speech
- Elements defined: annotationBlock broadcast equipment incident kinesic pause recording recordingStmt scriptStmt shift transcriptionDesc u vocal writing
- Classes defined: att.duration model.divPart.spoken model.global.spoken model.recordingPart
The selection and combination of modules to form a TEI schema is described in 1.2 Defining a TEI Schema.