Text Encoding Initiative
11. Names, Dates, Numbers and Abbreviations
The TEI scheme defines elements for a large number of ‘data-like’ features which may appear almost anywhere within almost any kind of text. These features may be of particular interest in a range of disciplines; they all relate to objects external to the text itself, such as the names of persons and places, numbers and dates. They also pose particular problems for many natural language processing (NLP) applications because of the variety of ways in which they may be presented within a text. The elements described here, by making such features explicit, reduce the complexity of processing texts containing them.
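A referring string is a phrase which refers to some person, place, object, etc. Two elements are provided to mark such strings: <rs>, for a general-purpose name or referring string, and <name>, for a proper noun or noun phrase.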
The type attribute is used to distinguish amongst (for example) names of persons, places and organizations, where this is possible:
<q>My dear <rs type="person">Mr. Bennet</rs>, </q> said his lady to him one day, <q>have you heard that <rs type="place">Netherfield Park</rs> is let at last?</q>
It being one of the principles of the <rs type="organization">Circumlocution Office</rs> never, on any account whatsoever, to give a straightforward answer, <rs type="person">Mr Barnacle</rs> said, <q>Possibly.</q>
As the following example shows, the <rs> element may be used for any reference to a person, place, etc., not necessarily one in the form of a proper noun or noun phrase.
<q>My dear <rs type="person">Mr. Bennet</rs>,</q> said <rs type="person">his lady</rs> to him one day...
The <name> element, by contrast, is provided for the special case of referring strings which consist only of proper nouns; it may be used synonymously with the <rs> element, or nested within it if a referring string contains a mixture of common and proper nouns.
Simply tagging something as a name is generally not enough to enable automatic processing of personal names into the canonical forms usually required for reference purposes. The name as it appears in the text may be inconsistently spelled, partial, or vague. Moreover, name prefixes such as van or de la may or may not be included as part of the reference form of a name, depending on the language and country of origin of the bearer.
The key and reg attributes are also available for these and similar elements to help overcome these difficulties.
The key attribute may be useful as a means of gathering together all references to the same individual or location scattered throughout a document:
<q>My dear <rs type="person" key="BENM1">Mr. Bennet</rs>, </q> said <rs type="person" key="BENM2">his lady</rs> to him one day, <q>have you heard that <rs type="place" key="NETP1">Netherfield Park</rs> is let at last?</q>
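As a rough illustration of how key values might be used in processing, the following Python sketch groups referring strings by their key. It assumes the passage above has been wrapped in a <p> element so that it parses as well-formed XML; the function name refs_by_key is invented for this example and is not part of the TEI scheme.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The passage from the example above, wrapped in a <p> element
# (an assumption made here so that the fragment is well-formed XML).
SAMPLE = (
    '<p><q>My dear <rs type="person" key="BENM1">Mr. Bennet</rs>, </q> '
    'said <rs type="person" key="BENM2">his lady</rs> to him one day, '
    '<q>have you heard that <rs type="place" key="NETP1">Netherfield Park</rs> '
    'is let at last?</q></p>'
)

def refs_by_key(xml_text):
    """Map each key value to the list of referring strings that carry it."""
    index = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        key = elem.get("key")
        if elem.tag in ("rs", "name") and key is not None:
            # itertext() also collects text inside any nested elements.
            index[key].append("".join(elem.itertext()).strip())
    return dict(index)

index = refs_by_key(SAMPLE)
for key, strings in index.items():
    print(key, strings)
```

In a longer document, every later reference tagged with key="BENM1" (for example ‘her husband’) would be appended to the same list, whatever its wording.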
This use of the key attribute should be distinguished from that of the reg (regularization) attribute, which provides a means of marking the standard form of a referring string, as demonstrated below:
<name type="person" key="WADLM1" reg="de la Mare, Walter">Walter de la Mare</name> was born at <name key="Ch1" type="place">Charlton</name>, in <name key="KT1" type="county">Kent</name>, in 1873.
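One possible use of reg is to build a name index sorted by reference form. The sketch below assumes the sentence above is wrapped in a <p> element, and it falls back to the transcribed text when no reg value is supplied; the helper reference_form is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sentence from the example above, wrapped in <p> (an assumption)
# so that it can be parsed on its own.
SAMPLE = (
    '<p><name type="person" key="WADLM1" reg="de la Mare, Walter">'
    'Walter de la Mare</name> was born at '
    '<name key="Ch1" type="place">Charlton</name>, in '
    '<name key="KT1" type="county">Kent</name>, in 1873.</p>'
)

def reference_form(elem):
    """Return the reg value if present, else the text as transcribed."""
    return elem.get("reg") or "".join(elem.itertext()).strip()

names = ET.fromstring(SAMPLE).iter("name")
# Sort case-insensitively so that 'de la Mare' files between C and K.
entries = sorted(
    ((reference_form(n), n.get("key")) for n in names),
    key=lambda entry: entry[0].casefold(),
)
for form, key in entries:
    print(f"{form} [{key}]")
```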
More detailed tagging of the components of proper names is also possible, using the additional tag set for names and dates.
Tags for the more detailed encoding of times and dates include <date> and <time>.
The value attribute specifies a normalized form for the date or time, using a recognized format such as ISO 8601. Partial dates or times (e.g. ‘1990’, ‘September 1990’, ‘twelvish’) can usually be expressed by simply omitting a part of the value supplied; alternatively, imprecise dates or times (for example ‘early August’, ‘some time after ten and before twelve’) may be expressed as date or time ranges. If either end of the date or time range is known to be accurate (for example, ‘at some time before 1230’, ‘a few days after Hallowe'en’), the exact attribute may be used to specify this.
<date value="1980-02-21">21 Feb 1980</date> <date value="1990">1990</date> <date value="1990-09">September 1990</date>
Given on the <date value="1977-06-12">Twelfth Day of June in the Year of Our Lord One Thousand Nine Hundred and Seventy-seven of the Republic the Two Hundredth and first and of the University the Eighty-Sixth.</date>
<l>specially when it's nine below zero</l> <l>and <time value="15:00">three o'clock in the afternoon</time></l>
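To show how omitting part of a value might be handled by software, here is a minimal Python sketch that reads the three date shapes used in the examples above (YYYY, YYYY-MM and YYYY-MM-DD). It is an assumption of this sketch that only those shapes occur; times, ranges and the other forms permitted by ISO 8601 are deliberately left out, and the function name parse_partial_date is invented here.

```python
import re

# Matches YYYY, YYYY-MM or YYYY-MM-DD only; other ISO 8601 forms
# (times, week dates, ranges) are outside the scope of this sketch.
_PARTIAL_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")

def parse_partial_date(value):
    """Return (year, month, day), with None for any omitted part."""
    match = _PARTIAL_DATE.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported date value: {value!r}")
    return tuple(int(part) if part else None for part in match.groups())

for value in ("1980-02-21", "1990", "1990-09"):
    print(value, parse_partial_date(value))
```

Returning None for a missing month or day keeps the difference between ‘1990’ and ‘1 January 1990’ visible to later processing, instead of silently padding the value.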
Numbers can be written with either letters or digits (twenty-one, xxi, and 21) and their presentation is language-dependent (e.g. English 5th becomes Greek 5.; English 123,456.78 equals French 123.456,78). In natural-language processing or machine-translation applications, it is often helpful to distinguish them from other, more ‘lexical’ parts of the text. In other applications, the ability to record a number's value in standard notation is important. The <num> element provides this possibility:
<num value="33">xxxiii</num> <num type="cardinal" value="21">twenty-one</num> <num type="percentage" value="10">ten percent</num> <num type="percentage" value="10">10%</num> <num type="ordinal" value="5">5th</num>
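The following Python sketch suggests how an application might read the value attribute without having to interpret the written form. It assumes the examples above are wrapped in a <p> element and that every <num> carries a value in plain decimal notation; the function name numeric_values is hypothetical.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# The <num> examples from above, wrapped in <p> (an assumption)
# so that they form one well-formed fragment.
SAMPLE = (
    '<p><num value="33">xxxiii</num> '
    '<num type="cardinal" value="21">twenty-one</num> '
    '<num type="percentage" value="10">ten percent</num> '
    '<num type="percentage" value="10">10%</num> '
    '<num type="ordinal" value="5">5th</num></p>'
)

def numeric_values(xml_text):
    """List (written form, type, value) for each <num> element."""
    return [
        (num.text, num.get("type"), Decimal(num.get("value")))
        for num in ET.fromstring(xml_text).iter("num")
    ]

values = numeric_values(SAMPLE)
for written, kind, value in values:
    print(f"{written!r}: type={kind}, value={value}")
```

Note that ‘ten percent’ and ‘10%’ yield the same type and value although their written forms differ, which is the point of recording the value separately.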
Like names, dates, and numbers, abbreviations may be transcribed as they stand or expanded; they may be left unmarked, or encoded using the <abbr> element.
The <abbr> element is useful as a means of distinguishing semi-lexical items such as acronyms or jargon:
We can sum up the above discussion as follows: the identity of a <abbr>CC</abbr> is defined by that calibration of values which motivates the elements of its <abbr>GSP</abbr>;
Every manufacturer of <abbr>3GL</abbr> or <abbr>4GL</abbr> languages is currently nailing on <abbr>OOP</abbr> extensions
The type attribute may be used to distinguish types of abbreviation by their function, and the expan attribute may be used to supply an expansion:
<name><abbr type="title" expan="Doctor">Dr.</abbr> <abbr type="initial" expan="Marilyn">M.</abbr> Deegan</name> is the Director of the <abbr expan="Computers in Teaching Initiative" type="acronym">CTI</abbr> Centre for Textual Studies.
This element is also particularly useful when transcribing manuscript materials in which abbreviation is very frequent.
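Because the expan attribute keeps the expansion alongside the abbreviation as transcribed, either reading can be produced from the same encoding. The Python sketch below generates the expanded reading of the Deegan example above; it assumes the sentence is wrapped in a <p> element, and the function name expanded_text is invented for illustration.

```python
import xml.etree.ElementTree as ET

# The sentence from the example above, wrapped in <p> (an assumption).
SAMPLE = (
    '<p><name><abbr type="title" expan="Doctor">Dr.</abbr> '
    '<abbr type="initial" expan="Marilyn">M.</abbr> Deegan</name> '
    'is the Director of the '
    '<abbr expan="Computers in Teaching Initiative" type="acronym">CTI</abbr> '
    'Centre for Textual Studies.</p>'
)

def expanded_text(elem):
    """Return the text of elem, substituting expan for each <abbr>."""
    if elem.tag == "abbr" and elem.get("expan"):
        return elem.get("expan")
    parts = [elem.text or ""]
    for child in elem:
        parts.append(expanded_text(child))
        # The tail is the text that follows the child's closing tag.
        parts.append(child.tail or "")
    return "".join(parts)

expanded = expanded_text(ET.fromstring(SAMPLE))
print(expanded)
```

An <abbr> without an expan attribute, such as those in the earlier examples, is passed through unchanged.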
The <address> element is used to mark a postal address of any kind. It contains one or more <addrLine> elements, one for each line of the address.
<address> <addrLine>Computer Center (M/C 135)</addrLine> <addrLine>1940 W. Taylor, Room 124</addrLine> <addrLine>Chicago, IL 60612-7352</addrLine> <addrLine>U.S.A.</addrLine> </address>
The individual parts of an address may be further distinguished by using the <name> element discussed above (section 11.1. Names and Referring Strings).
<address> <addrLine>Computer Center (M/C 135)</addrLine> <addrLine>1940 W. Taylor, Room 124</addrLine> <addrLine><name type="city">Chicago</name>, IL 60612-7352</addrLine> <addrLine><name type="country">USA</name></addrLine> </address>
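As a small illustration of what this extra tagging makes possible, the Python sketch below recovers both the printed lines of the address and its typed components. The function names address_lines and typed_parts are assumptions of this example only.

```python
import xml.etree.ElementTree as ET

# The second address example from above.
SAMPLE = (
    "<address>"
    "<addrLine>Computer Center (M/C 135)</addrLine>"
    "<addrLine>1940 W. Taylor, Room 124</addrLine>"
    '<addrLine><name type="city">Chicago</name>, IL 60612-7352</addrLine>'
    '<addrLine><name type="country">USA</name></addrLine>'
    "</address>"
)

def address_lines(address):
    """Return the text of each <addrLine>, including nested <name> text."""
    return ["".join(line.itertext()).strip() for line in address.iter("addrLine")]

def typed_parts(address):
    """Map the type of each nested <name> to its content."""
    return {name.get("type"): name.text for name in address.iter("name")}

address = ET.fromstring(SAMPLE)
lines = address_lines(address)
parts = typed_parts(address)
print("\n".join(lines))
print(parts)
```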