Text Encoding Initiative

11. Names, Dates, Numbers and Abbreviations


The TEI scheme defines elements for a large number of ‘data-like’ features which may appear almost anywhere within almost any kind of text. These features may be of particular interest in a range of disciplines; they all relate to objects external to the text itself, such as the names of persons and places, numbers and dates. They also pose particular problems for many natural language processing (NLP) applications because of the variety of ways in which they may be presented within a text. The elements described here, by making such features explicit, reduce the complexity of processing texts containing them.

11.1. Names and Referring Strings

A referring string is a phrase which refers to some person, place, object, etc. Two elements are provided to mark such strings:

<rs>
contains a general purpose name or referring string. Attributes include:

type
indicates more specifically the object referred to by the referring string. Values might include person, place, ship, element, etc.

<name>
contains a proper noun or noun phrase. Attributes include:

type
indicates the type of the object which is being named by the phrase.

The type attribute is used to distinguish amongst (for example) names of persons, places and organizations, where this is possible:

<q>My dear <rs type="person">Mr. Bennet</rs>, </q>
said his lady to him one day, <q>have you heard
that <rs type="place">Netherfield Park</rs> is let
at last?</q>

It being one of the principles of the
<rs type="organization">Circumlocution Office</rs> never,
on any account whatsoever, to give a straightforward answer,
<rs type="person">Mr Barnacle</rs> said, <q>Possibly.</q>

As the following example shows, the <rs> element may be used for any reference to a person, place, etc., not necessarily one in the form of a proper noun or noun phrase.

<q>My dear <rs type="person">Mr. Bennet</rs>,</q>
said <rs type="person">his lady</rs> to him
one day...

The <name> element, by contrast, is provided for the special case of referring strings which consist only of proper nouns; it may be used synonymously with the <rs> element, or nested within it if a referring string contains a mixture of common and proper nouns.

Simply tagging something as a name is generally not enough to enable automatic processing of personal names into the canonical forms usually required for reference purposes. The name as it appears in the text may be inconsistently spelled, partial, or vague. Moreover, name prefixes such as van or de la may or may not be included as part of the reference form of a name, depending on the language and country of origin of the bearer.

The following attributes are also available for these and similar elements to help overcome these difficulties:

key
provides an alternative identifier for the object being named, such as a database record key.
reg
gives a normalized or regularized form of the name used.

The key attribute may be useful as a means of gathering together all references to the same individual or location scattered throughout a document:

  <q>My dear <rs type="person" key="BENM1">Mr. Bennet</rs>,
  </q> said <rs type="person" key="BENM2">his lady</rs>
  to him one day, <q>have you heard that
  <rs type="place" key="NETP1">Netherfield Park</rs>
  is let at last?</q>

This use should be distinguished from that of the reg (regularization) attribute, which provides a means of marking the standard form of a referring string, as demonstrated below:

  <name type="person" key="WADLM1" reg="de la Mare, Walter">
     Walter de la Mare</name> was born at
  <name key="Ch1" type="place">Charlton</name>, in
  <name key="KT1" type="county">Kent</name>, in 1873.
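As a hedged illustration of how these two attributes might be exploited by software, the following Python sketch gathers references by their key and uses the reg form, where one is supplied, as the index heading. It is only a sketch under stated assumptions: the Walter de la Mare fragment above has been wrapped in a <p> root so that the standard-library parser accepts it, and the function name index_references is hypothetical, not part of the TEI scheme.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The Walter de la Mare example, wrapped in a <p> root (an assumption
# made only so that the fragment parses as a well-formed document).
SAMPLE = """<p><name type="person" key="WADLM1" reg="de la Mare, Walter">
   Walter de la Mare</name> was born at
<name key="Ch1" type="place">Charlton</name>, in
<name key="KT1" type="county">Kent</name>, in 1873.</p>"""


def index_references(xml_text):
    """Map each key to (heading, list of surface forms).

    The heading is the reg value when one is supplied, otherwise the
    first surface form encountered for that key.
    """
    root = ET.fromstring(xml_text)
    forms = defaultdict(list)
    headings = {}
    for elem in root.iter():
        if elem.tag not in ("rs", "name"):
            continue
        key = elem.get("key")
        if key is None:
            continue  # references without a key cannot be gathered
        # Collapse the whitespace introduced by line breaks in the source.
        surface = " ".join("".join(elem.itertext()).split())
        forms[key].append(surface)
        if elem.get("reg"):
            headings[key] = elem.get("reg")
        else:
            headings.setdefault(key, surface)
    return {key: (headings[key], forms[key]) for key in forms}


index = index_references(SAMPLE)
for key, (heading, surfaces) in sorted(index.items()):
    print(key, heading, surfaces)
```

The key does the gathering and the reg value supplies the form used for display or sorting, which is the division of labour the two attributes are intended to support.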

More detailed tagging of the components of proper names is also possible, using the additional tag set for names and dates.

11.2. Dates and Times

Tags for the more detailed encoding of times and dates include the following:

<date>
contains a date in any format. Attributes include:

calendar
indicates the system or calendar to which the date belongs.
value
gives the value of the date in some standard form, usually yyyy-mm-dd.

<time>
contains a phrase defining a time of day in any format. Attributes include:

value
gives the value of the time in a standard form.

The value attribute specifies a normalized form for the date or time, using a recognized format such as ISO 8601. Partial dates or times (e.g. ‘1990’, ‘September 1990’, ‘twelvish’) can usually be expressed by simply omitting a part of the value supplied; alternatively, imprecise dates or times (for example ‘early August’, ‘some time after ten and before twelve’) may be expressed as date or time ranges. If either end of the date or time range is known to be accurate (for example, ‘at some time before 1230’, ‘a few days after Hallowe'en’), the exact attribute may be used to specify this.

Examples:

<date value="1980-02-21">21 Feb 1980</date>
<date value="1990">1990</date>
<date value="1990-09">September 1990</date>
Given on the <date value="1977-06-12">Twelfth Day of June
in the Year of Our Lord One Thousand Nine Hundred and
Seventy-seven of the Republic the Two Hundredth and first
and of the University the Eighty-Sixth.</date>
<l>specially when it's nine below zero</l>
<l>and <time value="15:00">three o'clock in the
       afternoon</time></l>
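To suggest how a normalized value simplifies processing, here is a minimal Python sketch that reads the date values used in the examples above, including the partial ones. It assumes that every value takes the form yyyy, yyyy-mm or yyyy-mm-dd; the function name parse_date_value is hypothetical, and the sketch does not check that the month and day fall within calendar limits.

```python
import re

# Assumed pattern: yyyy, yyyy-mm or yyyy-mm-dd, as in the examples above.
_DATE_VALUE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def parse_date_value(value):
    """Split a value attribute into (year, month, day).

    Parts omitted from a partial date come back as None.
    Raises ValueError if the string does not fit the assumed pattern.
    """
    match = _DATE_VALUE.fullmatch(value)
    if match is None:
        raise ValueError(f"unrecognized date value: {value!r}")
    return tuple(int(part) if part is not None else None
                 for part in match.groups())


for value in ("1980-02-21", "1990", "1990-09", "1977-06-12"):
    print(value, parse_date_value(value))
```

The same function would reject the surface forms themselves (for example ‘21 Feb 1980’), which is precisely why the normalized value is worth recording.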

11.3. Numbers

Numbers can be written with either letters or digits (twenty-one, xxi, and 21) and their presentation is language-dependent (e.g. English 5th becomes Greek 5.; English 123,456.78 equals French 123.456,78). In natural-language processing or machine-translation applications, it is often helpful to distinguish them from other, more ‘lexical’ parts of the text. In other applications, the ability to record a number's value in standard notation is important. The <num> element provides this possibility:

<num>
contains a number, written in any form. Attributes include:

type
indicates the type of numeric value. Suggested values include: fraction, ordinal (for ordinal numbers, e.g. ‘21st’), percentage, and cardinal (an absolute number, e.g. ‘21’, ‘21.5’, etc.).
value
supplies the value of the number in an application-dependent standard form.

For example:

<num value="33">xxxiii</num>
<num type="cardinal" value="21">twenty-one</num>
<num type="percentage" value="10">ten percent</num>
<num type="percentage" value="10">10%</num>
<num type="ordinal" value="5">5th</num>
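The following Python sketch, offered only as an illustration, shows how the value attribute lets an application treat ‘xxxiii’, ‘twenty-one’ and ‘10%’ uniformly, without interpreting the written form. It assumes that the values are given in plain decimal notation (the scheme itself leaves the form application-dependent) and wraps the examples above in a <p> root so that they parse; the function name numeric_values is hypothetical.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# The examples above, wrapped in a <p> root (an assumption made only
# so that the fragment parses as a well-formed document).
SAMPLE = """<p><num value="33">xxxiii</num>
<num type="cardinal" value="21">twenty-one</num>
<num type="percentage" value="10">ten percent</num>
<num type="percentage" value="10">10%</num>
<num type="ordinal" value="5">5th</num></p>"""


def numeric_values(xml_text):
    """Return (surface form, type, value) for every <num> with a value."""
    root = ET.fromstring(xml_text)
    rows = []
    for num in root.iter("num"):
        raw = num.get("value")
        if raw is None:
            continue  # nothing to normalize; leave the number as written
        surface = "".join(num.itertext()).strip()
        # Assumes decimal notation; Decimal avoids binary rounding.
        rows.append((surface, num.get("type"), Decimal(raw)))
    return rows


rows = numeric_values(SAMPLE)
for row in rows:
    print(row)
```

Note that ‘ten percent’ and ‘10%’ yield the same value and type, although their written forms have nothing in common.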

11.4. Abbreviations and their Expansion

Like names, dates, and numbers, abbreviations may be transcribed as they stand or expanded; they may be left unmarked, or encoded using the following element:

<abbr>
contains an abbreviation of any sort. Attributes include:

expan
gives an expansion of the abbreviation.
type
allows the encoder to classify the abbreviation according to some convenient typology. Sample values include contraction, suspension, brevigraph, superscription, or acronym. The type attribute may also be given values like title (for titles of address), geographic, organization, etc., describing the nature of the object referred to.

The <abbr> element is useful as a means of distinguishing semi-lexical items such as acronyms or jargon:

We can sum up the above discussion as follows:  the identity of a
<abbr>CC</abbr> is defined by that calibration of values which
motivates the elements of its <abbr>GSP</abbr>;

Every manufacturer of <abbr>3GL</abbr> or <abbr>4GL</abbr>
languages is currently nailing on <abbr>OOP</abbr> extensions

The type attribute may be used to distinguish types of abbreviation by their function, and the expan attribute may be used to supply an expansion:

 <name><abbr type="title" expan="Doctor">Dr.</abbr>
 <abbr type="initial" expan="Marilyn">M.</abbr>
 Deegan</name>
 is the Director of the
 <abbr expan="Computers in Teaching Initiative" type="acronym">
 CTI</abbr> Centre for Textual Studies.

This element is also particularly useful when transcribing manuscript materials in which abbreviation is very frequent.
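By way of illustration, the Python sketch below produces a reading text in which each abbreviation is replaced by the expansion recorded in its expan attribute, while abbreviations without one are left as transcribed. It is a sketch under stated assumptions: the Deegan example above is wrapped in a <p> root so that it parses, and the function name expanded_text is hypothetical.

```python
import xml.etree.ElementTree as ET

# The Deegan example, wrapped in a <p> root (an assumption made only
# so that the fragment parses as a well-formed document).
SAMPLE = """<p><name><abbr type="title" expan="Doctor">Dr.</abbr>
<abbr type="initial" expan="Marilyn">M.</abbr>
Deegan</name>
is the Director of the
<abbr expan="Computers in Teaching Initiative" type="acronym">
CTI</abbr> Centre for Textual Studies.</p>"""


def expanded_text(elem):
    """Return the text of elem with each <abbr> replaced by its expan value.

    An <abbr> without an expan attribute is kept as transcribed.
    """
    if elem.tag == "abbr" and elem.get("expan") is not None:
        return elem.get("expan")
    parts = [elem.text or ""]
    for child in elem:
        parts.append(expanded_text(child))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)


root = ET.fromstring(SAMPLE)
# Collapse line breaks from the source layout into single spaces.
expanded = " ".join(expanded_text(root).split())
print(expanded)
```

Because the transcribed form remains in the element content, the same encoding can serve both a diplomatic and an expanded presentation.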

11.5. Addresses

The <address> element is used to mark a postal address of any kind. It contains one or more <addrLine> elements, one for each line of the address.

<address>
contains a postal or other address, for example of a publisher, an organization, or an individual.
<addrLine>
contains one line of a postal or other address.

Here is a simple example:

<address>
<addrLine>Computer Center (M/C 135)</addrLine>
<addrLine>1940 W. Taylor, Room 124</addrLine>
<addrLine>Chicago, IL 60612-7352</addrLine>
<addrLine>U.S.A.</addrLine>
</address>

The individual parts of an address may be further distinguished by using the <name> element discussed above (section 11.1. Names and Referring Strings):

<address>
<addrLine>Computer Center (M/C 135)</addrLine>
<addrLine>1940 W. Taylor, Room 124</addrLine>
<addrLine><name type="city">Chicago</name>, IL 60612-7352</addrLine>
<addrLine><name type="country">USA</name></addrLine>
</address>

Date: (revised October 2004) Author: Lou Burnard (revised SPQR).
Copyright TEI 1995