CE W 05: Semantics for characters and linguistic features

Overview
Features in the old TEI-WSD
Features in Eric Albright's paper Design of an electronic method for describing writing systems
Evaluation

Overview

This document attempts to enumerate features and categories that are needed for a writing system declaration. At this time (2002-08-26), it is still in a rather rough state.

Features in the ‘old’ TEI-WSD

This list enumerates some of the features that are available in the WSD as of P4. We need to decide which of them to retain for P5.

language
script
direction
characters
exceptions; details for characters (form, desc) are listed here. ¹
note

Features in Eric Albright's paper Design of an electronic method for describing writing systems

In chapter 6 of his thesis, Albright discusses the following ‘Elements of writing systems’:

6.1. Linguistic elements
Link to the linguistic description. Elements: <linguistic-unit>, <sequence> (container for sequential units), containing <linguistic-unitRef>.
6.2. Graphs
discrete segments: <graph>, contains <name>
6.3. Graphemes
graphemes, if such a thing exists: <grapheme>
6.4. Writing system units
higher level units: characters, syllables, words, phrases, sentences: <writing-unit> (do we need this formal description?)
6.5. Classes
<class>. Classes might be better built by enumerating the features on the graphs. Admittedly, though, there is a level of abstract description that might be difficult to achieve otherwise.
6.6. Computational units
6.6.1. Key codes
we do not worry about this.
6.6.2. Coded units
we do not worry about this.
6.6.3. Glyphs
this is meant to refer to the glyph index of a font. This is a low-level feature that is usually not available in text processing (where glyphs in fonts are addressed through cmap tables by character code).
7. correspondence rules ²

Evaluation

The TEI WSD and Albright's EWSD have surprisingly little overlap. The TEI WSD is mainly a mechanism to define a mapping between various legacy coded character sets and the Universal Character Set (UCS). EWSD, on the other hand, starts with a tabula rasa and enumerates all information that it deems useful for the electronic processing of a writing system.

In the following sections, an attempt will be made to evaluate which information from these earlier attempts at defining a WSD should be retained for the ‘TEI WSD-NG’. This will be split into three parts:

the definition of a new character
definition of semantics for the characters
definition of linguistic properties for features of a writing system.

Definition of new characters

In the ‘TEI WSD-NG’ we will assume the document encoding to be the UCS, so in most cases no mapping is needed. There are, however, some special cases where such a mapping might still be required:

Only a subset of the UCS (a legacy CCS like Big5) is used. The ‘TEI WSD-NG’ will need to be able to map additional UCS characters (those that were not available in the subset, but are in fact in the UCS).
The document uses an older version of the UCS. Characters that did not have a UCS mapping at the time the document was created might have such a mapping now. In such cases, the WSD-NG is one convenient way to make this known to XML processors.

The various strategies for defining a new character are the subject of work paper CE W 02. Here we will simply assume that such a mechanism is in place.
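To make the second special case more concrete, here is a minimal Python sketch of what such a mapping amounts to once it has been read from a declaration. Everything in it is an assumption made for illustration: the names `LEGACY_TO_UCS`, `upgrade_text` and `unmapped_private_use` are hypothetical, and the code point pairs are invented rather than taken from any real document.

```python
from typing import Dict, List

# Hypothetical mapping table: private-use code points that an older document
# used as stand-ins, paired with the UCS code points assumed to have been
# assigned later. The pairs are invented for illustration only.
LEGACY_TO_UCS: Dict[int, int] = {
    0xE000: 0x20000,
    0xE001: 0x20001,
}


def upgrade_text(text: str, table: Dict[int, int] = LEGACY_TO_UCS) -> str:
    """Replace every mapped private-use character with its UCS equivalent."""
    # str.translate accepts a dict from code point to code point.
    return text.translate(table)


def unmapped_private_use(text: str, table: Dict[int, int] = LEGACY_TO_UCS) -> List[str]:
    """List BMP private-use characters (U+E000..U+F8FF) the table does not cover."""
    return sorted(
        {ch for ch in text if 0xE000 <= ord(ch) <= 0xF8FF and ord(ch) not in table}
    )
```

A processor could call `upgrade_text` on character data and use `unmapped_private_use` to report characters that still need a definition of their own.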

Semantics for characters

‘TEI WSD-NG’ will need a flexible way to define (for new characters defined with the above mechanism) or overlay (for existing characters) the semantics of a character. What kind of semantics do we need?

To answer this question, it might be useful to first look at the character semantics that are defined by The Unicode Standard. The Unicode Standard divides character semantics into two categories, normative and informative (see The Unicode Standard, Version 3.0, Addison and Wesley, Chapter 4). As of The Unicode Standard 3.0, the normative properties are as follows:

Normative Character Properties in Unicode (see The Unicode Standard, Version 3.0, Addison and Wesley, p. 73, Table 4-1).

Case
Combining Classes
Conjoining Jamo (1100–11FF)
Decomposition (Canonical and Compatibility)
Directionality
Jamo Short Name
Numeric Value
Private Use
Special Character Properties
Surrogate
Mirrored
Unicode Character Names

According to The Unicode Standard, Version 3.0, Addison and Wesley, Section 3.4, p. 42, Case, Numeric Value, Directionality and Mirrored are designated ‘simple character properties’.
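As an illustration of what the four simple character properties look like to a processor, the following sketch reads them from Python's standard `unicodedata` module. Two caveats: the module reflects whatever Unicode version ships with the Python interpreter, not Version 3.0, and the function name `simple_properties` and the dictionary keys are hypothetical names chosen for this example. The case test is a rough approximation that ignores titlecase.

```python
import unicodedata


def simple_properties(ch: str) -> dict:
    """Collect the four 'simple' properties for a single character."""
    if ch.isupper():
        case = "upper"
    elif ch.islower():
        case = "lower"
    else:
        case = None  # caseless, or titlecase (not distinguished here)
    return {
        "name": unicodedata.name(ch, None),
        "case": case,
        # numeric() returns a float, or the default if there is no value
        "numeric_value": unicodedata.numeric(ch, None),
        # bidirectional() returns the bidi class abbreviation, e.g. "L" or "R"
        "directionality": unicodedata.bidirectional(ch),
        # mirrored() returns 1 or 0; convert to a boolean
        "mirrored": bool(unicodedata.mirrored(ch)),
    }
```

For example, `simple_properties("(")` reports the character as mirrored, and `simple_properties("\u05d0")` (HEBREW LETTER ALEF) reports right-to-left directionality.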

The informative properties as of The Unicode Standard 3.0 are given as follows:

Informative Character Properties (see The Unicode Standard, Version 3.0, Addison and Wesley, p. 73, Table 4-2).

Case Mapping
Dashes
East Asian Width
Letters (Alphabetic and Ideographic)
Line Breaking
Mathematical Property
Spaces
Unicode 1.0 Names

In addition to these lists, there are also ‘special character properties’ as enumerated in The Unicode Standard, Version 3.0, Addison and Wesley Section p.47. These are:

Line boundary control
Hyphenation control
Fraction formatting
Special behavior with nonspacing marks
Double nonspacing marks
Joining
Bidirectional ordering
Alternate formatting
Syriac abbreviation
Indic dead-character formation
Mongolian variant selectors
Ideographic variation indication
Ideographic description
Interlinear annotation
Object replacement
Code conversion fallback
Byte order signature

Furthermore, there is the ‘General Category’, which is defined for all Unicode characters and assigns them to some general classes. This general category partly overlaps with the character properties listed above.
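The overlap can be seen by grouping a few characters by their General Category value. The sketch below uses `unicodedata.category` from the Python standard library (which follows the Unicode version bundled with the interpreter); the helper name `group_by_general_category` is a hypothetical name for this example.

```python
import unicodedata
from collections import defaultdict


def group_by_general_category(text: str) -> dict:
    """Group the characters of a string by their two-letter General Category."""
    groups = defaultdict(list)
    for ch in text:
        groups[unicodedata.category(ch)].append(ch)
    return dict(groups)
```

Categories such as `Lu` and `Ll` restate case, `Nd` implies a numeric value, and `Zs` and `Pd` coincide with the informative Spaces and Dashes properties, so a declaration that records both the General Category and the individual properties carries some redundancy.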

What strategy should be chosen to deal with Unicode character properties in ‘TEI WSD-NG’? It might be useful to hardcode the definition skeleton (that is, the key for the definition is predefined, and only the value needs to be given) for some of the normative properties into the ‘TEI WSD-NG’, while others, such as the informative properties, might be better served by a freeform definition skeleton (that is, a key/value pair can be freely defined). Additionally, we will need to provide for the possibility of defining arbitrary additional properties -- do we need to make this syntactically different from Unicode Character Properties? It might be useful for a processor to be able to recognize Unicode Character Properties, but on the other hand a specific attribute value could also provide for this.
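The difference between the two kinds of definition skeleton can be sketched as a data structure. This is only one possible reading of the paragraph above, not a proposed design: the class name `CharacterSemantics`, the choice of predefined keys, and the source labels returned by `lookup` are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Assumed selection of normative properties that get a hardcoded key.
PREDEFINED_KEYS = (
    "case",
    "combining_class",
    "directionality",
    "numeric_value",
    "mirrored",
)


@dataclass
class CharacterSemantics:
    code_point: int
    # Hardcoded skeleton: the key is fixed, only the value is supplied.
    case: Optional[str] = None
    combining_class: Optional[int] = None
    directionality: Optional[str] = None
    numeric_value: Optional[float] = None
    mirrored: Optional[bool] = None
    # Freeform skeleton for informative Unicode properties (key/value pairs).
    unicode_informative: Dict[str, str] = field(default_factory=dict)
    # Arbitrary additional properties that are not Unicode properties.
    local: Dict[str, str] = field(default_factory=dict)

    def lookup(self, key: str) -> Optional[Tuple[str, object]]:
        """Return (source, value) so a processor can tell where a property comes from."""
        if key in PREDEFINED_KEYS:
            return ("unicode-normative", getattr(self, key))
        if key in self.unicode_informative:
            return ("unicode-informative", self.unicode_informative[key])
        if key in self.local:
            return ("local", self.local[key])
        return None
```

Keeping the Unicode properties and the arbitrary ones in separate containers plays the role that a distinguishing attribute value would play in the markup: the syntax for a key/value pair stays the same, and only the source label differs.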

In addition to the Unicode properties, a number of other properties are required frequently and should be given predefined definition skeletons:

Graphical appearance
Pronunciation
Name of a font and code point (UCS, possibly PUA) within the font that contains the character
Mappings to non-UCS or private character repertoires, or other reference systems
Standard orthographic form of the character(s)

Definition of linguistic features

This is largely covered in Albright's thesis. It might best be served by a third, separate module of the ‘TEI WSD-NG’ to be used only where needed.

Notes

¹ It seems there is no way to specify properties like case equivalents. Character classes can be assigned to the character element, but there is only a very limited set of possible classes.

² Do we need this for text encoding? Perhaps case relationships and collation sequences are needed.

Last recorded change to this page: 2007-09-16