CE W 05: Semantics for characters and linguistic features
Contents
- Overview
- Features in the old TEI-WSD
- Features in Eric Albright's paper Design of an electronic method for describing writing systems
- Evaluation
Overview
This document attempts to enumerate features and categories that are needed for a writing system declaration. At this time (2002-08-26), it is still in a rather rough state.
Features in the ‘old’ TEI-WSD
- language
- script
- direction
-
characters
exceptions; details for characters (form, desc), are listed here. 1
- note
Features in Eric Albright's paper Design of an electronic method for describing writing systems
-
6.1. Linguistic elements
Link to the linguistic description. Elements <linguistic-unit> , <sequence> (container for sequential units), containing <linguistic-unitRef> .
-
6.2. Graphs
discrete segments, <graph> , contains: <name>
-
6.3. Graphemes
graphemes, if such a thing exists: <grapheme>
-
6.4. Writing system units
higher level units: characters, syllables, words, phrases, sentences: <writing-unit> (do we need this formal description?)
-
6.5. Classes
<class> . Classes might be better built by enumerating the feature on the graphs. But, admittedly, there is a level of abstract description that might be difficult to achieve otherwise.
- 6.6. Computational units
-
6.6.1. Key codes
we do not worry about this.
-
6.6.2. Coded units
we do not worry about this.
-
6.6.3. Glyphs
this is meant to refer to the glyph index of a font. This is a low-level feature that is usually not available in text processing (where glyphs in fonts are addressed through cmap tables by character code).
- 7. correspondence rules 2
Evaluation
The TEI WSD and Albright's EWSD have surprisingly little overlap. The TEI WSD is mainly a mechanism to define a mapping between various legacy coded character sets and the Universal Character Set (UCS). EWSD, on the other hand, starts with a tabula rasa and enumerates all the information that it deems useful for the electronic processing of a writing system.
- the definition of a new character
- definition of semantics for the characters
- definition of linguistic properties for features of a writing system.
Definition of new characters
- Only a subset (a legacy CCS like Big5) of the UCS is used. The ‘TEI WSD-NG’ will need to be able to map additional UCS characters (those that were not available in the subset but are in fact in the UCS).
- The document uses an older version of the UCS. Characters that did not have a UCS mapping at the time the document was created might have such a mapping now. In such cases, one convenient way to make this known to XML processors is the WSD-NG.
The various strategies for defining a new character are the subject of work paper CE W 02. Here we will simply assume that such a mechanism is in place.
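As an illustration of the first situation above, the following Python sketch looks up one legacy-coded character in a local override table before falling back to the stock Big5 mapping. The names `EXTRA_MAPPINGS` and `to_ucs`, the byte value `FE FE`, and the private-use code point U+E000 are assumptions made for this example only; they are not part of any WSD proposal.

```python
# Hypothetical local declarations: legacy byte sequences mapped to UCS
# code points. The entry below (byte value and PUA target) is an
# illustrative assumption, not a real mapping.
EXTRA_MAPPINGS = {
    b"\xfe\xfe": "\ue000",
}


def to_ucs(code: bytes, encoding: str = "big5") -> str:
    """Return the UCS character for one legacy-coded character."""
    # Local declarations take precedence over the stock mapping table.
    if code in EXTRA_MAPPINGS:
        return EXTRA_MAPPINGS[code]
    try:
        # Fall back to the mapping table shipped with the codec.
        return code.decode(encoding)
    except UnicodeDecodeError:
        raise LookupError(f"no UCS mapping declared for {code.hex()}") from None


# Usage: a character covered by the stock table, then a locally declared one.
stock = to_ucs(b"\xa4\xa4")
local = to_ucs(b"\xfe\xfe")
```

Letting the local table win over the codec is a design choice of this sketch; a real declaration might instead only fill gaps in the stock table.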
Semantics for characters
‘TEI WSD-NG’ will need a flexible way to define (for new characters defined with the above mechanism) or overlay (for existing characters) the semantics of a character. What kind of semantics do we need?
Normative Character Properties in Unicode (see The Unicode Standard, Version 3.0, Addison-Wesley, p. 73, Table 4-1).
- Case
- Combining Classes
- Conjoining Jamo (1100..11FF)
- Decomposition (Canonical and Compatibility)
- Directionality
- Jamo Short Name
- Numeric Value
- Private Use
- Special Character Properties
- Surrogate
- Mirrored
- Unicode Character Names
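To give a feel for what several of these normative properties look like as data, here is a minimal Python sketch that collects them for a single character from the standard `unicodedata` module. The function name `normative_properties` and the dictionary keys are assumptions of this example. The values come from the Unicode version bundled with the Python interpreter, not from Version 3.0.

```python
import unicodedata


def normative_properties(ch: str) -> dict:
    """Collect a few normative Unicode properties for one character."""
    return {
        "name": unicodedata.name(ch, ""),                # Unicode character name
        "general_category": unicodedata.category(ch),    # e.g. "Lu", "Mn"
        "combining_class": unicodedata.combining(ch),    # 0 for non-combining
        "decomposition": unicodedata.decomposition(ch),  # "" if none
        "directionality": unicodedata.bidirectional(ch), # e.g. "L", "R", "NSM"
        "mirrored": bool(unicodedata.mirrored(ch)),
        "numeric_value": unicodedata.numeric(ch, None),  # None if not numeric
    }


# Usage: properties of COMBINING ACUTE ACCENT (U+0301).
acute = normative_properties("\u0301")
```

Properties such as Jamo Short Name or the Private Use and Surrogate ranges are not exposed by this module and would need another source.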
Informative Character Properties (see The Unicode Standard, Version 3.0, Addison-Wesley, p. 73, Table 4-2).
- Case Mapping
- Dashes
- East Asian Width
- Letters (Alphabetic and Ideographic)
- Line Breaking
- Mathematical Property
- Spaces
- Unicode 1.0 Names
- Line boundary control
- Hyphenation control
- Fraction formatting
- Special behavior with nonspacing marks
- Double nonspacing marks
- Joining
- Bidirectional ordering
- Alternate formatting
- Syriac abbreviation
- Indic dead-character formation
- Mongolian variant selectors
- Ideographic variation indication
- Ideographic description
- Interlinear annotation
- Object replacement
- Code conversion fallback
- Byte order signature
What strategy should be chosen to deal with Unicode character properties in ‘TEI WSD-NG’? It might be useful to hardcode the definition skeleton (that is, the key for the definition is predefined and only the value needs to be given) for some of the normative properties into the ‘TEI WSD-NG’, while others, such as the informative properties, might be better served by a freeform definition skeleton (that is, a key/value pair can be freely defined). Additionally, we will need to provide for the possibility of defining arbitrary additional properties. Do we need to make these syntactically different from Unicode Character Properties? It might be useful for a processor to be able to recognize Unicode Character Properties, but on the other hand a specific attribute value could also provide for this.
- Graphical appearance
- Pronunciation
- Name of a font and code point (UCS, possibly PUA) within the font that contains the character
- Mappings to non-UCS or private character repertoires, or other reference systems
- Standard orthographic form of the character(s)
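The distinction between a hardcoded and a freeform definition skeleton can be sketched as a data structure. In the Python sketch below, a few normative properties have predefined keys, everything else is a free key/value pair, and a flag on the free pairs plays the role of the "specific attribute value" that lets a processor recognize Unicode properties. All class, field, and property names here are hypothetical; the paper does not fix any syntax.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FreeProperty:
    """Freeform skeleton: both key and value are chosen by the encoder."""
    key: str
    value: str
    unicode_defined: bool = False  # marks a Unicode Character Property


@dataclass
class CharacterDeclaration:
    """Hardcoded skeleton: the keys below are predefined, values optional."""
    code_point: int
    name: Optional[str] = None
    combining_class: Optional[int] = None
    directionality: Optional[str] = None
    mirrored: Optional[bool] = None
    extra: List[FreeProperty] = field(default_factory=list)

    def unicode_properties(self) -> Dict[str, object]:
        """Return the given fixed keys plus free pairs flagged as Unicode."""
        fixed = {
            "name": self.name,
            "combining_class": self.combining_class,
            "directionality": self.directionality,
            "mirrored": self.mirrored,
        }
        props = {k: v for k, v in fixed.items() if v is not None}
        props.update({p.key: p.value for p in self.extra if p.unicode_defined})
        return props


# Usage: a private-use character with one informative Unicode property
# and two local properties (all values are made up for illustration).
decl = CharacterDeclaration(
    code_point=0xE000,
    name="EXAMPLE PRIVATE USE CHARACTER",
    directionality="L",
    extra=[
        FreeProperty("East Asian Width", "W", unicode_defined=True),
        FreeProperty("pronunciation", "zhong1"),
        FreeProperty("font", "ExampleFont"),
    ],
)
```

With this layout the free pairs share one syntax, and only the flag separates Unicode properties from local ones such as pronunciation or font name.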
Definition of linguistic features
This is largely covered in Albright's thesis. It might best be served by a third, separate module of the ‘TEI WSD-NG’ to be used only where needed.