5 Characters, Glyphs, and Writing Modes

Chapter vi. Languages and Character Sets introduced the fundamental notions of language identification and character representation in an encoded TEI document. In this chapter we discuss some additional issues relating to the way that written language is represented in a TEI document. In sections 5.1 Is Your Journey Really Necessary? and 5.2 Markup Constructs for Representation of Characters and Glyphs we introduce markup which may be used to represent and document non-standard characters, that is, written symbols for which no codepoint exists in Unicode. The same markup may be used to annotate existing characters according to their visual or other properties, and thus process them as distinct glyphs (see section 5.3 Annotating Characters), or to define new characters or glyphs (section 5.4 Adding New Characters). We also provide recommendations concerning the Unicode Private Use Area (5.5 How to Use Code Points from the Private Use Area). Finally, in section 5.6 Writing Modes we discuss ways of documenting the writing mode used in a source text, that is, the directionality of the script, the orientation of individual characters, and related questions.

5.1 Is Your Journey Really Necessary?

Despite the availability of Unicode, text encoders still sometimes find that the published repertoire of available characters is inadequate to their needs. This is particularly the case when dealing with ancient languages, for which encoding standards do not yet exist, or where an encoder wishes to represent variant forms of a character or glyphs. The module defined by this chapter provides a mechanism to satisfy that need, while retaining compatibility with standards.

When encoders encounter some graphical unit in a document which is to be represented electronically, the first issue to be resolved should be ‘Is this really a different character?’ To determine whether a particular graphical unit is a character or not, see Terminology and Key Concepts.

If the unit is indeed determined to be a character, the next question should be ‘Has this character been encoded already?’ To determine whether a character has been encoded, encoders should take the following steps:

Check the Unicode web site at http://www.unicode.org, in particular the page "Where is my Character?", and the associated character code charts. Alternatively, users can check the latest published version of The Unicode Standard (Unicode Consortium (2006)), though the web site is often more up to date than the printed version, and should be checked for preference.

The pictures (‘glyphs’) in the Unicode code charts are only meant to be representative, not definitive. If a specific form of an already encoded character is required for a project, refer to the guidelines contained below under Annotating Characters. Remember that your encoded document may be rendered on a system which has different fonts from yours: if the specific form of a character is important to you, then you should document it.
Check the Proposed New Characters web page (http://unicode.org/alloc/Pipeline.html) to see whether the character is in line for approval.
Ask on the Unicode email list (http://www.unicode.org/consortium/distlist.html) to see whether a proposal is pending, or to determine whether this character is considered eligible for addition to the Unicode Standard.

Since there are now close to 100,000 characters in Unicode, chances are good that what you need is already there, but it might not be easy to find, since it might have a different name in Unicode. Look again, this time at other sites, for example http://www.eki.ee/letter/, which also provide searches based on scripts and languages. Take care, however, that all the properties of what seems to be a relevant character are consistent with those of the character you are looking for. For example, if your character is definitely a digit, but the properties of the best match you can find for it say that it is a letter, you may have a character not yet defined in Unicode.
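The property check described above can be sketched with Python's standard `unicodedata` module. This is only an illustration: the helper name `describe` and the choice of sample characters are assumptions made here, and the values returned depend on the version of the Unicode Character Database bundled with the Python build in use.

```python
import unicodedata

def describe(ch):
    """Return a few Unicode Character Database properties for one character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),      # e.g. Nd, Nl, Ll
        "bidi": unicodedata.bidirectional(ch),     # e.g. L, R, AN
        "digit": unicodedata.digit(ch, None),      # None if not a digit
    }

# Look up a candidate by its Unicode name, then inspect its properties.
candidate = unicodedata.lookup("ARABIC-INDIC DIGIT THREE")
info = describe(candidate)

# A character that is "definitely a digit" should have category Nd.
is_decimal_digit = info["category"] == "Nd"

# By contrast, ROMAN NUMERAL FOUR is a letter-like number (Nl), not a digit.
roman = describe("\u2163")
```

If the best match found for a digit reports a letter category instead, that mismatch is a sign that the character may not yet be defined in Unicode.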

In general, it is advisable to avoid Unicode characters generally described as presentation forms.²⁴ However, if the character you are looking for is being used in a notation (rather than as part of the orthography of a language) then it is quite acceptable to select characters from the Mathematical Operators block, provided that they have the appropriate properties (i.e. So: Symbol, Other; or Sm: Symbol, Math).

An encoded character may be precomposed or it may be formed from base characters and combining diacritical marks. Either will suffice for a character to be "found" as an encoded character.
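As a small illustration of this point, the following sketch uses Python's standard `unicodedata` module to show that a precomposed character and its base-plus-combining-mark sequence are distinct strings which become identical after Unicode normalization. The helper name `same_character` is an assumption made for this example.

```python
import unicodedata

precomposed = "\u00C4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
combining = "A\u0308"    # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

def same_character(a, b):
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The raw strings differ, but they are canonically equivalent.
raw_equal = precomposed == combining
equivalent = same_character(precomposed, combining)

# Canonical decomposition (NFD) of the precomposed form gives the sequence.
decomposed = unicodedata.normalize("NFD", precomposed)
```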

If there are several possible Unicode characters to choose amongst, it is good practice to consult other colleagues and practitioners to see whether a consensus has emerged in favour of one or other of them.

If, however, no suitable form of your character seems to exist, the next question will be: ‘Does the graphical unit in question represent a variant form of a known character, or does it represent a completely unencoded character?’ If the character is determined to be missing from the Unicode Standard, it would be helpful to submit the new character for inclusion (see http://unicode.org/pending/proposals.html).

These guidelines will help you proceed once you have identified a given graphical unit as either a variant or an unencoded character. Determining this will require knowledge of the contents of the document that you have. The first case will be called annotation of a character, while the second case will be called adding of a new character. How to handle graphical units that represent variants will be discussed below (5.3 Annotating Characters) while the problem of representing new characters will be dealt with in section 5.4 Adding New Characters.

While there is some overlap between these requirements, distinct specialized markup constructs have been created for each of these cases. These constructs are presented in section 5.2 Markup Constructs for Representation of Characters and Glyphs below.

5.2 Markup Constructs for Representation of Characters and Glyphs

An XML document can, in principle, contain any defined Unicode character. The standard allows these characters to be represented either directly, using an appropriate encoding (UTF-8 by default), or indirectly by means of a numeric character reference (NCR), such as &#xC4; (A-umlaut). The encoder can also restrict the range of characters which are represented directly in a document (or part of it) by adding a suitable encoding declaration. For example, if a document begins with the declaration <?xml version="1.0" encoding="iso-8859-1"?> any Unicode characters which are not in the ISO-8859-1 character set must be represented by NCRs.
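The interaction between the document encoding and NCRs can be sketched in Python. The sample text and the wrapping `p` element are assumptions made for illustration. The `xmlcharrefreplace` error handler of `str.encode` writes any character missing from the target encoding as a decimal NCR, and an XML parser resolves it again on input.

```python
import xml.etree.ElementTree as ET

# A-umlaut exists in ISO-8859-1; a-macron (U+0101) does not.
text = "\u00C4 and \u0101"

# Characters outside the encoding are written as numeric character references.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# Wrap the bytes in a minimal document carrying the matching declaration.
doc = b'<?xml version="1.0" encoding="iso-8859-1"?><p>' + encoded + b"</p>"

# The parser decodes the bytes and resolves the NCR back to U+0101.
parsed = ET.fromstring(doc).text
```

The round trip shows that both representations denote the same characters; only the serialization differs.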

The gaiji module defined by this chapter adds a further way of representing specific characters and glyphs in a document. (Gaiji is from Japanese 外字, meaning external characters.) This allows the encoder to distinguish characters and glyphs which Unicode regards as identical, to add new nonstandard characters or glyphs, and to represent Unicode characters not available in the document encoding by an alternative means.

The mechanism provided here consists functionally of two parts:

an element g, which serves as a proxy for new characters or glyphs
elements char and glyph, providing information about such characters or glyphs; these elements are stored in the charDecl element in the header.

When the gaiji module is included in a schema, the charDecl element is added to the model.encodingDescPart class, and the g element is added to the phrase class. These elements and their components are documented in the rest of this section.

The Unicode standard defines properties for all the characters it defines in the Unicode Character Database, knowledge of which is usually built into text processing systems. If the character represented by the g element does not exist in Unicode at all, its properties are not available. If the character represented is an existing Unicode character, but is not available in the document character set recognized by a given text processing system, it may also be convenient to have access to its properties in the same way. The char element makes it possible to store properties for use by such applications in a standard way.

The list of attributes (properties) for characters is modelled on those in the Unicode Character Database, which distinguishes normative and informative character properties. Additional, non-Unicode, properties may also be supplied. Since the list of properties will vary with different versions of the Unicode Standard, there may not be an exact correspondence between them and the list of properties defined in these Guidelines.

Usage examples for these elements are given below at 5.3 Annotating Characters and 5.4 Adding New Characters. The gaiji module itself is formally defined in section 5.10 Formal Definition below. It declares the following additional elements:

charDecl (character declarations) provides information about nonstandard characters and glyphs.
g (character or glyph) represents a non-standard character or glyph.
ref points to a description of the character or glyph intended.

The charDecl element is a member of the class model.encodingDescPart, and thus becomes available within encodingDesc when this module is included in a schema. The g element is the only member of the class model.gLike: this class is referenced as an alternative to plain text in almost every element which contains plain text, thus permitting the g element also to appear at such places when this module is included in a schema.

The following elements may appear within a charDecl element:

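desc (description) contains a brief description of the purpose and application of an element, attribute, or attribute value.
char (character) provides descriptive information about a character.
glyph (character glyph) provides descriptive information about a character glyph.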

The char and glyph elements have similar contents and are used in similar ways, but their functions are different. The char element is provided to define a character which is not available in the current document character set, for whatever reason, as stated above. The glyph element is used to annotate a character that has already been defined somewhere (either in the document character set, or through a char element) by providing a specific glyph that shows how a character appeared in the original document. This is necessary since Unicode code points refer not to a single, specific glyph shape of a character, but rather to a set of glyphs, any of which may be used to render the code point in question; in some cases they can differ considerably.

The glyph element is provided for cases where the encoder wants to specify a specific glyph (or family of glyphs) out of all possible glyphs. Unfortunately, due to the way Unicode has been defined, there are cases where several glyphs that logically belong together have been given separate code points, especially in the blocks defining East Asian characters. In such cases, glyph elements can also be used to express the view that these apparently distinct characters are to be regarded as instances of the same character (see further 5.3 Annotating Characters).

The Unicode Standard recommends naming conventions which should be followed strictly where the intention is to annotate an existing Unicode character, and which may also be used as a model when creating new names for characters or glyphs²⁵. For convenience of processing, the following distinct elements are proposed for naming characters and glyphs:

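charName (character name) contains the name of a character, as expressed following Unicode conventions.
glyphName (character glyph name) contains the name of a glyph, expressed following Unicode conventions for character names.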

Within both char and glyph, the following elements are available:

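gloss identifies a phrase or word used to provide a gloss or definition for some other word or phrase.
charProp (character property) provides a name and value for some property of the parent character or glyph.
desc (description) contains a brief description of the purpose and application of an element, attribute, or attribute value.
mapping (character mapping) contains one or more characters which are related to the parent character or glyph in some respect, as specified by the type attribute.
figure groups elements representing or containing graphic information such as an illustration or figure.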
note contains a note or annotation.

Four of these elements (gloss, desc, figure, and note) are defined by other TEI modules, and their usage here is no different from their usage elsewhere. The figure element, however, is used here only to link to an image of the character or glyph under discussion, or to contain a representation of it in SVG. The figure element may contain more than one graphic element, for example to provide images with different resolution, or in different formats, or may itself be repeated. As elsewhere, the mimeType attribute of graphic should be used to specify the format of the image.

The mapping element is similar to the standard TEI equiv element. While the latter is used to express correspondence relationships between TEI concepts or elements and those in other systems or ontologies, the former is used to express any kind of relationship between the character or glyph under discussion and characters or glyphs defined elsewhere. It may contain any Unicode character, or a g element linked to some other char or glyph element, if, for example, the intention is to express an association between two non-standard characters. The type of association is indicated by the type attribute, which may take such values as exact for exact equivalences, uppercase for uppercase equivalences, lowercase for lowercase equivalences, standard for standardized forms, and simplified for simplified characters, etc., as in the following example:

<charDecl>
<char xml:id="aenl">
  <charName>LATIN LETTER ENLARGED SMALL A</charName>
  <charProp>
   <localName>entity</localName>
   <value>aenl</value>
  </charProp>
  <mapping type="standard">a</mapping>
</char>
</charDecl>
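To suggest how a processor might use such a declaration, the following Python sketch resolves each g element to the standard form recorded in its char declaration. It is a minimal illustration under several assumptions: the wrapping `doc` element, the sample paragraph containing `<g ref="#aenl"/>`, and the function names are invented for this example, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: the declaration above plus one paragraph using it.
SAMPLE = """<doc>
<charDecl>
<char xml:id="aenl">
  <charName>LATIN LETTER ENLARGED SMALL A</charName>
  <charProp>
   <localName>entity</localName>
   <value>aenl</value>
  </charProp>
  <mapping type="standard">a</mapping>
</char>
</charDecl>
<p>W<g ref="#aenl"/>lter</p>
</doc>"""

def standard_forms(root):
    """Map '#id' references to the text of each standard mapping."""
    table = {}
    for decl in root.iter():
        if decl.tag not in ("char", "glyph"):
            continue
        for mapping in decl.findall("mapping"):
            if mapping.get("type") == "standard":
                table["#" + decl.get(XML_ID)] = mapping.text or ""
    return table

def flatten(elem, table):
    """Return the text of elem, replacing each g by its standard form."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "g":
            # Fall back to the content of g if no declaration is found.
            parts.append(table.get(child.get("ref"), child.text or ""))
        else:
            parts.append(flatten(child, table))
        parts.append(child.tail or "")
    return "".join(parts)

root = ET.fromstring(SAMPLE)
table = standard_forms(root)
plain = flatten(root.find("p"), table)
```

A processor interested in the original appearance could instead look up the figure or desc of the same declaration; the lookup by ref is the same in either case.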

Values of the Unicode general category property:

`Lu`	Letter, uppercase
`Ll`	Letter, lowercase
`Lt`	Letter, titlecase
`Lm`	Letter, modifier
`Lo`	Letter, other
`Mn`	Mark, nonspacing
`Mc`	Mark, spacing combining
`Me`	Mark, enclosing
`Nd`	Number, decimal digit
`Nl`	Number, letter
`No`	Number, other
`Pc`	Punctuation, connector
`Pd`	Punctuation, dash
`Ps`	Punctuation, open
`Pe`	Punctuation, close
`Pi`	Punctuation, initial quote
`Pf`	Punctuation, final quote
`Po`	Punctuation, other
`Sm`	Symbol, math
`Sc`	Symbol, currency
`Sk`	Symbol, modifier
`So`	Symbol, other
`Zs`	Separator, space
`Zl`	Separator, line
`Zp`	Separator, paragraph
`Cc`	Other, control
`Cf`	Other, format
`Cs`	Other, surrogate
`Co`	Other, private use
`Cn`	Other, not assigned
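The two-letter codes listed above are the values reported by Python's standard `unicodedata.category` function, which can be used to check the value a given character carries. In this sketch the `MAJOR_CLASS` table and the `major_class` helper are assumptions made for illustration; the first letter of each code identifies the major class.

```python
import unicodedata

# First letter of the two-letter code gives the major class.
MAJOR_CLASS = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}

def major_class(ch):
    """Return the major class name for a single character."""
    return MAJOR_CLASS[unicodedata.category(ch)[0]]

samples = {
    "A": unicodedata.category("A"),            # uppercase letter
    "\u01C5": unicodedata.category("\u01C5"),  # titlecase digraph
    "\u0301": unicodedata.category("\u0301"),  # combining acute accent
    "5": unicodedata.category("5"),            # decimal digit
    "\u20AC": unicodedata.category("\u20AC"),  # euro sign
    " ": unicodedata.category(" "),            # space
    "\uE000": unicodedata.category("\uE000"),  # first Private Use Area code point
}
```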

Values of the Unicode bidirectional category property:

`L`	left to right
`LRE`	left to right embedding
`LRO`	left to right override
`R`	right to left
`AL`	right to left Arabic
`RLE`	right to left embedding
`RLO`	right to left override
`PDF`	Pop Directional Format
`EN`	European Number
`ES`	European Number Separator
`ET`	European Number Terminator
`AN`	Arabic Number
`CS`	Common Number Separator
`NSM`	Non-spacing Mark
`BN`	Boundary Neutral
`B`	Paragraph separator
`S`	Segment separator
`WS`	Whitespace
`ON`	Other neutrals
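The directional codes above can be inspected in the same way with `unicodedata.bidirectional`. The helper name `bidi_classes` and the sample characters are assumptions made for this sketch.

```python
import unicodedata

def bidi_classes(text):
    """Return the bidirectional category of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Latin A, Hebrew alef, Arabic beh.
letters = bidi_classes("A\u05D0\u0628")

# European digit, Arabic-Indic digit, space.
others = bidi_classes("1\u0663 ")

# Explicit formatting characters: RIGHT-TO-LEFT EMBEDDING and
# POP DIRECTIONAL FORMATTING.
controls = bidi_classes("\u202B\u202C")
```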

Values of the Unicode canonical combining class property:

`0`	Spacing, split, enclosing, reordrant, and Tibetan subjoined
`1`	Overlays and interior
`7`	Nuktas
`8`	Hiragana/Katakana voicing marks
`9`	Viramas
`10`	Start of fixed position classes
`199`	End of fixed position classes
`200`	Below left attached
`202`	Below attached
`204`	Below right attached
`208`	Left attached (reordrant around single base character)
`210`	Right attached
`212`	Above left attached
`214`	Above attached
`216`	Above right attached
`218`	Below left
`220`	Below
`222`	Below right
`224`	Left (reordrant around single base character)
`226`	Right
`228`	Above left
`230`	Above
`232`	Above right
`233`	Double below
`234`	Double above
`240`	Below (iota subscript)
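The numeric classes above are returned by `unicodedata.combining`, and they determine the order in which combining marks are arranged during normalization. The following sketch, with sample characters chosen for illustration, shows a few values and the reordering of a mark placed below (220) before one placed above (230).

```python
import unicodedata

classes = {
    "a": unicodedata.combining("a"),            # not a combining mark
    "\u093C": unicodedata.combining("\u093C"),  # Devanagari nukta
    "\u3099": unicodedata.combining("\u3099"),  # kana voicing mark
    "\u094D": unicodedata.combining("\u094D"),  # Devanagari virama
    "\u0323": unicodedata.combining("\u0323"),  # combining dot below
    "\u0308": unicodedata.combining("\u0308"),  # combining diaeresis
    "\u0345": unicodedata.combining("\u0345"),  # Greek iota subscript
}

# Diaeresis typed before dot below: normalization sorts marks by class,
# so the class 220 mark moves in front of the class 230 mark.
typed = "a\u0308\u0323"
ordered = unicodedata.normalize("NFD", typed)
```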

Compatibility formatting tags used in Unicode character decomposition mappings:

`font`	A font variant (e.g. a blackletter form)
`noBreak`	A no-break version of a space or hyphen
`initial`	An initial presentation form (Arabic)
`medial`	A medial presentation form (Arabic)
`final`	A final presentation form (Arabic)
`isolated`	An isolated presentation form (Arabic)
`circle`	An encircled form
`super`	A superscript form
`sub`	A subscript form
`vertical`	A vertical layout presentation form
`wide`	A wide (or zenkaku) compatibility character
`narrow`	A narrow (or hankaku) compatibility character
`small`	A small variant form (CNS compatibility)
`square`	A CJK squared font variant
`fraction`	A vulgar fraction form
`compat`	Otherwise-unspecified compatibility character
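In the Unicode Character Database these tags appear in angle brackets at the start of a compatibility decomposition mapping. The sketch below extracts the tag with `unicodedata.decomposition`; the helper name `decomposition_tag` is an assumption made for this example. A canonical decomposition carries no tag.

```python
import unicodedata

def decomposition_tag(ch):
    """Return the compatibility tag of ch, or None if it has none."""
    fields = unicodedata.decomposition(ch).split()
    if fields and fields[0].startswith("<"):
        return fields[0].strip("<>")
    return None

tags = {
    "\u00A0": decomposition_tag("\u00A0"),  # NO-BREAK SPACE
    "\u00BD": decomposition_tag("\u00BD"),  # VULGAR FRACTION ONE HALF
    "\u2460": decomposition_tag("\u2460"),  # CIRCLED DIGIT ONE
    "\u00B2": decomposition_tag("\u00B2"),  # SUPERSCRIPT TWO
    "\uFF21": decomposition_tag("\uFF21"),  # FULLWIDTH LATIN CAPITAL LETTER A
    "\u00C4": decomposition_tag("\u00C4"),  # canonical decomposition only
}
```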

P5: Guidelines for Electronic Text Encoding and Interchange