5 Representation of Non-standard Characters and Glyphs
Despite the availability of Unicode, text encoders still sometimes find that the published repertoire of available characters is inadequate to their needs. This is particularly the case when dealing with ancient languages, for which encoding standards do not yet exist, or where an encoder wishes to represent variant forms of a character or glyphs. The module defined by this chapter provides a mechanism to satisfy that need, while retaining compatibility with standards.
5.1 Is Your Journey Really Necessary?
When encoders encounter some graphical unit in a document which is to be represented electronically, the first issue to be resolved should be ‘Is this really a different character?’ To determine whether a particular graphical unit is a character or not, see Terminology and key concepts.
If the unit is indeed determined to be a character, the next question should be ‘Has this character been encoded already?’ To determine whether a character has been encoded, encoders should take the following steps:
- Check the Unicode web site at http://www.unicode.org, in particular the page "Where is my Character?", and the associated character code charts. Alternatively, users can check the latest published version of The Unicode Standard (Unicode Consortium (2006)), though the web site is often more up to date than the printed version, and should be checked for preference.
  The pictures (‘glyphs’) in the Unicode code charts are only meant to be representative, not definitive. If a specific form of an already encoded character is required for a project, refer to the guidelines contained below under Annotating Characters. Remember that your encoded document may be rendered on a system which has different fonts from yours: if the specific form of a character is important to you, then you should document it.
- Check the Proposed New Characters web page (http://unicode.org/alloc/Pipeline.html) to see whether the character is in line for approval.
- Ask on the Unicode email list (http://www.unicode.org/consortium/distlist.html) to see whether a proposal is pending, or to determine whether this character is considered eligible for addition to the Unicode Standard.
Since there are now close to 100,000 characters in Unicode, chances are good that what you need is already there, but it might not be easy to find, since it might have a different name in Unicode. Look again, this time at other sites, for example http://www.eki.ee/letter, which also provide searches based on scripts and languages. Take care, however, that all the properties of what seems to be a relevant character are consistent with those of the character you are looking for. For example, if your character is definitely a digit, but the properties of the best match you can find for it say that it is a letter, you may have a character not yet defined in Unicode.
In general, it is advisable to avoid Unicode characters generally described as presentation forms (see note 21). However, if the character you are looking for is being used in a notation (rather than as part of the orthography of a language) then it is quite acceptable to select characters from the Mathematical Operators block, provided that they have the appropriate properties (i.e. So: Symbol, Other; or Sm: Symbol, Math).
An encoded character may be precomposed or it may be formed from base characters and combining diacritical marks. Either will suffice for a character to be "found" as an encoded character.
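As a small illustration of the point above, the same letter can be written as a precomposed character or as a base character followed by a combining mark. The sample word and the use of numeric character references are assumptions made for the sake of the sketch; either form counts as an already encoded character.

```xml
<!-- Precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE -->
<p>caf&#x00E9;</p>

<!-- Composed: base letter e followed by U+0301 COMBINING ACUTE ACCENT -->
<p>cafe&#x0301;</p>
```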
If there are several possible Unicode characters to choose amongst, it is good practice to consult other colleagues and practitioners to see whether a consensus has emerged in favour of one or other of them.
If, however, no suitable form of your character seems to exist, the next question will be: ‘Does the graphical unit in question represent a variant form of a known character, or does it represent a completely unencoded character?’ If the character is determined to be missing from the Unicode Standard, it would be helpful to submit the new character for inclusion (see http://unicode.org/pending/proposals.html).
These guidelines will help you proceed once you have identified a given graphical unit as either a variant or an unencoded character. Determining this will require knowledge of the contents of the document that you have. The first case will be called annotation of a character, while the second case will be called adding of a new character. How to handle graphical units that represent variants will be discussed below (5.3 Annotating Characters) while the problem of representing new characters will be dealt with in section 5.4 Adding New Characters.
While there is some overlap between these requirements, distinct specialized markup constructs have been created for each of these cases as explained in section 5.2 Markup Constructs for Representation of Characters and Glyphs below. The following sections will then proceed to discuss how to apply them to the problems at hand, discussing annotation of existing characters in section 5.3 Annotating Characters and finally creation of new ones in 5.4 Adding New Characters.
5.2 Markup Constructs for Representation of Characters and Glyphs
An XML document can, in principle, contain any defined Unicode character. The standard allows these characters to be represented either directly, using an appropriate encoding (UTF-8 by default), or indirectly by means of numeric character references (NCR), such as &#xC4; (A-umlaut). The encoder can also restrict the range of characters which are represented directly in a document (or part of it) by adding a suitable encoding declaration. For example, if a document begins with the declaration <?xml version="1.0" encoding="iso-8859-1"?> any Unicode characters which are not in the ISO-8859-1 character set must be represented by NCRs.
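The following fragment is a minimal sketch of this situation. It assumes the file is actually saved in ISO-8859-1; the sample words are invented for illustration.

```xml
<?xml version="1.0" encoding="iso-8859-1"?>
<p>
  <!-- U+00C4 is part of ISO-8859-1: it may appear directly or as an NCR -->
  Ärger, &#xC4;rger
  <!-- U+4EBA (a CJK ideograph) is not part of ISO-8859-1: only the NCR can be used -->
  &#x4EBA;
</p>
```

An XML parser reports the same characters to the application whichever form is used; the choice affects only how the file is stored.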
The gaiji module defined by this chapter adds a further way of representing specific characters and glyphs in a document. (Gaiji is from Japanese 外字, meaning external characters.) This allows the encoder to distinguish characters and glyphs which Unicode regards as identical, to add new nonstandard characters or glyphs, and to represent Unicode characters not available in the document encoding by an alternative means.
The mechanism provided here consists functionally of two parts:
- an element g, which serves as a proxy for new characters or glyphs
- elements char and glyph, providing information about such characters or glyphs; these elements are stored in the charDecl element in the header.
When the gaiji module is included in a schema, the charDecl element is added to the model.encodingDescPart class, and the g element is added to the phrase class. These elements and their components are documented in the rest of this section.
The Unicode standard defines properties for all the characters it defines in the Unicode Character Database, knowledge of which is usually built into text processing systems. If the character represented by the g element does not exist in Unicode at all, its properties are not available. If the character represented is an existing Unicode character, but is not available in the document character set recognized by a given text processing system, it may also be convenient to have access to its properties in the same way. The char element makes it possible to store properties for use by such applications in a standard way.
The list of attributes (properties) for characters is modelled on those in the Unicode Character Database, which distinguishes normative and informative character properties. Additional, non-Unicode, properties may also be supplied. Since the list of properties will vary with different versions of the Unicode Standard, there may not be an exact correspondence between them and the list of properties defined in these Guidelines.
Usage examples for these elements are given below at 5.3 Annotating Characters and 5.4 Adding New Characters. The gaiji module itself is formally defined in section 5.6 Module Character and Glyph Documentation below. It declares the following additional elements:
- charDecl (character declarations) provides information about nonstandard characters and glyphs.
- g (character or glyph) represents a glyph, or a non-standard character.
  ref points to a description of the character or glyph intended.
The charDecl element is a member of the class model.encodingDescPart, and thus becomes available within encodingDesc when this module is included in a schema. The g element is the only member of the class model.gLike: this class is referenced as an alternative to plain text in almost every element which contains plain text, thus permitting the g element also to appear at such places when this module is included in a schema.
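As a rough sketch of how these pieces fit together, the fragment below places a charDecl inside encodingDesc and references one of its glyph definitions from the text with g. The glyph name and identifier are borrowed from the examples in 5.3 Annotating Characters; the omitted header content and the sample word are placeholders only.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- fileDesc and other required header content omitted from this sketch -->
    <encodingDesc>
      <charDecl>
        <glyph xml:id="r1">
          <glyphName>LATIN SMALL LETTER R WITH ONE FUNNY STROKE</glyphName>
        </glyph>
      </charDecl>
    </encodingDesc>
  </teiHeader>
  <text>
    <body>
      <!-- g stands in for plain text and points to the glyph declared above -->
      <p>Wo<g ref="#r1">r</g>ds</p>
    </body>
  </text>
</TEI>
```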
The following elements may appear within a charDecl element:
- desc (description) contains a brief description of the object documented by its parent element, including its intended usage, purpose, or application where this is appropriate.
- char (character) provides descriptive information about a character.
- glyph (character glyph) provides descriptive information about a glyph.
The char and glyph elements have similar contents and are used in similar ways, but their functions are different. The char element is provided to define a character which is not available in the current document character set, for whatever reason, as stated above. The glyph element is used to annotate a character that has already been defined somewhere (either in the document character set, or through a char element) by providing a specific glyph that shows how a character appeared in the original document. This is necessary since Unicode code points refer not to a single, specific glyph shape of a character, but rather to a set of glyphs, any of which may be used to render the code point in question; in some cases they can differ considerably.
The glyph element is provided for cases where the encoder wants to specify a specific glyph (or family of glyphs) out of all possible glyphs. Unfortunately, due to the way Unicode has been defined, there are cases where several glyphs that logically belong together have been given separate code points, especially in the blocks defining East Asian characters. In such cases, glyph elements can also be used to express the view that these apparently distinct characters are to be regarded as instances of the same character (see further 5.3 Annotating Characters).
The Unicode Standard recommends naming conventions which should be followed strictly where the intention is to annotate an existing Unicode character, and which may also be used as a model when creating new names for characters or glyphs. For some characters the Unicode name is derived from the code point itself, as in ‘CJK UNIFIED IDEOGRAPH-4E00’, where 4E00 is simply the Unicode code point value (U+4E00) of the character in question. In cases where no Unicode code point exists, there is little hope of finding a name that helps to identify the character. Names should therefore be constructed in a way meaningful to local practice, for example by using a reference number from a well-known character dictionary or a project-specific serial number. For convenience of processing, the following distinct elements are proposed for naming characters and glyphs:
- charName (character name) contains the name of a character, expressed following Unicode conventions.
- glyphName (character glyph name) contains the name of a glyph, expressed following Unicode conventions for character names.
Within both char and glyph, the following elements are available:
- gloss (gloss) identifies a phrase or word used to provide a gloss or definition for some other word or phrase.
- charProp (character property) provides a name and value for some property of the character or glyph defined in the parent element.
- desc (description) contains a brief description of the object documented by its parent element, including its intended usage, purpose, or application where this is appropriate.
- mapping (character mapping) contains one or more characters which are related in some respect (specified by the type attribute) to the glyph or character defined in the parent element.
- figure (figure) groups elements representing or containing graphic information such as an illustration or figure.
- note contains a note or annotation.
Four of these elements (gloss, desc, figure, and note) are defined by other TEI modules, and their usage here is no different from their usage elsewhere. The figure element, however, is used here only to link to an image of the character or glyph under discussion, or to contain a representation of it in SVG. The figure element may contain more than one graphic element, for example to provide images with different resolution, or in different formats, or may itself be repeated. As elsewhere, the mimeType attribute of graphic should be used to specify the format of the image.
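A minimal sketch of a glyph declaration carrying two images of the same glyph in different formats might look as follows. The glyph is taken from the examples in 5.3 Annotating Characters; the second file name (r1img.svg) is hypothetical.

```xml
<glyph xml:id="r1">
  <glyphName>LATIN SMALL LETTER R WITH ONE FUNNY STROKE</glyphName>
  <figure>
    <!-- mimeType records the format of each image -->
    <graphic url="r1img.png" mimeType="image/png"/>
    <graphic url="r1img.svg" mimeType="image/svg+xml"/> <!-- hypothetical file -->
  </figure>
</glyph>
```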
The type attribute of the mapping element may take values such as exact for exact equivalences, uppercase for uppercase equivalences, lowercase for lowercase equivalences, standard for standardized forms, and simplified for simplified characters, as in the following examples:
<charDecl>
  <char xml:id="aenl">
    <charName>LATIN LETTER ENLARGED SMALL A</charName>
    <charProp>
      <localName>entity</localName>
      <value>aenl</value>
    </charProp>
    <mapping type="standard">a</mapping>
  </char>
</charDecl>
<charDecl>
  <glyph xml:id="z103">
    <glyphName>LATIN LETTER Z WITH TWO STROKES</glyphName>
    <mapping type="standard">Z</mapping>
    <mapping type="PUA">U+E304</mapping>
  </glyph>
</charDecl>
A more precise documentation of the properties of any character or glyph may be supplied using the generic charProp element described in the next section. Despite its name, this element may be used for either characters or glyphs.
5.2.1 Character Properties
The Unicode Standard documents ‘ideal’ characters, defined by reference to a number of properties (or attribute-value pairs) which they are said to possess. For example, a lowercase letter is said to have the value Ll for the property general-category. The Standard distinguishes between normative properties (i.e. properties which form part of the definition of a given character), and informative or additional properties which are not normative. It also allows for the addition of new properties, and (in some circumstances) alteration of the values currently assigned to certain properties. When making such modifications, great care should be taken not to override standard informative properties for characters which already exist in the Unicode Standard, as documented in Freytag (2006).
The charProp element allows an encoder to supply information about a character or glyph. Where the information concerned relates to a property which has already been identified in the Unicode Standard, encoders are urged to use the appropriate Unicode property name.
The following elements are used to record character properties:
- unicodeName (Unicode property name) contains the name of a registered Unicode normative or informative property.
- localName (locally-defined property name) contains a locally defined name for some property.
- value (value) contains a single value for some property, attribute, or other analysis.
For each property, the encoder must supply either a unicodeName or a localName, followed by a value.
For convenience, we list here some of the normative character properties and their values. For full information, refer to chapter 4 of The Unicode Standard, or the online documentation of the Unicode Character Database.
- general-category
- The general category (described in the Unicode Standard chapter 4 section 5) is an assignment to some major classes and subclasses of characters. Suggested values for this property are listed here:
    Lu  Letter, uppercase
    Ll  Letter, lowercase
    Lt  Letter, titlecase
    Lm  Letter, modifier
    Lo  Letter, other
    Mn  Mark, nonspacing
    Mc  Mark, spacing combining
    Me  Mark, enclosing
    Nd  Number, decimal digit
    Nl  Number, letter
    No  Number, other
    Pc  Punctuation, connector
    Pd  Punctuation, dash
    Ps  Punctuation, open
    Pe  Punctuation, close
    Pi  Punctuation, initial quote
    Pf  Punctuation, final quote
    Po  Punctuation, other
    Sm  Symbol, math
    Sc  Symbol, currency
    Sk  Symbol, modifier
    So  Symbol, other
    Zs  Separator, space
    Zl  Separator, line
    Zp  Separator, paragraph
    Cc  Other, control
    Cf  Other, format
    Cs  Other, surrogate
    Co  Other, private use
    Cn  Other, not assigned
- directional-category
- This property applies to all Unicode characters. It governs the application of the algorithm for bi-directional behaviour, as further specified in Unicode Annex 9, The Bidirectional Algorithm. The following 19 different values are currently defined for this property in Davis et al (2006):
    L    left to right
    LRE  left to right embedding
    LRO  left to right override
    R    right to left
    AL   right to left Arabic
    RLE  right to left embedding
    RLO  right to left override
    PDF  Pop Directional Format
    EN   European Number
    ES   European Number Separator
    ET   European Number Terminator
    AN   Arabic Number
    CS   Common Number Separator
    NSM  Non-spacing Mark
    BN   Boundary Neutral
    B    Paragraph separator
    S    Segment separator
    WS   Whitespace
    ON   Other neutrals
- canonical-combining-class
- This property exists for characters that are not used independently, but in combination with other characters, for example combining diacritical marks. It records a class for these characters, which is used to determine how they interact typographically. The following values are defined in the Unicode Standard 5.0 (see Unicode Character Database: Canonical Combining Class Values):
    0    Spacing, split, enclosing, reordrant, and Tibetan subjoined
    1    Overlays and interior
    7    Nuktas
    8    Hiragana/Katakana voicing marks
    9    Viramas
    10   Start of fixed position classes
    199  End of fixed position classes
    200  Below left attached
    202  Below attached
    204  Below right attached
    208  Left attached (reordrant around single base character)
    210  Right attached
    212  Above left attached
    214  Above attached
    216  Above right attached
    218  Below left
    220  Below
    222  Below right
    224  Left (reordrant around single base character)
    226  Right
    228  Above left
    230  Above
    232  Above right
    233  Double below
    234  Double above
    240  Below (iota subscript)
- character-decomposition-mapping
- This property is defined for characters which may be decomposed, for example to a canonical form plus a typographic variation of some kind. For such characters the Unicode standard specifies both a decomposition type and a decomposition mapping (i.e. another Unicode character to which this one may be mapped in the way specified by the decomposition type). The following types of mapping are defined in the Unicode Standard:
    font      A font variant (e.g. a blackletter form)
    noBreak   A no-break version of a space or hyphen
    initial   An initial presentation form (Arabic)
    medial    A medial presentation form (Arabic)
    final     A final presentation form (Arabic)
    isolated  An isolated presentation form (Arabic)
    circle    An encircled form
    super     A superscript form
    sub       A subscript form
    vertical  A vertical layout presentation form
    wide      A wide (or zenkaku) compatibility character
    narrow    A narrow (or hankaku) compatibility character
    small     A small variant form (CNS compatibility)
    square    A CJK squared font variant
    fraction  A vulgar fraction form
    compat    Otherwise-unspecified compatibility character
- numeric-value
- This property applies for any character which expresses any kind of numeric value. Its value is the intended value in decimal notation.
- mirrored
- The mirrored character property is used to properly render characters such as U+0028, LEFT PARENTHESIS, independent of the text direction: it has the value Y (character is mirrored) or N (character is not mirrored).
The Unicode Standard also defines a set of informative (but non-normative) properties for Unicode characters. If encoders want to provide such properties, they may be included using the suggested Unicode name, tagged using the unicodeName element. However, encoders may also supply other locally-defined properties, which must be named using the localName element to distinguish them. If a Unicode name exists for a given property, it should however always be preferred to a locally defined name. Locally defined names should be used only for properties which are not specified by the Unicode Standard.
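By way of illustration, the sketch below records several of the properties listed above for a single character, together with one locally defined property. The character, its name, and its identifier are hypothetical; only the property names and values are taken from the lists in this section.

```xml
<char xml:id="exampleChar"> <!-- hypothetical identifier -->
  <charName>EXAMPLE LETTER</charName> <!-- hypothetical name -->
  <!-- Properties known to Unicode are named with unicodeName -->
  <charProp>
    <unicodeName>general-category</unicodeName>
    <value>Ll</value>
  </charProp>
  <charProp>
    <unicodeName>directional-category</unicodeName>
    <value>L</value>
  </charProp>
  <charProp>
    <unicodeName>canonical-combining-class</unicodeName>
    <value>0</value>
  </charProp>
  <charProp>
    <unicodeName>mirrored</unicodeName>
    <value>N</value>
  </charProp>
  <!-- A project-specific property is named with localName instead -->
  <charProp>
    <localName>entity</localName>
    <value>exampleChar</value>
  </charProp>
</char>
```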
5.3 Annotating Characters
Annotation of a character becomes necessary when it is desired to distinguish it on the basis of certain aspects (typically, its graphical appearance) only. In a manuscript, for example, where distinctly different forms of the letter "r" can be recognized, it might be useful to distinguish them for analytic purposes, quite distinct from the need to provide an accurate representation of the page. A digital facsimile, particularly one linked to a transcribed and encoded version of the text, will always provide a superior visual representation (for information on how to link a digital facsimile to a transcribed text see 11.1 Digital Facsimiles), but cannot be used to support arguments based on the distribution of such different forms. Character annotation as described here provides a solution to this problem (see note 22).
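Suppose, for example, that two forms of the letter r are to be distinguished. Each can be declared by a glyph element within the charDecl:
<charDecl>
  <glyph xml:id="r1">
    <glyphName>LATIN SMALL LETTER R WITH ONE FUNNY STROKE</glyphName>
    <charProp>
      <localName>entity</localName>
      <value>r1</value>
    </charProp>
    <figure>
      <graphic url="r1img.png"/>
    </figure>
  </glyph>
  <glyph xml:id="r2">
    <glyphName>LATIN SMALL LETTER R WITH TWO FUNNY STROKES</glyphName>
    <charProp>
      <localName>entity</localName>
      <value>r2</value>
    </charProp>
    <figure>
      <graphic url="r2img.png"/>
    </figure>
  </glyph>
</charDecl>
Occurrences of these forms in the text can then be marked with the g element, which points to the relevant glyph:
<p>Wo<g ref="#r1">r</g>ds in this manusc<g ref="#r2">r</g>ipt are sometimes
written in a funny way.</p>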
Glyphs for ligatures and for abbreviation signs can be declared and referenced in the same way:
<!-- in the charDecl -->
<glyph xml:id="Filig">
  <glyphName>LATIN UPPER F AND LATIN LOWER I LIGATURE</glyphName>
  <figure>
    <graphic url="Filig.png"/>
  </figure>
</glyph>
<p><abbr><g ref="#per">per</g></abbr> ardua</p>
<!-- in the charDecl -->
<glyph xml:id="per">
  <glyphName>LATIN ABBREVIATION PER</glyphName>
  <figure>
    <graphic url="per.png"/>
  </figure>
</glyph>
(Unicode does define a code point for a fi ligature; the encoder may however prefer not to use it in order to simplify other text processing operations, such as indexing.) With this markup in place, it will be possible to write programs to analyze the distribution of the different letters "r" as well as produce more ‘faithful’ renderings of the original. It will also be possible to produce normalized versions by simply ignoring the annotation pointed to by the element g.
For brevity of encoding, it may be preferred to predefine internal entities such as the following:
<!ENTITY r1 '<g ref="#r1">r</g>'>
<!ENTITY r2 '<g ref="#r2">r</g>'>
which would enable the same material to be encoded as follows:
<p>Wo&r1;ds in this manusc&r2;ipt are sometimes written in a funny way.</p>
The same technique may be used to represent particular abbreviation marks as well as to represent other characters or glyphs. For example, if we believe that the r-with-one-funny-stroke is being used as an abbreviation for receipt, this might be represented as follows:
<abbr>&r1;</abbr>
Note however that this technique employs markup objects to provide a link between a character in the document and some annotation on that character. Therefore, it cannot be used in places where such markup constructs are not allowed, notably in attribute values.
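A glyph element can also express the view that characters to which Unicode assigns separate code points are to be regarded as instances of the same character. Here, for example, the character 說 is associated with the form 説, which is treated as its standard form:
<charDecl>
  <glyph xml:id="u8aaa">
    <mapping type="Unicode">說</mapping>
    <mapping type="standard">説</mapping>
  </glyph>
</charDecl>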
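A glyph may also be declared as a variant of a character that is itself defined by a char element; in that case the mapping contains a g element pointing to the character definition:
<charDecl>
  <char xml:id="newchar1">
    <!-- more properties here -->
  </char>
  <glyph xml:id="varofnewchar1">
    <!-- more properties here -->
    <mapping type="standard">
      <g ref="#newchar1"/>
    </mapping>
  </glyph>
</charDecl>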
5.4 Adding New Characters
The creation of additional characters for use in text encoding is quite similar to the annotation of existing characters. The same element g is used to provide a link from the character instance in the text to a character definition provided within the charDecl element. This character definition takes the form of a char element. The element g itself will usually be empty, but could contain a code point from the Private Use Area (PUA) of the Unicode Standard, which is an area set aside for the very purpose of privately adding new characters to a document. Recommendations on how to use such PUA characters are given in the following section.
Suppose, for example, that a transcription uses an entity reference such as &ydotacute; to stand for a letter y with dot above and acute accent, which when the transcription is processed can then be expanded in one of three ways, depending on the mapping in force. The entity reference might be translated into the sequence of corresponding Unicode code points or into some locally-defined PUA character (say U+E0A4) for local processing only. Both these options have disadvantages: the former loses the fact that the sequence of composed characters is regarded as a single object; the second is not reliably portable. Therefore, the recommended representation is to use the g element defined by the module described in this chapter:
<g ref="#ydotacute"/>
which references the following character definition:
<char xml:id="ydotacute">
  <charName>LATIN SMALL LETTER Y WITH DOT ABOVE AND ACUTE</charName>
  <charProp>
    <localName>entity</localName>
    <value>ydotacute</value>
  </charProp>
  <mapping type="composed">ẏ́</mapping>
  <mapping type="PUA">U+E0A4</mapping>
</char>
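Similarly, where a text contains a circled variant of an existing ideograph (for example 人), the circled variant might conveniently be represented as
<g ref="#U4EBA_circled"/>
which references the following character definition:
<char xml:id="U4EBA_circled">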
<charName>CIRCLED IDEOGRAPH</charName>
<charProp>
<unicodeName>character-decomposition-mapping</unicodeName>
<value>circle</value>
</charProp>
<charProp>
<localName>daikanwa</localName>
<value>36</value>
</charProp>
<mapping type="standard"> 人
</mapping>
<mapping type="PUA"> 
</mapping>
</char>
In this example, the ‘circled ideograph’ character has been defined with two mappings, and with two properties. The two properties are the Unicode-defined character-decomposition-mapping, which specifies that this is a circled character, using the appropriate terminology (see 5.2.1 Character Properties above), and a locally defined property known as ‘daikanwa’. The two mappings indicate firstly that the standard form of this character is the character 人, and secondly that the
character used to represent this character locally is the PUA
character 
. For convenience of local
processing this PUA character may in fact appear as content of
the g element. In general, however, the g element
will be empty.
5.5 How to Use Code Points from the Private Use Area
The developers of the Unicode Standard have set aside an area of the codespace for the private use of software vendors, user groups, or individuals. As of this writing (Unicode 5.0), there are around 137,000 code points available in this area, which should be enough for most needs. No code point assignments will be made to this area by standards bodies and only some very basic default properties have been assigned (which may be overridden where necessary by the mechanism outlined in this chapter). Therefore, unlike all other code points defined by the Unicode Standard, PUA code points should not be used directly in documents intended for blind interchange.
In the two previous examples, we mentioned that the variant characters concerned might well be assigned specific code points from the PUA. This might, for example, facilitate the use of a particular font which displays the desired character at this code point in the local processing environment. Since however this assignment would be valid only on the local site, documents containing such code points are unsuitable for blind interchange. During the process of preparing such documents for interchange, any PUA code points should be replaced by an appropriate use of the g element, such as <g ref="#xxxx">, thus associating the character required with the documentation of it provided by the referenced char element. The PUA character used during the preparation of the document might be recorded in the char element, as shown in the example in 5.4 Adding New Characters, or retained as content of the g element. However, since there is no requirement that the same PUA character be used to represent it at the receiving site, and since it may well be the case that this other site has already made an assignment of some other character to the original PUA code point, it is best practice to remove the locally-defined PUA character. It is to be expected that a further translation into the local processing environment at the receiving site will be necessary to handle such characters, during which variant letters can be converted to hitherto unused code points on the basis of the information provided in the char element.
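The replacement described above might be sketched as follows. The PUA code point U+E0A4 is the one recorded in the example in 5.4 Adding New Characters; the identifier ydotacute for the corresponding char element is an assumption made for this illustration.

```xml
<!-- Local working copy: the PUA code point appears directly in the text.
     Not suitable for blind interchange. -->
<p>&#xE0A4;</p>

<!-- Interchange copy: the PUA code point is replaced by an empty g element
     pointing to the char declaration (xml:id assumed here to be "ydotacute"). -->
<p><g ref="#ydotacute"/></p>
```

Leaving the g element empty follows the recommendation above to remove the locally defined PUA character before interchange.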
This mechanism is rather weak in cases where DOM trees or parsed XML fragments are exchanged, which may increasingly be the case. The best an application can do here is to treat any occurrence of a PUA character only in the context of the local document and use the properties provided through the char element as a handle to the character in other contexts.
In the fullness of time, a character may become standardized, and thus assigned a specific code point outside the PUA. Documents which have been encoded using the mechanism must at the least ensure that this changed code point is recorded within the relevant char element; it will however normally be simpler to remove the char element and replace all occurrences of g elements which reference it by occurrences of the newly coded character.
5.6 Module Character and Glyph Documentation
The module described in this chapter makes available the following components:
- Module gaiji: Representation of non-standard characters and glyphs
The selection and combination of modules to form a TEI schema is described in 1.2 Defining a TEI Schema.