CE W 06: Representation of non-standard characters and glyphs
Contents
- Overview
- Markup constructs for representing non-standard characters
- Annotating characters
- Adding new characters
- How to use codepoints from the Private Use Area
Overview
Text encoders sometimes find that the repertoire of characters and glyphs available in published standards does not seem sufficient to encode their material adequately. These Guidelines provide a mechanism for dealing with such situations, which is outlined in this chapter.
If encoders encounter some graphical unit in a document they want to render electronically, the first question that needs to be asked is: ‘Is this a character?’ To determine whether a particular graphical unit is a character or not, see Terminology and key concepts.
- Check the Unicode website (www.unicode.org): first read the web page "Where is my Character?" (http://unicode.org/standard/where/), then consult the code charts. Alternatively, check the latest published version of The Unicode Standard, though the website is often more up to date than the printed version and should be preferred.
The pictures (‘glyphs’) in the Unicode code charts are only meant to be representative, not definitive. If a specific form of an already encoded character is required for a project, refer to the guidelines below under Annotating characters.
- Check the Proposed New Characters webpage (http://unicode.org/alloc/Pipeline.html) to see if the character is in line for approval.
- Ask on the Unicode email list to determine if a proposal is pending, or whether this is indeed a new character (or if this is not a character at all, in which case it would not be eligible for addition to the Unicode standard).
Since there are now close to 100000 characters in Unicode, chances are good that what you need is already there, but it may not be easy to find, since it may have a different name in Unicode. If a first search fails, try other sites, for example http://www.eki.ee/letter, which also allows searches based on scripts and languages.
An encoded character may be precomposed or it may be formed from base characters and combining diacritical marks. Either will suffice for a character to be "found" as an encoded character.
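The equivalence between the precomposed form and the base-plus-combining form can be checked mechanically. The following is a minimal sketch using Python's standard `unicodedata` module; the sample character (e with acute accent) is chosen purely for illustration and does not come from these Guidelines.

```python
import unicodedata

# "e" followed by a combining acute accent (two codepoints)
decomposed = "e\u0301"
# The same character as a single precomposed codepoint
precomposed = "\u00e9"

# NFC composes base + combining mark; NFD decomposes a precomposed character
composed = unicodedata.normalize("NFC", decomposed)
split = unicodedata.normalize("NFD", precomposed)

# Both spellings denote the same encoded character
same_character = composed == precomposed and split == decomposed

# The Unicode names of the parts of the combining sequence
names = [unicodedata.name(ch) for ch in decomposed]
```

If either spelling can be produced from encoded codepoints in this way, the character counts as "found" and no new declaration is needed.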
If this first question has been considered and no suitable form has been found in such a repertoire, the next question will be: ‘Does the graphical unit in question represent a variant form of a known character, or does it represent a completely unencoded character?’ If the character is determined to be missing from Unicode, it would be helpful to submit the new character for inclusion (see http://unicode.org/pending/proposals.html).
These guidelines will try to help you proceed once you have identified a given graphical unit as either a variant or an unencoded character. Determining this will require knowledge of the contents of the document that you have. The first case will be called annotation of a character, while the second case will be called adding of a new character. How to handle graphical units that represent variants will be discussed below under Annotating characters, while the problem of representing new characters will be dealt with under the section Adding new characters.
While there is some overlap between these requirements, separate, specialized markup constructs have been created for each of these cases, as explained in the section "Markup constructs for representing non-standard characters" below. The sections after that discuss how to apply them to the problems at hand: annotation in "Annotating characters" and creation in "Adding new characters".
Markup constructs for representing non-standard characters
The ‘TEI WSD-NG’ provides a mechanism to declare characters in addition to those that are available in the document character set [Note: In most cases, the document character set will be Unicode. XML does however also allow a document to be in a subset of Unicode. In these cases the extensions declared by the ‘TEI WSD-NG’ might in fact be characters of Unicode, but outside of the document's subset.]. Functionally, the ‘TEI WSD-NG’ is part of the TEI header, but for larger document collections it might be more convenient to maintain it separately and include it with the standard XML provisions.
The main function of the ‘TEI WSD-NG’ is to provide attributes for a character and optionally a handle to this character, if there is not already one. The list of attributes for characters is modelled on those in the Unicode Character Database, which distinguishes normative and informative character properties. Apart from that, additional attributes can be given. Since the list of properties will vary with different versions of The Unicode Standard, there might not be an exact correspondence with the list of properties defined in these Guidelines. If additional properties are required, they may be added under <addProp> .
The element <charDesc> contains a list of either <char> elements, each of which describes a character, or <glyph> elements, each of which provides a glyph and some additional information. Optionally, it can also hold a <desc> element with a general description and information pertaining to all characters or glyphs for which information is given.
The <char> element for adding new characters to the document character set
- <charName> (required) A name to identify the character. For characters of non-ideographic scripts, a name following the conventions for Unicode names should be chosen. For ideographic scripts, an Ideographic Description Sequence (IDS) as described in Chapter 10.1 of The Unicode Standard is recommended where possible. These recommendations are given in an attempt to make blind interchange as successful as possible. Projects working in the same or neighbouring fields are well advised to coordinate and publish their lists of <charName> values in order to make data exchange even more successful. If an entity reference is used, a corresponding <addProp> element should be used to record this.
- <normProp> (required) This is an empty element which takes a number of properties as its attribute values. More information about the normative character properties in Unicode can be found in The Unicode Standard, Version 3.0 (Addison-Wesley, p. 73, Table 4-1).
Attribute list
- ucs This gives the codepoint assigned to the character, if such an assignment is used.
- general-category The general category (described in The Unicode Standard 4.5) is an assignment to some major classes and subclasses of characters. The value of this property has to be selected from the list of predeclared values. The default value is "Lo", which means "Letter, other". Please make sure the appropriate value for this attribute is provided, for example "Ll" for a lowercase letter.
- canonical-combining-class This property exists for characters that are not used independently, but in combination with other characters. It records a class for these characters, which is used to determine how characters interact typographically. For more information, see The Unicode Standard 4.2.
- directional-category All Unicode characters possess a directional type, which governs the application of the algorithm for bi-directional behaviour. The default for this category as defined in these Guidelines is "L", which means "Left-to-Right".
- character-decomposition-mapping This is used to determine the relationship to other character(s). The Unicode Standard contains a list of tags used for this purpose.
- numeric-value The numeric value (in decimal notation) of a character that expresses any kind of numeric value.
- mirrored The mirrored character property is used to render characters such as U+0028, LEFT PARENTHESIS (named OPENING PARENTHESIS in Unicode 1.0) properly, independently of the text direction.
- <infProp> (optional) A set of additional, informative properties is given for Unicode characters in The Unicode Standard. If encoders want to provide such properties, they should go here and use the same naming conventions as in The Unicode Standard.
- <addProp> (optional) This element can hold a list of <prop> elements that give additional character properties. These properties do not parallel properties given in The Unicode Standard.
- <desc> (optional) A prose description of the character. The type attribute can be used to categorize these descriptions.
- <mapping> (optional, multiple occurrences possible) This element can contain one or more characters that have some kind of relationship to this character. The type of relationship is expressed with the type attribute on <mapping>. The <c> elements themselves can either point to another <char> or <glyph> element or contain a character that is intended to be the target of this mapping. This could be used, among other things, to point to lowercase or uppercase equivalents of this character.
Attribute list
- type (required) The type of mapping. The typology used can be further explained in a suitable section of the encoding description in the header.
- <glyphImg> (optional) This points to a place where a glyph image of this character can be found, or might even contain a glyph description inline, for example in SVG. Several <glyphImg> elements can be given, for example for glyph images of different resolutions or different types of image data.
Attribute list
- type (optional) The type attribute can be used to record the data type of the image, for example using MIME types as described in RFC 2046 of the Internet Engineering Task Force.
- <note> (optional) Any type of additional noteworthy information that would not be suitable to be contained in <desc> .
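As a rough illustration of how the pieces listed above fit together, the following sketch assembles a <char> declaration with Python's standard library. It assumes the element and attribute names described above; the value formats (for example "Y"/"N" for mirrored and a bare hexadecimal string for ucs), the character name and the helper names `norm_props` and `build_char` are hypothetical. Property values are borrowed from an existing "model" character via `unicodedata`, which is only one possible way of choosing them.

```python
import unicodedata
import xml.etree.ElementTree as ET


def norm_props(model_char):
    """Read normative properties of an existing character to reuse as defaults."""
    props = {
        "general-category": unicodedata.category(model_char),
        "canonical-combining-class": str(unicodedata.combining(model_char)),
        "directional-category": unicodedata.bidirectional(model_char),
        # "Y"/"N" is an assumed value format, not taken from the Guidelines
        "mirrored": "Y" if unicodedata.mirrored(model_char) else "N",
    }
    decomposition = unicodedata.decomposition(model_char)
    if decomposition:  # empty string when the character has no decomposition
        props["character-decomposition-mapping"] = decomposition
    numeric = unicodedata.numeric(model_char, None)
    if numeric is not None:  # only characters that express a number
        props["numeric-value"] = str(numeric)
    return props


def build_char(name, ucs, model_char):
    """Build a <char> element with the required <charName> and <normProp>."""
    char = ET.Element("char")
    ET.SubElement(char, "charName").text = name
    attrs = {"ucs": ucs}
    attrs.update(norm_props(model_char))
    ET.SubElement(char, "normProp", attrs)
    return char


# Hypothetical new lowercase letter, modelled on the properties of "a"
char = build_char("LATIN SMALL LETTER EXAMPLE", "E000", "a")
xml_text = ET.tostring(char, encoding="unicode")
```

Borrowing properties from a similar encoded character keeps the declaration consistent with the Unicode Character Database, but each value should still be reviewed by hand for the character actually being declared.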
The <glyph> element for specifying how a character appears in the document
- <glyphName> (optional) A name to identify the glyph. The name should follow the same conventions as the <charName> above.
- <addProp> (optional) This element can hold a list of <prop> elements, that give additional character properties. These properties do not parallel properties given in The Unicode Standard.
- <desc> (optional) A prose description of the character. The type attribute can be used to categorize these descriptions.
- <mapping> (optional, multiple occurrences possible) This element can contain one or more characters that have some kind of relationship to this character. The type of relationship is expressed with the type attribute on <mapping>. The <c> elements themselves can either point to another <char> or <glyph> element or contain a character that is intended to be the target of this mapping. This could be used, among other things, to point to lowercase or uppercase equivalents of this character.
Attribute list
- type (required) The type of mapping. If the character described in the current <glyph> element is a pre-modern form, this could for example be set to modern to indicate the equivalent modern character. The typology used can be further explained in a suitable section of the encoding description in the header.
- <glyphImg> (optional) This points to a place where a glyph image of this character can be found, or might even contain a glyph description inline, for example in SVG. Several <glyphImg> elements can be given, for example for glyph images of different resolutions or different types of image data.
Attribute list
- type (optional) The type attribute can be used to record the data type of the image, for example using MIME types as described in RFC 2046 of the Internet Engineering Task Force.
- <note> (optional) Any type of additional noteworthy information that would not be suitable to be contained in <desc> .
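The <mapping> element with type set to modern lends itself to simple lookups. The sketch below parses a small, hypothetical <charDesc> and collects the modern equivalents of each declared glyph. The id attribute, the glyph name, the description and the helper `mappings_of` are assumptions made for this illustration; only the element names and the type attribute are taken from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical declaration; the "id" attribute and the names are assumptions.
CHAR_DESC = """
<charDesc>
  <glyph id="r-variant">
    <glyphName>LATIN SMALL LETTER R VARIANT FORM</glyphName>
    <desc>A variant form of the letter r.</desc>
    <mapping type="modern"><c>r</c></mapping>
  </glyph>
</charDesc>
""".strip()


def mappings_of(glyph, mapping_type):
    """Return the content of every <c> inside <mapping> elements of one type."""
    return [
        c.text
        for mapping in glyph.findall("mapping")
        if mapping.get("type") == mapping_type
        for c in mapping.findall("c")
    ]


root = ET.fromstring(CHAR_DESC)

# Map each glyph identifier to its modern equivalent(s)
modern_by_id = {
    glyph.get("id"): mappings_of(glyph, "modern")
    for glyph in root.findall("glyph")
}
```

Because <mapping> may occur several times, the helper returns a list rather than a single character.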
The DTD fragment
Annotating characters
As can be seen in this example, the <glyph> element pointed to from the <c> element will be interpreted as an annotation on the content of the <c> element. It is thus possible to use this mechanism to indicate ligatures [Note: While technically this could be used to indicate abbreviations, within the framework of these Guidelines it is recommended practice to employ the <abbr> element, see .]. With this markup in place, it will be possible to write programs that analyze the distribution of the different forms of the letter "r" or produce ‘faithful’ renderings that use the original glyph, but also programs that produce normalized versions by simply ignoring the annotation pointed to by the <c> element. To make this kind of processing more efficient, the type attribute on <c> can be used, with an enumeration of the different types and their usage documented in the TEI header.
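The two kinds of processing mentioned here, normalization and distribution analysis, can be sketched in a few lines of Python. The sample sentence, the <p> wrapper and the name of the pointing attribute (ref) are hypothetical; these Guidelines do not fix them in this section.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical encoded text: one "r" is annotated as a variant glyph.
SAMPLE = '<p>fo<c ref="r-variant">r</c> the reader</p>'

root = ET.fromstring(SAMPLE)

# Normalized version: keep the content of <c>, ignore the annotation
normalized = "".join(root.itertext())

# Distribution of annotated forms of the letter "r", keyed by glyph reference
annotated = Counter(c.get("ref") for c in root.iter("c") if c.text == "r")

# Occurrences of "r" that carry no annotation
plain_r = normalized.count("r") - sum(annotated.values())
```

Here `normalized` is the plain reading text, while `annotated` and `plain_r` together give the distribution of the variant and unmarked forms.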
Since this mechanism employs markup objects to provide a link between a character in the document and some annotation on that character, it cannot be used in places where such markup constructs are not allowed, e.g. in attribute values.
Adding new characters
The creation of additional characters for use in text encoding is similar to annotating an existing character. The same element <c> is used to provide a link from the character instance in the text to the character definition in the document header (or elsewhere). The main difference is that the <c> element now points to a <char> element. Also, the content of this element could be empty. The element <c> could however also hold a codepoint from the Private Use Area (PUA) of The Unicode Standard, which is an area set aside for the very purpose of privately adding new characters to a document. Recommendations on how to assign such PUA characters are given in the following section.
Under certain circumstances, Han characters can be written within a circle. While this could be considered simply a facet of the rendering, it can also be considered a new, derived character, which will be in many ways similar to the original, non-circled character, but has a distinct rendering. The following example will provide the necessary markup to encode such an encircled character.
How to use codepoints from the Private Use Area
The developers of the Universal Character Set have set aside an area of the codespace for the private use of software vendors, user groups or individuals. As of this writing (Unicode 4.0), there are around 137000 codepoints available in this area, which should be enough for most needs. No codepoint assignments will be made to this area by standards bodies, and only some very basic default properties have been assigned (which will be overridden where necessary by the mechanism outlined in this chapter). Therefore, in contrast to all other codepoints of the UCS, PUA codepoints should not be used directly in documents intended for blind interchange. Instead of using PUA codepoints directly in the document content, entity references should be used. This will make it easier for receiving parties to find out which PUA characters are used in a document and where possible codepoint clashes with local use on the receiving side occur.
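A receiving party may want to check incoming text for directly used PUA codepoints. The following minimal sketch tests codepoints against the three Private Use ranges of the Unicode codespace and reports their positions; the function names `is_pua` and `find_pua` are hypothetical.

```python
# The three Private Use ranges of the Unicode codespace (inclusive bounds)
PUA_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area in the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A (plane 15)
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B (plane 16)
)


def is_pua(ch):
    """Return True if the character lies in one of the Private Use ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def find_pua(text):
    """Return (offset, 'U+XXXX') for every PUA character found in text."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if is_pua(ch)]


# Total number of private-use codepoints across the three ranges
total = sum(hi - lo + 1 for lo, hi in PUA_RANGES)

hits = find_pua("ab\ue000c")
```

The total of the three ranges matches the figure of around 137000 codepoints given above, and `hits` lists each PUA character with its offset so that clashes with local assignments can be reviewed.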
This mechanism is rather weak in cases where DOM trees or parsed XML fragments are exchanged, which might be increasingly the case. The best an application can do here is to treat any occurrence of a PUA character only in the context of the local document and use the properties provided through the <char> element as a handle to the character in other contexts.