Electronic Textual Editing: Writing Systems and Character Representation [Christian Wittern]
Contents
- Introduction
- Coded character sets and their encoding
- How to find Unicode characters?
- Visual content and information content
- Representing different stages in a writing system
Introduction
In a book printed on paper, the letters that make up the text are usually formed as black dots on a white surface. The specific layout selected for the page, the style, weight, size, and family of the font chosen, the kerning, and the special treatment of characters running together as ligatures aesthetically and practically form a unit, the creation of which has become an art form in its own right.
The same is true of handwritten manuscripts, which are even more idiosyncratic in their appearance and in their unity of form and appearance.
If such a text is to be brought into machine-readable electronic form, a process which as a whole has been called text encoding, this unity of form and appearance has to be broken up into several layers. In one of these layers, the letters of the text are represented by numbers assigned to them. The process of assigning numbers to letters or characters is called character encoding, and it does not always work as might be expected: upper-case and lower-case letter forms, for example, although representing the same letter, end up being assigned separate numbers. Another layer of the digitization process deals with the structure of the text, its division into words, sentences, paragraphs, sections, chapters, books, and so on. This is the domain of descriptive markup, but work on this layer has also been called text encoding, this time in a more specific sense. There is yet another layer, which captures the information needed to recreate the shapes of the letters from the numbers assigned in the first step. This is the layer of style encoding.
For all these layers there exists a competing and confusing variety of approaches. Some word processing applications seemingly combine them all. The first and third layers are often lumped together, since the characters encoded are not directly visible, at least not in a comprehensible form: they have to be represented by shapes, which can then easily be mistaken for being inherent in that layer. Layers two and three are also frequently not differentiated; in many typesetting systems, the style encoding is applied directly to the appropriate sections, without any separation between the logical structure of the text and the form of its presentation.
While text encoding in general has been discussed in many of the other chapters, style encoding, which belongs largely to the realm of text processing (that is, dealing with the encoded text to produce some useful result), is largely outside the scope of this book. This chapter deals with the most basic and lowest of these layers, the character encoding.
Coded character sets and their encoding
To represent characters in digital form, they have to be enumerated and mapped to numbers. While in the 1960s and 1970s some countries and big companies created encodings of their own (e.g. ASCII, EBCDIC, ISO 646, JIS), in the late 1980s it was realized that a universal character encoding was necessary to accommodate the needs of global communication and to enhance interoperability.
In two separate efforts, an ISO working group and an industry consortium (the Unicode Consortium) started working towards a universal character set. While their objectives differed at the outset, it was soon realized that having two competing universal character sets was not desirable, so attempts were made to merge the efforts. Although the two bodies still operate separately, the characters encoded by ISO 10646 and by Unicode have the same names and numeric values (code points) assigned to them. They strive to provide the ability to encode all characters used for all written languages in current use in the world, and more and more historical writing systems are being added. While the universal character set is still under development, both maintaining parties are committed to keeping the two standards synchronized. For simplicity's sake, I will discuss and mention only Unicode in the following, but the corresponding ISO 10646 is always implied. 1
Unicode defines and encodes abstract characters. These are identified by their names, for example LATIN CAPITAL LETTER J or DEVANAGARI LETTER DHA, and are given corresponding numeric values, U+0927 for the latter. This does not specify the visual representation of the character as it might appear on screen or paper, e.g. the exact shape of the glyph used, its size, weight, kerning, and so on. These visual aspects are completely outside the realm of character encoding.
Since Unicode had to maintain compatibility with existing national and vendor-specific character encodings, it started out as a superset of these earlier character sets. Any encoded entity that existed in them was also incorporated into Unicode, regardless of its conformance with the Unicode design principles. To give just one example of the kind of practical problems this creates: units of measurement are frequently expressed with ordinary letters, yet the Angstrom unit (Å) was assigned the separate Unicode value ANGSTROM SIGN (U+212B), although LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5) would have been equally suitable for this purpose. This is just one of several types of duplicate encodings in Unicode of which text encoders have to be aware. Implications of this fact, and recommendations for text encoding projects derived from it, will be discussed in a later section.
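This duplication can be observed directly with standard library support. As a minimal sketch, the following Python fragment uses the standard unicodedata module; the normalization operation applied in the last line is discussed in a later section:

    import unicodedata

    angstrom_sign = "\u212B"  # ANGSTROM SIGN, inherited from a legacy character set
    a_with_ring = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

    # Two distinct code points with distinct names ...
    print(unicodedata.name(angstrom_sign))  # ANGSTROM SIGN
    print(unicodedata.name(a_with_ring))    # LATIN CAPITAL LETTER A WITH RING ABOVE
    print(angstrom_sign == a_with_ring)     # False

    # ... which normalization (here form NFC) folds into a single character:
    print(unicodedata.normalize("NFC", angstrom_sign) == a_with_ring)  # True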
How to find Unicode characters?
Unicode characters are identified by their names; these names are in turn mapped to the numeric values used to encode them. The best strategy to find a character is therefore to search through the list of characters. As the examples of Unicode character names given so far will have shown, a name is usually derived by naming the components of a character, combining them if necessary in a systematic way. While the specific names for some of the diacritical marks may not be obvious, a look at the section where these are defined (U+0300 to U+0362) will quickly reveal how they are named in Unicode.
Not all characters, however, have individual names. As of version 3.2, which is current at the time of this writing, Unicode defines more than 94,000 characters. More than 70,000 of these are Han characters used for Chinese, Japanese, Korean, and old Vietnamese, and another 12,000 are precomposed Hangul forms; all of these are identified only by generic names (such as CJK UNIFIED IDEOGRAPH-4E00), which do not describe individual characters. There remains, however, a large number of characters that are identified by individual names. Such characters can be looked up in the character tables of TUS 3.0 or ISO 10646, but this tends to be rather cumbersome. Unicode provides an online version of its character database 2 . There is also an online query form provided by the Institute of the Estonian Language ( http://www.eki.ee/letter ), which allows more convenient searches.
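Besides such interactive tools, most programming environments expose the Unicode character database directly. As a minimal sketch, Python's standard unicodedata module maps between names and characters in both directions:

    import unicodedata

    # From character name to character and code point ...
    ch = unicodedata.lookup("DEVANAGARI LETTER DHA")
    print(ch, f"U+{ord(ch):04X}")  # ध U+0927

    # ... and from a character back to its name:
    print(unicodedata.name("\u00FC"))  # LATIN SMALL LETTER U WITH DIAERESIS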
Due to the history of Unicode, many characters have more than one possible expression in Unicode. Frequently used accented letters, for example, have been given separate Unicode values (TUS 3.0 calls these ‘precomposed characters’), although the accents and the base letters have also been encoded separately, so that these too could be used to create the same character. The character LATIN SMALL LETTER U WITH DIAERESIS (U+00FC ü) could also be expressed as a sequence of LATIN SMALL LETTER U (U+0075 u) and COMBINING DIAERESIS (U+0308). We will return to this problem in a moment.
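The two expressions are genuinely different code point sequences, even though they should render identically; a short Python sketch makes this visible:

    precomposed = "\u00FC"   # LATIN SMALL LETTER U WITH DIAERESIS
    combining = "u\u0308"    # LATIN SMALL LETTER U + COMBINING DIAERESIS

    print(precomposed, combining)            # both display as ü
    print(precomposed == combining)          # False: the sequences differ
    print(len(precomposed), len(combining))  # 1 2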
Encoding forms of Unicode
In order to understand how Unicode is encoded and stored in computer files, a short excursion into some of the technical details cannot be avoided. This section is especially intended for encoders who run into trouble with the default mechanisms of their favorite software platform, which are usually designed to hide these details.
Unicode allows the encoding of about one million characters. This is the theoretical upper limit; at present, less than 10% of this code space is actually used. The code space is arranged in 17 ‘planes’ of 65,536 code points each (17 × 65,536 = 1,114,112 code points in total), with Plane 0, the ‘Basic Multilingual Plane’ (BMP), being the one where most characters are defined. 3
In order to store the numeric values of the code points in a computer, they have to be serialized. Unicode defines two encoding forms for this serialization, UTF-8 and UTF-16 (a third form, UTF-32, which stores every code point as a 32-bit integer, is rarely used for interchange).
UTF-16 simply stores the numerical value of a code point as a 16-bit integer, while characters with higher numerical values (beyond the BMP) are expressed using two UTF-16 values from a range of the BMP set aside for this purpose, the so-called ‘surrogate pairs’. Since most computers store and retrieve numeric values in bundles of 8 bits (‘bytes’), the 16 bits of one UTF-16 value have to be stored in two separate bytes. Conventions as to whether the byte with the higher value comes first (‘big-endian’) or last (‘little-endian’) differ in the same way and for the same reasons as the egg-opening customs in Gulliver's Travels; for that reason, there are two storage forms of UTF-16: UTF-16BE and UTF-16LE. If UTF-16 is used without any further specification, big-endian order is usually assumed, while Microsoft Windows platforms, for example, use the little-endian form.
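Both points, the byte-order difference and the surrogate pairs, can be illustrated with a small Python sketch (GOTHIC LETTER AHSA, U+10330, serves here merely as an arbitrary example of a character beyond the BMP):

    # A BMP character occupies one 16-bit unit; only the byte order differs:
    print("A".encode("utf-16-be").hex())  # 0041
    print("A".encode("utf-16-le").hex())  # 4100

    # A character beyond the BMP becomes a surrogate pair of two 16-bit units:
    print("\U00010330".encode("utf-16-be").hex())  # d800df30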
UTF-8 avoids the whole issue of endianness by serializing the numbers in chunks of single bytes. To achieve this, it uses sequences of one or more bytes to encode a Unicode numeric value. The length of such a sequence depends on the value of the Unicode character; values less than 128 (the range of the ASCII or ISO 646 characters) are just one byte in length, which means they are identical to ASCII. English text, and also the tags used for markup, therefore do not differ between UTF-8 and ASCII; this is one of the reasons why UTF-8 is rather popular. It is also the default encoding for XML files in the absence of a specific encoding declaration, and the recommended encoding to use. In UTF-8, most accented characters require a sequence of two bytes, East Asian characters need three, and characters beyond the BMP need four.
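The variable sequence lengths can again be checked with a few lines of Python; the characters used below are arbitrary examples of each length class:

    for ch in ("A", "\u00FC", "\u6F22", "\U00010330"):
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
    # U+0041  -> 1 byte(s): 41        (ASCII range)
    # U+00FC  -> 2 byte(s): c3bc      (accented letter)
    # U+6F22  -> 3 byte(s): e6bca2    (East Asian character)
    # U+10330 -> 4 byte(s): f0908cb0  (beyond the BMP)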
In most cases, there is no need to worry about the specific encoding form, except to make sure that the encoding declaration, which is optionally included in the first line of an XML file in the form <?xml version="1.0" encoding="utf-8"?>, does indeed faithfully reflect the actual encoding. Problems do occasionally arise with UTF-16 files read with the wrong endianness. The TEI-Emacs bundle, which is distributed on the CD-ROM accompanying this book, makes every attempt to act according to the encoding declaration. If this fails because of a mismatch between the encoding used and the encoding declared, it is still possible to force Emacs into opening a file with the desired encoding and to correct the mismatch. 4
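What such an endianness mismatch looks like can be reproduced deliberately; in this small Python sketch, little-endian data read under a big-endian assumption turns into unrelated characters:

    data = "<p>ü</p>".encode("utf-16-le")

    print(data.decode("utf-16-le"))  # correct: <p>ü</p>
    print(data.decode("utf-16-be"))  # wrong endianness: unreadable garbage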
Visual content and information content
- Multiple representations exist for the same character
- Similar, but semantically different characters exist
- Visually different characters need to be encoded as identical abstract forms
- Appearance and characters have to be separately encoded
Multiple representations of characters
As briefly mentioned above, some Unicode characters have multiple representations. It is absolutely necessary that a text encoding project (1) decides which of these different representations to use, (2) documents this decision in the project's encoders' handbook and in the <encodingDesc> section of the <teiHeader>, and (3) applies it consistently in the encoding process. The Unicode Standard Annex #15 Unicode Normalization Forms 5 explains the problem in greater detail and gives some recommendations. In many cases, it is most convenient to use the shortest possible sequence of Unicode characters (‘NFC’ in the notation of the Unicode document). This will use precomposed accented characters where they exist, and combining sequences in other cases. 6
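In practice, normalization should not be done by hand but delegated to a library. As a minimal sketch, Python's unicodedata module implements the normalization forms of the Annex:

    import unicodedata

    decomposed = "u\u0308ber"  # combining sequence for 'über'

    # NFC produces the precomposed character where one exists:
    print(unicodedata.normalize("NFC", decomposed) == "\u00FCber")  # True

    # NFD goes the other way, fully decomposing the precomposed form:
    print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "\u00FC")])
    # ['U+0075', 'U+0308']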
Similar, but semantically different characters
Sometimes it is difficult to decide which character to encode simply by looking at its shape. A ‘dash’ character, for example, might look identical to a ‘hyphen’ character as well as to a ‘minus’ sign. The decision as to which one to use needs to be based on the function of the character in the text and on the semantics of the encoded character. In Unicode there are, for example, a HYPHEN-MINUS (U+002D), a SOFT HYPHEN (U+00AD), a NON-BREAKING HYPHEN (U+2011), and of course the HYPHEN (U+2010), not to mention the subscript and superscript variants (U+208B and U+207B). There are also compatibility forms, SMALL HYPHEN-MINUS (U+FE63) and FULLWIDTH HYPHEN-MINUS (U+FF0D), but these should never be considered for newly encoded texts, since they exist only for the sake of round-trip conversion with legacy encodings. The ‘hyphen’ character is sometimes lumped together with a ‘minus’ character; this is basically a legacy of ASCII which has been carried over into Unicode, but there now also exists a MINUS SIGN (U+2212), plus some compatibility forms. As for the ‘dash’ character, Unicode provides four candidates at consecutive code points: FIGURE DASH (U+2012), EN DASH (U+2013), EM DASH (U+2014), and HORIZONTAL BAR (U+2015). The last one might be difficult to find by just looking at the character name, but as its old name ‘QUOTATION DASH’ reveals, it too is a dash character. TUS 3.0 has a note on this character explaining ‘long dash introducing quoted text’, while the note for U+2014 says ‘may be used in pairs to offset parenthetical text’.
While not every case is as complicated as this one, it should be obvious that any decision should be made with all possible candidates in view. To complicate this specific example further: if a text uses a dash in a way that fits the description of U+2015, it has to be decided whether to encode the quotation with appropriate markup ( <q> or <quote> come to mind), recording the fact that a dash was used to set the quotation off in the ‘rend’ attribute, or simply to retain the character in the encoded text.
Visually different forms of identical abstract characters
This issue is most important in scripts that use contextual shaping, like Arabic or the Indic scripts, but there is also one such case in Greek: the character GREEK SMALL LETTER SIGMA (U+03C3) takes different shapes depending on whether it occurs within a word or at the end of a word. Unicode also defines GREEK SMALL LETTER FINAL SIGMA (U+03C2), which is in violation of the principle of encoding only abstract characters, but had to be introduced to maintain compatibility with existing encoding forms for Greek. The encoder should be careful to use the standard characters, not the presentation forms (even where these exist for compatibility reasons, as in the case of Arabic). In the case of Greek, a project has to decide, and document accordingly, how to encode the sigma character.
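The two encoded sigma forms can be inspected programmatically; the last line of this sketch shows the context-sensitive lowercasing behaviour of recent Python versions, which produces the final form automatically at the end of a word:

    import unicodedata

    print(unicodedata.name("\u03C3"))  # GREEK SMALL LETTER SIGMA
    print(unicodedata.name("\u03C2"))  # GREEK SMALL LETTER FINAL SIGMA

    # Recent Python versions apply the Unicode Final_Sigma casing rule:
    print("ΟΔΥΣΣΕΥΣ".lower())  # οδυσσευς, ending in U+03C2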
Separating appearance and encoding of characters
In some cases, visual and informational aspects of characters have been lumped together and encoded as separate characters in legacy encodings. This is the case, for example, with subscript or superscript characters, ligatures, characters in small capitals, fractions, and so on. In all these cases, the special formatting, if necessary, should be achieved by suitable values of the ‘rend’ attribute, rather than by using one of the characters of these categories that happened to make it into Unicode. To make this unmistakably clear: using a separate character like SUBSCRIPT TWO (U+2082) ₂ will obscure the fact that it is a digit with the value two and will require special processing, for example when preparing such a text for indexing and search programs, while H<seg rend="sub">2</seg>O encodes this information independently of its desired rendering. Obviously, if more sophisticated formulae are required, markup vocabularies like MathML would be better candidates. A discussion of this problem can be found in the Guidelines Section 22.2.
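Both the loss of information and the compatibility mapping that recovers the plain digit can be demonstrated; a minimal Python sketch using unicodedata:

    import unicodedata

    # The character database still records the numeric value ...
    print(unicodedata.digit("\u2082"))  # 2 (SUBSCRIPT TWO)

    # ... and compatibility normalization (NFKC) flattens the subscript,
    # discarding the formatting information for good:
    print(unicodedata.normalize("NFKC", "H\u2082O"))  # H2O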
Representing different stages in a writing system
In many scholarly editions, the requirement is not only to produce a text as faithful to the original as possible, but also to produce a derived version that uses the modern conventions of the writing system, to appeal to contemporary readers. In some cases, the differences involve mere variations in orthography above the level of characters, but in other cases there are shifts in the characters used to represent a word. To give just one example from the writing system of English: the usage of ‘i’ and ‘j’, as well as of ‘u’ and ‘v’, shifted in printed books from the early seventeenth century onwards, so that before this shift, e.g., ‘ivory’ was written ‘iuory’. As well as specific markup constructs for handling such cases 7 , the TEI Guidelines describe in chapter 25 a general-purpose mechanism for the definition and use of variant glyphs and characters, which is intended to make it easier and more convenient to encode both an original and a modernized version of a text. 8 It should be noted, however, that this covers only variation in the usage of characters and glyphs, rather than orthographic variation in general.
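To make the difference between the original and the modernized version concrete, the following deliberately naive Python sketch regularizes such spellings with a word-level lookup table; all entries are hypothetical illustrations, and a real project would rather record both forms in markup (for example with the TEI elements <orig> and <reg>) than rewrite the text:

    # Hypothetical word-level regularization table for early modern spellings.
    REGULARIZED = {
        "iuory": "ivory",
        "vnto": "unto",
        "loue": "love",
    }

    def regularize(word: str) -> str:
        """Return the modern spelling if one is recorded, else the word unchanged."""
        return REGULARIZED.get(word.lower(), word)

    print(regularize("iuory"))  # ivory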