vi. Languages and Character Sets

Table of contents

The documents which users of these Guidelines may wish to encode encompass all kinds of material, potentially expressed in the full range of written and spoken human languages, including the extinct, the non-existent, and the conjectural. Because of this wide scope, special attention has been paid to two particular aspects of the representation of linguistic information often taken for granted: language identification and character encoding.

Even within a single document, material in many different languages may be encountered. Human culture, and the texts which embody it, is intrinsically multilingual, and shows no sign of ceasing to be so. Traditional philologists and modern computational linguists alike work in a polyglot world, in which code-switching (in the linguistic sense) and accurate representation of differing language systems constitute the norm, not the exception. The current increased interest in studies of linguistic diversity, most notably in the recording and documentation of endangered languages, is one aspect of this long standing tradition. Because of their historical importance, the needs of endangered and even extinct languages must be taken into account when formulating Guidelines and recommendations such as these.

Beyond the sheer number and diversity of human languages, it should be remembered that in their written forms they may deploy a huge variety of scripts or writing systems. These scripts are in turn composed of smaller units, which for simplicity we term here characters. A primary goal when encoding a text should be to capture enough information for subsequent users to correctly identify not only the constituent characters, but also the language and script. In this chapter we address this requirement, and propose recommended mechanisms to indicate the languages, scripts and characters used in a document or a part thereof.

Identification of language is dealt with in vi.1. Language Identification. In summary, it recommends the use of pre-defined identifiers for a language where these are available, as they increasingly are, in part as a result of the twin pressures of an increasing demand for language-specific software and an increased interest in language documentation. Where such identifiers are not available or not standardized, these Guidelines recommend a method for documenting language identifiers and their significance, in the same way as other metadata is documented in the TEI header.

Standardization of the means available to represent characters and scripts has moved on considerably since the publication of the first version of these Guidelines. At that time, it was essential to explicitly document the characters and encoded character sets used by almost any digital resource if it was to have any chance of being usable across different computer platforms or environments, but this is no longer the case. With the availability of the Unicode standard, more than 128,000 different characters representing almost all of the world's current writing systems are available and usable in any XML processing environment without formality. Nevertheless, however large the number of standardized characters, there will always be a need to encode documents which use non-standard characters and glyphs, particularly but not exclusively in historical material. The second part of this chapter discusses in some detail the concepts and practice underlying this standard, and also introduces the methods available for extending beyond it, which are more fully discussed in 5 Characters, Glyphs, and Writing Modes.

TEI: Language Identification¶vi.1. Language Identification

Identification of the language a document or part thereof is written in is a crucial requirement for many envisioned usages of an electronic document. The TEI therefore accommodates this need in the following way:

A global attribute xml:lang is defined for all TEI elements. Its value identifies the language and writing system used.
The TEI header has a section set aside for the information about the languages used in a document: see further 2.4.2 Language Usage.

The value of the attribute xml:lang identifies the language (and, optionally, script) using a coded value. For maximal compatibility with existing processes, the identifier for the language must be constructed as in Best Current Practice 47²⁶. This same identifier has to be used to identify the corresponding language element in the TEI header, if one is present.

The first part of BCP 47 is called Tags for Identifying Languages, and proposes the following mechanism for constructing an identifier (tag) for languages as administered by the Internet Assigned Numbers Authority (IANA). The tag is assembled from a sequence of subtags separated by the hyphen (-, U+002D) character. It gives the language (possibly further identified with a sublanguage), a script, and a region for the language, each possibly followed by a variant subtag.

The authoritative list of registered subtags is maintained by IANA and is available at http://www.iana.org/assignments/language-subtag-registry. For a good general overview of the construction of language tags, see http://www.w3.org/International/articles/language-tags/, and for a practical step-by-step guide, see https://www.w3.org/International/questions/qa-choosing-language-tags.en.php.

In addition to the list of registered subtags, BCP 47 provides extensions that can be employed by private convention. The constructs provided can thus be used to generate identifiers for any language, past and present, in any usage in any area of the world. If such private extensions are used within the context of the TEI, they should be documented within the language element of the TEI header, which might also provide a prose description of the language described by the language tag.

While language, region, and script can be adequately identified using this mechanism, there is only very rough provision to express a dimension of time for the language of a document; those codes provided (e.g. grc for ‘Greek, Ancient (to 1453)’) might not reflect the segments appropriate for a text at hand. Text encoders might express the time window of the language used in the document by means of the extension mechanism defined in BCP 47 and relate that to a date element in the corresponding language section of the TEI header.

Equivalences to language identifiers by other authorities can be given in the language section as well, but no formal mechanism for doing so has been defined.

The scope of the language identification extends to the whole subtree of the document anchored at the element that carries the xml:lang attribute, including all elements and those attributes, if any, where a language might apply.²⁷

TEI: Characters and Character Sets¶vi.2. Characters and Character Sets

All document encoding has to do with representing one thing by another in an agreed and systematic way. Applied to the smallest distinctive units in any given writing system, which for the moment we may loosely call ‘characters’, such representation raises surprisingly complex and troublesome issues. The reasons are partly historical and partly to do with conceptual unclarities about what is involved in identifying, encoding, processing and rendering the characters of a natural language.

TEI: Historical Considerations¶Historical Considerations

When the first methods of representing text for storage or transmission by machines were devised, long before the development of computers, the overriding aim was to identify the smallest set of symbols needed to convey the essential semantic content, and to encode that symbol set in the most economical way that the storage or transmission media allowed. The initial outcome were systems that encoded only such content as could be expressed in uppercase letters in the Latin script, plus a few punctuation marks and some ‘control characters’ needed to regulate the storage and transmission devices. Such encodings, originally developed for telegraphy, strongly influenced the way the pioneers of computing conceived of and implemented the handling of text, with consequences that are with us still.

For many years after the invention of computers, the way they represented text continued to be constrained by the imperative to use expensive resources with maximal efficiency. Even when storage and processing costs began their dramatic fall, the Anglo-centric outlook of most hardware designers and software engineers hampered initiatives to devise a more generous and flexible model for text representation. The wish to retain compatibility with ‘legacy’ data was an additional disincentive. Eventually, tension in East Asia between commitment to technological progress and the inability of existing computers to cope with local writing systems led to decisive developments. Japanese, Korean, and Chinese standards bodies, who long before the advent of computers had been engaged in the specification of character sets, joined with computer manufacturers and software houses to devise ways of mapping those character sets to numeric encodings and processing the resulting text data.

Unfortunately, in the early years there was little or no co-ordination among either the national standards bodies or the manufacturers concerned, so that although commercial necessity dictated that these various local standards were all compatible with the representation of US-American English, they were not straightforwardly compatible with one another. Even within Japan itself there emerged a number of mutually incompatible systems, thanks to a mixture of commercial rivalry, disagreements about how best to manage certain intractable problems, and the fact that such pioneering work inevitably involved some false starts, leading to incompatibilities even between successive products of the same bodies. Roughly at the same time, and for similar reasons, multiple and incompatible ways of representing languages that use Cyrillic scripts were devised, along with methods of encoding ancient writing systems which inevitably could not aim for compatibility with other writing systems apart from basic Latin script. Many of the earliest projects that fed into the TEI were shaped in this developmental phase of the computerized representation of texts, and it was also the context in which SGML was devised and finalized.

SGML had of necessity to offer ways of coping with multiple writing systems in multiple representations; or rather, it provided a framework within which SGML-compliant applications capable of handling such multiple representations might be developed by those with sufficient financial and personnel resources (such as are seldom found in academia). Earlier editions of these Guidelines offered advice on character set and writing system issues addressed to the condition of those for whom SGML was the only feasible option. That advice is here substantially altered because of two closely-related developments: the availability of the ISO/Unicode character set as an international standard, and the emergence of XML and related technologies which are committed to the theory and practice of character representation which Unicode embodies.

TEI: Terminology and Key Concepts¶Terminology and Key Concepts

Before the significance of Unicode and the implications of the association between XML and Unicode can be adequately explained, it is necessary to clarify some key concepts and attempt to establish an adequately precise terminology for them.

Figure 16.. Examples of the latin a, in both lower and upper case, rendered with different fonts.

The word ‘character’ will not of itself take us very far towards greater terminological precision. It tends to be used to refer indiscriminately both to the visible symbol on a page and to the letter or ideograph which that symbol represents, two things that it is essential to keep conceptually distinct. The visible symbol obviously has some aspects by which we interpret it as representing one character rather than another; but its appearance may also be significantly determined by features that have no effect on our notion of which character in a writing system it represents. A familiar instance is the lowercase a, which in printed texts may be represented either by a ‘single storey’ symbol (cf. figure 1 in the examples from URW Gothic L on the bottom row) or by a ‘two storey’ version (as in figure 1 in the examples from Umpush, or URW Bookman L Demi Bold). We say that the single and double-storey symbols both represent one and the same the same abstract character a using two different glyphs. Similarly, an uppercase A in a serif typeface has additional strokes that are absent from the same letter when printed using a sans-serif typeface, so that once again we have differing glyphs standing for the same abstract character. The distinction between abstract characters and glyphs is fundamental to all machine processing of documents.

In most scholarly encoding projects, the accurate recording of the abstract characters which make up the text is of prime importance, because it is the essential prerequisite of digitizing and processing the document without semantic loss. In many cases (though there are important exceptions, to be touched on shortly) it may not be necessary to encode the specific glyphs used to render those abstract characters in the original document. An encoding that faithfully registers the abstract characters of a document allows us to search and analyse our document's content, language, and structure, and to access its full semantics. That same encoding, however, may not contain sufficient information to allow an exact visual representation of the glyphs in the source text or manuscript to be recreated.

The importance of this distinction between information content and its visual representation is not always immediately apparent to people unused to the specific complexities of text handling by machine. Such users tend to ask first what (in order of conceptual priority) should actually be their very last question: how do I get a physical image that looks like character x in my source document to appear on to the screen or the output page? Their first question should in fact be: how can I get an abstract representation of character x into my encoded document in a way that will be universally and unambiguously identifiable, no matter what it happens to look like in printout or on any particular display? And occasionally the response they receive as a result of their misguided initial question is a custom ‘solution’ that satisfies their immediate rendering wishes at the price of making their underlying document unintelligible to other users (or even to the original user in other times and places) because it encodes the abstract character in an idiosyncratic way.

That said, there will certainly be documents or projects where it is a matter of scholarly significance that the compositor or scribe chose to represent a given abstract character using one particular glyph or set of strokes rather than a semantically-equivalent but visually distinct alternative, and in that case the specific appearance of the form will have to be encoded in one way or another. But that encoding need not (and in most cases will not) involve a notation that visually resembles the original, any more than italicized text in an original document will be represented by the use of italic characters in the encoded version.

A collection of the abstract characters needed to represent documents in a given writing system is known as a character set, and the character set or character repertoire of a processing or rendering device is the set of abstract characters that it is equipped to recognize and handle appropriately. There is, however, a subtle distinction between these two parallel uses of the same term, involving one more key concept which it is essential to grasp. The character set of a document (or the writing system in which it is recorded) is purely a collection of abstract characters. But the character set of a computing device is a set of abstract characters which have been mapped in a well-defined way to a set of numbers or code points by which the device represents those abstract characters internally. It can therefore be referred to as a coded character set, meaning a set of abstract characters each of which has been assigned a numerical code point (or in some instances a sequence of code points) which unambiguously identifies the character concerned.

It is now possible to use this terminology to say what Unicode is: it is a coded character set, devised and actively maintained by an international public body, where each abstract character is identified by a unique name and assigned a distinctive code point.²⁸ Unicode is distinguished from other coded character sets by its (current and potential) size and scope; its built-in provision for (in practical terms) limitless expansion; the range and quality of linguistic and computational expertise on which it draws; the stability, authority, and accessibility it derives from its status as an international public standard; and, perhaps most importantly, the fact that today it is implemented by almost every provider of hardware and software platforms worldwide.

TEI: Abstract Characters, Glyphs, and Encoding Scheme Design¶Abstract Characters, Glyphs, and Encoding Scheme Design

The distinction between abstract characters and glyphs can be crucial when devising an encoding scheme. When performing searches, text retrieval, or creating concordances, users of electronic text will expect the system to recognize and treat different glyphs as instances of the same character; but when perusing the text itself they may well expect to see glyph variants preserved and rendered. When encoding a pre-existing text, the encoder should determine whether a particular letter or symbol is a character or a glyphic variant. The Unicode Consortium and an ISO work group (ISO/IEC JTC1 SC2/WG2) have developed a detailed model of the relationship between characters and glyphs. This model, presented in Unicode Technical Report 17: Character Encoding Model, is the underpinning of much standards work since, including the current chapter.

The model makes explicit the distinction between two different properties of the components of written language:

their content, i.e. its meaning and phonetic value (represented by characters)
their graphical appearance (represented by glyphs).

When searching for information, a system generally operates on the content aspects of characters, with little or no attention to their appearance. A layout or formatting process, on the other hand, must of necessity be concerned with the exact appearance of characters. Of course, some operations (hyphenation for example) require attention to both kinds of feature, but in general the kind of text encoding described in these Guidelines tends to focus on content rather than appearance (see further 3.3 Highlighting and Quotation).

An encoder wishing to record information about which glyphs are present in a given document may do so at either or both of two levels:

the level of character encoding, using an appropriate Unicode code point to represent the glyph concerned
the markup level, with the glyph indicated via appropriate elements or attributes

The encoding practice adopted may be guided by, among other things, an assessment of the most frequent uses to which the encoded text will be put. For example, if recognition of identical characters represented by a variety of glyphs is the main priority, it may be advisable to represent the glyph variations at markup level, so that the character value can be immediately exposed to the indexing and retrieval software. Plainly, an encoding project will need to consider such issues carefully and document the outcome of their deliberations in their TEI customization file (or other local encoding documentation) to ensure encoding consistency. Using Unicode code points to represent glyph information requires that such choices be documented in the TEI header. Such documentation cannot of itself guarantee proper display of the desired glyph but at least makes the intention of the encoder discoverable.

At present the Unicode Standard does not offer detailed specifications for the encoding of glyph variations. These Guidelines do give some recommendations; some discussion of related matters is given in 11 Representation of Primary Sources, and 5 Characters, Glyphs, and Writing Modes offers some features for the definition of variant glyphs.

TEI: Entry of Characters¶Entry of Characters

The entry of characters was much more complicated before the near-universal adoption of Unicode, for which there are Input Method Editors (IMEs) available in most languages and fonts that provide glyphs for the full range of the Unicode specification. In those rare situations where there is difficulty entering the specific character you want, or some problem representing it on the system you are working in, Numeric Character References (NCRs) should be used. These take the general form &#D; where D is an integer representing the code point of the character in base 10, or &#xH;, where H is the code point in hexadecimal notation. Every XML processor is capable of recognising NCRs and replacing them with the required code point value without needing access to any additional data. The disadvantage of NCRs as a means of entering, representing and proofing character data is that most human beings find them anything but ‘readable’ and it is all too easy for the wrong character to be entered in error and retained undetected. Where characters are not defined in Unicode, these Guidelines provide advice on the strategies available for handling their representation in Chapter 25 Representation of non-standard Characters and Glyphs.

TEI: Output of Characters¶Output of Characters

The rendering of the encoded text is a complicated process that depends largely on the purpose, external requirements, local equipment and so forth, it is thus outside the scope of coverage for these Guidelines.

It might however nevertheless be helpful to put some of the terminology used for the rendering process in the context of the discussion of this chapter. As was mentioned above, Unicode encodes abstract characters, not specific glyphs. For any process that makes characters visible, however, concrete, specifically designed glyph shapes have to be used. For a printing process, for example, these shapes describe exactly at which point ink has to be put on the paper and which areas have to be left blank. If we want to print a character from the Latin script, besides the selection of the overall glyph shape, this process also requires that a specific weight of the font has been selected, a specific size and to what degree the shape should be slanted. Beyond individual characters, the overall typesetting process also follows specific rules of how to calculate the distance between characters, how much whitespace occurs between words, at which points line breaks might occur and so forth.

If we concern ourselves only with the rendering process of the characters themselves, leaving out all these other parameters, we will realize that of all the information required for this process, only a small amount will be drawn from the encoded text itself. This information is the code point used to encode the character in the document. With this information, the font selected for printing will be queried to provide a glyph shape for this character. Some modern font formats (e.g. OpenType) implement a sophisticated mapping from a code point to the glyph selected, which might take into account surrounding characters (to create ligatures where necessary) and the language or even area this character is printed for to accommodate different typesetting traditions and differences in the usage of glyphs.

A TEI document might provide some of the information that is required for this process, for example by identifying the linguistic context with the xml:lang attribute. The selection of fonts and sizes is usually done in a stylesheet, while the actual layout of a page is determined by the typesetting system used. Similarly, if a document is rendered for publication on the Web, information of this kind can be shipped with the document in a stylesheet.²⁹

TEI: Unicode and XML¶Unicode and XML

XML was designed with Unicode in mind as its means of representing abstract characters. It is possible to use other character encoding schemes, but in general they are best avoided, as you run the risk of encountering compatibility issues with different XML processors, as well as potential difficulties with rendering their output. We recommend using the UTF-8 encoding, which for the Basic Latin range is identical to ASCII, and which uses a variable-length set of bytes to represent characters. It should be noted that it is not sufficient simply to declare in the XML Declaration that a document is in UTF-8 format. Doing so merely means that processors will treat the content therein as if it were UTF-8, and may fail to process the document if it is not. For further discussion of UTF-8, see the section below on Issues Arising from the Internal Representations of Unicode.

TEI: Special Aspects of Unicode Character Definitions¶Special Aspects of Unicode Character Definitions

TEI: Compatibility Characters¶Compatibility Characters

The principles of Unicode are judiciously tempered with pragmatism. This means, among other things, that the actual repertoire of characters which the standard encodes, especially those parts dating from its earlier days, include a number of items which on a strict interpretation of the Unicode Consortium's theoretical approach should not have been regarded as abstract characters in their own right. Some of these characters are grouped together into a code-point regions assigned to compatibility characters. Ligatures are a case in point. Ligatures (e.g. the joining of adjacent lowercase letters ‘s’ and ‘t’ or ‘f’ and ‘i’ in Latin scripts, whether produced by a scribal practice of not lifting the pen between strokes or dictated by the aesthetics of a type design) are representational features with no added semantic value beyond that of the two letters they unite (though for historians of typography their presence and form in a given edition may be of scholarly significance). However, by the time the Unicode standard was first being debated, it had become common practice to include single glyphs representing the more common ligatures in the repertoires of some typesetting devices and high-end printers, and for the coded character sets built into those devices to use a single code point for such glyphs, even though they represent two distinct abstract characters. So as to increase the acceptance of Unicode among the makers and users of such devices, it was agreed that some such pseudo-characters should be incorporated into the standard as compatibility characters. Nevertheless, if a project requires the presence of such ligatured forms to be encoded, this should normally be done via markup, not by the use of a compatibility character. That way, the presence of the ligature can still be identified (and, if desired, rendered visually) where appropriate, but indexing and retrieval software will treat the code points in the document as a simple sequential occurrence of the two constituent characters concerned and so correctly align their semantics with non-ligatured equivalents. Such ligatures should not be confused with digraphs (usually) indicating diphthongs, as in the French word "cœur". A digraph is an atomic orthographic unit representing an abstract character in its own right, not purely an amalgamation of glyphs, and indexing and retrieval software will need to treat it as such. Where a digraph occurs in a source text, it should normally be encoded using the appropriate code point for the single abstract character which it represents.

TEI: Precomposed and Combining Characters and Normalization¶Precomposed and Combining Characters and Normalization

The treatment of characters with diacritical marks within Unicode shows a similar combination of rigour and pragmatism. It is obvious enough that it would be feasible to represent many characters with diacritical marks in Latin and some other scripts by a sequence of code points, where one code point designated the base character and the remainder represented one or more diacritical marks that were to be combined with the base character to produce an appropriate glyphic rendering of the abstract character concerned. From its earliest phase, the Unicode Consortium espoused this view in theory but was prepared in practice to compromise by assigning single code points to precomposed characters which were already commonly assigned a single distinctive code point in existing encoding schemes. This means, however, that for quite a large number of commonly-occurring abstract characters, Unicode has two different, but logically and semantically equivalent encodings: a precomposed single code point, and a code point sequence of a base character plus one or more combining diacritics. Scripts more recently added to Unicode no longer exhibit this code-point duplication (in current practice no new precomposed characters are defined where the use of combining characters is possible) but this does nothing to remove the problem caused by the duplications from older character sets that have been permanently embodied in Unicode. Together with essentially analogous issues arising from the encoding of certain East Asian ideographs. This duplication gives rise to the need to practice normalization of Unicode documents. Normalization is the process of ensuring that a given abstract character is represented in one way only in a given Unicode document or document collection. The Unicode Consortium provides four standard normalization forms, of which the Normalization Form C (NFC) seems to be most appropriate for text encoding projects. The NFC, as far as possible, defines conversions for all base characters followed by one or more combining characters into the corresponding precomposed characters. The World Wide Web Consortium has produced a document entitled Character Model for the World Wide Web 1.0³⁰, which among other things discusses normalization issues and outlines some relevant principles. An authoritative reference is Unicode Standard Annex #15 Unicode Normalization Forms³¹.

It is important that every Unicode-based project should agree on, consistently implement, and fully document a comprehensive and coherent normalization practice. As well as ensuring data integrity within a given project, a consistently implemented and properly documented normalization policy is essential for successful document interchange. While different input methods may themselves differ in what normalization form they use, any programming language that implements Unicode will provide mechanisms for converting between normalization forms, so it is easy in practice to ensure that all documents in a project are in a consistent form, even if different methods are used to enter data.

TEI: Character Semantics¶Character Semantics

In addition to the Universal Character Set itself, the Unicode Consortium maintains a database of additional character semantics³². This includes names for each character code point and normative properties for it. Character properties, as given in this database, determine the semantics and thus the intended use of a code point or character. The database also contains information that might be needed for correctly processing this character for different purposes. It is an important reference in determining which Unicode code point to use to encode a certain character.

In addition to the printed documentation and lists made available by the Unicode consortium, the information it contains may also be accessed by a number of search systems over the Web (e.g. http://www.eki.ee/letter/). Examples of character properties included in the database include case, numeric value, directionality, and, (where applicable) status as a ‘compatibility character’³³. Where a project undertakes local definition of characters with code points in the PUA, it is desirable that any relevant additional information about the characters concerned should be recorded in an analogous way, as further discussed under 5 Characters, Glyphs, and Writing Modes.

TEI: Issues Arising from the Internal Representations of Unicode¶Issues Arising from the Internal Representations of Unicode

In theory it should not be necessary for encoders to have any knowledge of the various ways in which Unicode code points can be represented internally within a document or in the memory of a processing system, but experience shows that problems frequently arise in this area because of mistaken practice or defective software, and in order to recognize the resulting symptoms and correct their causes an outline knowledge of certain aspects of Unicode internal representation is desirable. There are three encodings of Unicode available for use: UTF-8, which uses 1–4 bytes per character, UTF-16, which uses 2–4, and UTF-32, which uses 4 bytes per character. Current practice for documents to be transmitted via the Web recommends only UTF-8.³⁴

Home

TEI: Encoding Errors Related to UTF-8¶Encoding Errors Related to UTF-8

The code points assigned by Unicode 3.0 and later are notionally 32-bit integers, and the most straightforward way to represent each such integer in computer storage would be to use 4 eight-bit bytes. However, many of the code points for characters most commonly used in Latin scripts can be represented in one byte only and the vast majority of the remainder which are in common use (including those assigned from the most frequently used PUA range) can be expressed in two bytes alone. This accounts for the use of UTF-8 and UTF-16 and their special place in the XML standard. UTF-8 and UTF-16 are ways of representing 32-bit code points in an economical way.

UTF-8 is a variable length encoding: the more significant bits there are in the underlying code point (or in everyday terminology the bigger the number used to represent the character), the more bytes UTF-8 uses to encode it. What makes UTF-8 particularly attractive for representing Latin scripts, explaining its status as the default encoding in XML documents, is that all code points that can be expressed in seven or fewer bits (the 127 values in the original ASCII character set) are also encoded as the same seven or fewer bits (and therefore in a single byte) in UTF-8. That is why a document which is actually encoded in pure 7-bit ASCII can be fed to an XML processor without alteration and without its encoding being explicitly declared: the processor will regard it as being in the UTF-8 representation of Unicode and be able to handle it correctly on that basis.

However, even within the domain of Latin-based scripts, some projects have documents which use characters from 8 bit extensions to ASCII, e.g. those in the ISO-8859-n series of encodings, and the way characters which under ISO-8859-n use all eight bits are encoded in UTF-8 is significantly different, giving rise to puzzling errors. Abstract characters that have a single byte code point where the highest bit is set (that is, they have a decimal numeric representation between 129 and 255) are encoded in ISO-8859-n as a single byte with the same value as the code point. But in UTF-8 code-point values inside that range are expressed as a two byte sequence. That is to say, the abstract character in question is no longer represented in the file or in memory by the same number as its code-point value: it is transformed (hence the T in UTF) into a sequence of two different numbers. Now as a side-effect of the way such UTF-8 sequences are derived from the underlying code-point value, many of the single-byte eight-bit values employed in ISO-8859-n encodings are illegal in UTF-8.

This complicated situation has a simple consequence which can cause great bewilderment. XML processors will effortlessly handle character data in pure 7-bit ASCII without that encoding needing to be declared to the parser, and will similarly accept documents encoded in an undeclared ISO-8859-n encoding if they happen to use no characters outside the strict ASCII subset of the ISO character sets; but the parse will immediately fail if an eight-bit character from an ISO-8859-n set is encountered in the input stream, unless the document's encoding has been explicitly and correctly declared. Explicitly declaring the encoding ought to solve the problem, and if the file is correctly encoded throughout, it will do so. But projects dealing with documents of sufficient age may find that they have to deal with some files encoded in UTF-8 along with others in, say, ISO-8859-1. Such encoding differences may go unnoticed, especially if the proportion of characters where the internal encodings are distinguishable is relatively small (for example in a long English text with a smattering of French words). These types of error may or may not manifest in actual processing errors, and may only become visible as ‘garbage’ characters in the eventual display of documents.

In projects that routinely handle documents in non-Latin scripts, everyone is well aware of the need to ensure correct and consistent encoding, so in such places mixed encoding problems seldom arise, and when they do are readily identified and remedied. Real confusion tends to arise, however, in projects which have a low awareness of the issues because they employ predominantly unaccented Latin characters, with only thinly-distributed instances of accented letters, or other ‘special characters’ where the internal representation under ISO-8859-n and UTF-8 are different (such as the copyright symbol, or, a frequent troublemaker where eventual HTML output is envisaged, the ‘non-breaking space’). Even, or especially, if such projects view themselves as concerned only with English documents, the close relationship between XML and Unicode means they will need to acquire an understanding of these encoding issues and develop procedures which assure consistency and integrity of encoding and its correct declaration, including the use of appropriate software for transcoding and verification.

Notes

Currently BCP 47 comprises two Internet Engineering Task Force documents, referred to separately as RFC 5646 and RFC 4647; over time, other IETF documents may succeed these as the best current practice.

↵

This excludes all attributes where a non-textual datatype has been specified, for example tokens, boolean values, dates, and predefined value lists.

↵

Although only Unicode is mentioned here explicitly, it should be noted that the character repertoire and assigned code points of Unicode and the ISO standard 10646 are identical and maintained in a way that ensures this continues to be the case.

↵

The World Wide Web Consortium provides recommendations for two standard stylesheet languages: either CSS or XSL could be used for this purpose.

↵

Available at http://www.w3.org/TR/charmod/.

↵

available at http://www.unicode.org/reports/tr15/

↵

http://www.unicode.org/ucd/

↵

For further details, see The Unicode Character Property Model (Unicode Technical Report #23), at http://www.unicode.org/reports/tr23/.

↵

See the W3C Internationalization document, Choosing & applying a character encoding at https://www.w3.org/International/questions/qa-choosing-encodings

↵

[English] [Deutsch] [Español] [Italiano] [Français] [日本語] [한국어] [中文]

TEI Guidelines Version 3.6.0. Last updated on 16th July 2019, revision daa3cc0b9. This page generated on 2019-07-16T15:20:49Z.

P5: Guidelines for Electronic Text Encoding and Interchange