Text Encoding Initiative

18. Character Sets, Diacritics, etc.


With the advent of XML and its adoption of Unicode as the required character set for all documents, most problems previously associated with representing the diverse languages and writing systems of the world are greatly reduced. For those working with standard forms of the European languages in particular, almost no special action is needed: any XML editor should enable you to input accented letters or other 'non-ASCII' characters directly, and they should be stored in the resulting file in a form that transfers directly between different systems, whether as Unicode characters or as character entity references.
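As a minimal sketch of the last point, the following Python fragment parses the same word stored once as a literal Unicode character and once as a numeric character reference. It uses the standard library module xml.etree.ElementTree; the sample paragraph is invented for illustration. An XML parser should deliver identical text in both cases.

```python
import xml.etree.ElementTree as ET

# The same content stored two ways: as a literal character and as a
# numeric character reference (U+00E9 is e with acute accent).
literal = ET.fromstring("<p>caf\u00e9</p>")
reference = ET.fromstring("<p>caf&#xE9;</p>")

# Both forms yield the same text once parsed.
same_text = literal.text == reference.text
print(same_text)
```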

For compatibility with older systems, however, the TEI Lite DTD includes declarations for a number of the most widely used character entities, so that such characters may be entered and saved as character mnemonics.

You may use your own entity names in TEI-conformant files, if you wish and if you provide entity declarations for them, mapping the name to the appropriate Unicode value. The standard names (though long-winded) have the advantage of clarity; the characters intended are reasonably clear to any speaker of English who recognizes that a character is being named, often even without recourse to any list. This is not true of many older schemes for representing accented characters.
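A user-defined entity of this kind might be declared and used as sketched below. The entity name wcirc and the sample document are assumptions chosen for illustration; they are not taken from the TEI Lite DTD. The declaration sits in the document's internal DTD subset and maps the name to U+0175. Python's standard xml.etree.ElementTree is used here only to confirm that a parser resolves the name.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the internal DTD subset declares the entity
# "wcirc" and maps it to U+0175 (w with circumflex).
document = """<!DOCTYPE p [
  <!ENTITY wcirc "&#x175;">
]>
<p>d&wcirc;r</p>"""

# The parser replaces the entity reference with the declared character.
paragraph = ET.fromstring(document)
print(paragraph.text == "d\u0175r")
```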

When the character you need does not appear in the public entity sets, you may wish to generate a name following the naming conventions of the ISO public entity sets, as described here:

digraphs
Form entity names for digraphs by appending the string lig to the letters forming the digraph. If a capitalized form is required, both letters are given in upper case (remember that case is usually significant in entity names). E.g.: aelig (æ), AElig (Æ), szlig (ß).
diacritics and accents
Form entity names for accented letters in most Western European languages by appending one of the following strings to the letter bearing the accent, which may be in upper or lower case.
umlaut
use uml for umlaut or trema: e.g. auml (ä), Auml (Ä), euml (ë), iuml (sic: ï), ouml (ö), Ouml (Ö), uuml (ü), Uuml (Ü).
acute
use acute for acute or stressed accent: e.g. aacute (á), eacute (é), Eacute (É), iacute (í), oacute (ó), uacute (ú).
grave
use grave for grave accent: e.g. agrave (à), egrave (è), igrave (ì), ograve (ò), ugrave (ù).
circumflex
use circ for circumflex: e.g. acirc (â), ecirc (ê), Ecirc (Ê), icirc (î), ocirc (ô), ucirc (û).
tilde
use tilde for tilde: e.g. atilde (ã), Atilde (Ã), ntilde (ñ), Ntilde (Ñ), otilde (õ), Otilde (Õ).
consonants
The following are recommended entity names for some special consonants found in Western European languages: ccedil (ç), Ccedil (Ç), eth (lowercase eth or Anglo-Saxon/Icelandic crossed d), ETH (uppercase eth), thorn (lowercase thorn), THORN (uppercase thorn), szlig (German s-z ligature or eszett, ß).
punctuation marks
The following are recommended entity names for some commonly found punctuation marks: ldquo (left double quotation mark, in shape of superscript 66), rdquo (right double quotation mark, superscript 99), mdash (one-em dash), hellip (horizontal ellipsis, three closely spaced dots), rsquo (right single quote, in shape of superscript 9).
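
The naming conventions above can be sketched in Python as follows. The helper names entity_name, accented_char and entity_declaration, and the SUFFIXES table, are hypothetical and introduced only for illustration; the combining-character code points are standard Unicode values. The sketch builds a name from a base letter (or digraph) plus a suffix, checks a few generated names against the HTML entity table shipped with Python (which, as far as this sketch assumes, reuses the ISO Latin 1 names), and derives a declaration for a character outside that table.

```python
import unicodedata
from html.entities import name2codepoint

# Suffix -> Unicode combining character for that accent.
SUFFIXES = {
    "uml": "\u0308",    # umlaut or trema
    "acute": "\u0301",
    "grave": "\u0300",
    "circ": "\u0302",   # circumflex
    "tilde": "\u0303",
}

def entity_name(letters, suffix):
    """Append the suffix to the letter(s); the case of the letters is kept."""
    return letters + suffix

def accented_char(letter, suffix):
    """Compose the letter and accent into a single precomposed character."""
    return unicodedata.normalize("NFC", letter + SUFFIXES[suffix])

def entity_declaration(letter, suffix):
    """Build a declaration mapping the generated name to its Unicode value."""
    char = accented_char(letter, suffix)
    if len(char) != 1:
        raise ValueError("no single precomposed character for this combination")
    return '<!ENTITY %s "&#x%X;">' % (entity_name(letter, suffix), ord(char))

# Generated names agree with the standard table where both exist.
for letter, suffix in [("a", "uml"), ("E", "acute"), ("n", "tilde")]:
    name = entity_name(letter, suffix)
    assert chr(name2codepoint[name]) == accented_char(letter, suffix)

# Digraphs take the suffix "lig"; both letters are upper case when capitalized.
assert entity_name("ae", "lig") in name2codepoint
assert entity_name("AE", "lig") in name2codepoint

# A character absent from that table: w with circumflex (U+0175).
print(entity_declaration("w", "circ"))
```

The printed declaration can be placed in a document's DTD subset so that the generated name resolves to the intended character.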


Date: (revised October 2004) Author: Lou Burnard (revised SPQR).
Copyright TEI 1995