TEI meets Unicode

Using Unicode in XML and other markup languages

  • The Unicode Consortium and the W3C jointly authored a document that outlines issues to be aware of when using Unicode in markup languages (Unicode Technical Report #20; W3C Note 15 December 2000)
  • This document makes important recommendations on how to use Unicode in a markup context. The main issues are:
    • Linear versus structured documents
    • Conflict of markup constructs and control structures in the character encoding (e.g. line breaks, paragraph breaks)
  • The document contains a list of characters that are unsuitable for use in markup for one or more of the following reasons:
      • They are deprecated in the Unicode Standard (e.g. they were introduced for compatibility with existing standards and should not be used in newly created documents).
      • They are unsupportable without additional data (e.g. Object Replacement Character, U+FFFC).
      • They are difficult to handle because they are stateful (e.g. bidirectional markers).
      • They are better handled by markup (e.g. language tags).
      • They are undesirable because they conflict with equivalent markup (e.g. fractions, superscript and subscript characters).
  • The report gives a very detailed account of these problems, and its recommendations should be followed when using Unicode in TEI.
  • TEI should develop clear recommendations on how to deal with dual presentation of information (e.g. in the character encoding and in markup) and how to avoid it.
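
The categories of problematic characters listed above can be illustrated with a short check. The following is a minimal Python sketch, not the report's own list: the ranges in FLAGGED_RANGES are a small illustrative subset chosen for this example, and the names FLAGGED_RANGES, COMPAT_TAGS and flag_characters are hypothetical. Superscripts, subscripts and fractions are detected through the compatibility decomposition tags in the Unicode Character Database, which is an assumption about how one might approximate the "conflict with equivalent markup" category.

```python
import unicodedata

# Illustrative subset only (an assumption of this sketch); consult
# Unicode Technical Report #20 itself for the authoritative list.
FLAGGED_RANGES = [
    (0x2028, 0x2029, "line/paragraph separator: conflicts with markup structure"),
    (0x202A, 0x202E, "bidirectional embedding control: stateful"),
    (0xFFFC, 0xFFFC, "object replacement character: needs additional data"),
    (0xE0000, 0xE007F, "tag character: better handled by markup"),
]

# Compatibility decomposition tags that signal a character duplicating
# something markup can express (dual presentation of information).
COMPAT_TAGS = {
    "<super>": "superscript form: conflicts with equivalent markup",
    "<sub>": "subscript form: conflicts with equivalent markup",
    "<fraction>": "fraction form: conflicts with equivalent markup",
}


def flag_characters(text):
    """Return (offset, code point label, reason) for each flagged character."""
    findings = []
    for offset, ch in enumerate(text):
        cp = ord(ch)
        reason = None
        for low, high, why in FLAGGED_RANGES:
            if low <= cp <= high:
                reason = why
                break
        if reason is None:
            # decomposition() returns e.g. "<super> 0032" for U+00B2,
            # or an empty string when the character has no decomposition.
            decomp = unicodedata.decomposition(ch)
            tag = decomp.split(" ", 1)[0] if decomp else ""
            reason = COMPAT_TAGS.get(tag)
        if reason is not None:
            findings.append((offset, f"U+{cp:04X}", reason))
    return findings


if __name__ == "__main__":
    sample = "x\u00b2 \u00bd\ufffc\u202e"
    for offset, label, reason in flag_characters(sample):
        print(offset, label, reason)
```

A checker of this kind only reports candidates. Whether a flagged character should be replaced by markup (for instance a superscript digit by an element carrying a rendition hint) is exactly the kind of decision the TEI recommendations would need to settle.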