TEI CE M 02
TEI Character Set Work Group Meeting Minutes, 2003-11-05/06
Nancy, France
Initials used for people present
- SB Syd Bauman
- MB Michael Beddow
- DB David Birnbaum
- LB Lou Burnard
- PD Patrick Durusau
- CW Christian Wittern
Meeting took place the afternoon of Wed 05 Nov 03 and all day Thu 06 Nov 03, at Centre National de la Recherche Scientifique, sponsored by Analyse et Traitement Informatique de la Langue Française (ATILF).
Editor's Report
SB gave a very brief report on the status of P5 and, at the WG's request, a somewhat more detailed report on the progress of SO with respect to ID/IDREF.
Languages
The WG discussed what, if anything, should happen to the lang attribute. It was quickly and unanimously agreed that it has to be possible to specify a language. A few options were discussed, one of which was the idea that the content of lang should be an XPointer, which could point to the <langUsage>, to a project language description file (which might be an <ihs>?), or might (of course) not point anywhere (although the Guidelines may deliberately choose not to mention this).
While this idea has some merit, the WG eventually decided on the use of xml:lang (details below), whose value (we believe) by definition cannot be an XPointer.
Currently, there are two separate but very similar mechanisms for indicating the language of elements in TEI XML documents: the TEI lang attribute and the xml:lang attribute. This is obviously not a desirable situation, mostly because many users get confused, but also for other reasons including the potential for conflict.
While the TEI has more control over the semantics of lang, the xml:lang attribute is an integral part of XML, and thus a) cannot be removed from the picture, and b) is more likely to be supported by software.
On the other hand, xml:lang has two drawbacks:
- it refers to attribute values in addition to content (the ‘scoping’ problem)
- you can refer to only one authority
So our final recommendation is to get rid of lang completely and to use xml:lang instead. The value of xml:lang will (per the XML spec, and in turn RFC 3066) be either an ISO 639 2- or 3-letter code, an ‘i-’ prepended to an IANA code, or an ‘x-’ prepended to a user-specified code. When an ‘x-’ user-specified code is used, a corresponding <language> element is required in <teiHeader>, with an n attribute whose value matches the ‘x-’ value of the xml:lang attribute. (For ISO and IANA codes the <language> element is optional.)
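The rule above could be checked mechanically. The following is only a minimal sketch, not an agreed implementation: the function names, the sample document, and the codes x-local1 and x-local2 are hypothetical, elements are assumed to be in no namespace, and the n attribute is assumed to carry the full value including the ‘x-’ prefix (the minutes leave this open). The classifier looks only at the shape of the value; it does not check that a code is actually registered with ISO or IANA.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def classify_lang(value):
    """Classify an xml:lang value by the shape of its primary subtag.

    Returns 'iso', 'iana', 'private', or 'invalid'. Shape check only:
    no lookup against the ISO 639 or IANA registries is attempted.
    """
    primary, _, rest = value.partition("-")
    primary = primary.lower()
    if primary in ("i", "x"):
        if not rest:
            return "invalid"  # 'i' or 'x' alone is not a usable code
        return "iana" if primary == "i" else "private"
    if re.fullmatch(r"[a-z]{2,3}", primary):
        return "iso"
    return "invalid"


def undeclared_private_codes(root):
    """List 'x-' codes used in xml:lang that have no matching <language n="...">.

    Assumption: n holds the full xml:lang value, prefix included.
    """
    declared = {lang.get("n") for lang in root.iter("language")}
    missing = set()
    for elem in root.iter():
        value = elem.get(XML_LANG)
        if value and classify_lang(value) == "private" and value not in declared:
            missing.add(value)
    return sorted(missing)


# Hypothetical sample: x-local1 is declared in the header, x-local2 is not.
SAMPLE = """<TEI.2>
  <teiHeader>
    <profileDesc>
      <langUsage>
        <language n="x-local1">A project-specific language code</language>
      </langUsage>
    </profileDesc>
  </teiHeader>
  <text xml:lang="en">
    <body>
      <p xml:lang="x-local1">...</p>
      <p xml:lang="x-local2">...</p>
    </body>
  </text>
</TEI.2>"""

if __name__ == "__main__":
    print(undeclared_private_codes(ET.fromstring(SAMPLE)))
```

Run against the sample, the check reports only the undeclared code x-local2; ISO and IANA codes such as "en" are never reported, since a <language> element is optional for them.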
After some discussion, the WG decided to recommend that xml:lang be required on <text>. Failing that, at least it should appear on <text> in all the examples.
It may be best to move the discussion of xml:lang and <langUsage> from HD to CH, with a reference in HD. The WG eventually rejected this idea, however, and agreed to keep the discussion in CH.
Discussion of CE W 06
Discussion of whether the properties in <char> and <glyph> should be structured into normative and non-normative (as with original DTD frag), or generic (as per LB's suggestion).
Agreed that value of ucs attribute must be hex number, not binary representation. Decided to use the ‘U+hhhh’ notation, defined by the Unicode standard.
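As an illustration of the ‘U+hhhh’ notation, a value of the ucs attribute might be parsed and produced as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the pattern accepts four to six hex digits, which is how the Unicode standard writes code points up to U+10FFFF.

```python
import re

# 'U+' followed by four to six hexadecimal digits.
UCS_PATTERN = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def parse_ucs(value):
    """Convert a 'U+hhhh' string to an integer code point, or raise ValueError."""
    match = UCS_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"not in U+hhhh notation: {value!r}")
    code_point = int(match.group(1), 16)
    if code_point > 0x10FFFF:
        raise ValueError(f"beyond the Unicode code space: {value!r}")
    return code_point


def format_ucs(code_point):
    """Format an integer code point in 'U+hhhh' notation (at least four digits)."""
    return f"U+{code_point:04X}"


if __name__ == "__main__":
    print(parse_ucs("U+00E9"))   # integer value of LATIN SMALL LETTER E WITH ACUTE
    print(format_ucs(0x1F600))   # a code point outside the Basic Multilingual Plane
```

A binary string such as "11101001" is rejected by parse_ucs, in line with the decision above.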
It is important that CE W 06 discuss the difference between Unicode assigned, not (yet) assigned, and PUA, and forbid use of not assigned.
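The three-way distinction can be approximated with the general category of a code point. The sketch below uses Python's standard unicodedata module; the function names are hypothetical, and the answer depends on the version of the Unicode database bundled with the interpreter, so a code point assigned in a newer Unicode version may still be reported as unassigned. Noncharacters such as U+FFFF also carry the category Cn and are therefore reported as unassigned here.

```python
import unicodedata


def code_point_status(code_point):
    """Return 'private-use', 'unassigned', or 'assigned' for a code point.

    Based on the general category in the interpreter's Unicode database:
    Co is private use, Cn is unassigned (including noncharacters),
    anything else is treated as assigned.
    """
    category = unicodedata.category(chr(code_point))
    if category == "Co":
        return "private-use"
    if category == "Cn":
        return "unassigned"
    return "assigned"


def check_ucs_allowed(code_point):
    """Raise ValueError for code points that are not assigned in this database."""
    status = code_point_status(code_point)
    if status == "unassigned":
        raise ValueError(f"U+{code_point:04X} is not assigned; use a PUA code point instead")
    return status


if __name__ == "__main__":
    for cp in (0x0041, 0xE000, 0x0378):
        print(f"U+{cp:04X}", code_point_status(cp))
```

The version of the database in use can be read from unicodedata.unidata_version, which is worth recording alongside any such check.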
We discussed whether a <char> must have a code-point (PUA or otherwise) or not. Eventually decided, based on CW's ‘broken software’ example (?), to strongly recommend the use of ucs, but not require it.
Agreed that, now that ucs is not in a normative property, we no longer need to have a section for normative properties at all.
Agreed that CE W 06 needs to discuss scenario of a locally defined non-Unicode character becoming an approved Unicode character.
Decided to use <equiv> (the semantics of which will need to be expanded a bit) instead of <mapping>.
Discussed <note> v. <remarks> at end of <char> and <glyph> without resolution.
Decided to add an optional <gloss> after the <name> of the <char> or <glyph>.
Removed target of <char> and <glyph> .
Agreed that the naming element child of <char> and <glyph> should continue to be called <name>.
Agreed to put <charDesc> inside <encodingDesc>, editors to determine exact details.
Agreed that we cannot foresee uses of type on <desc> or <figure>, and, since this attribute is not present in the current TEI declaration for these elements, we removed them from the WSD version.
CE W 01
We agree that there is no longer any need to discuss how to encode which writing system you're using, as Unicode does it for you. However, we do need a paragraph explaining the differences between language, writing system, and character encoding. A footnote should explain that in the SGML world we needed a WSD to make up for the fact that character code-points were overloaded.
Part of our discussion revealed the importance of including a discussion of (and warning about) the consequences of using default attributes in your DTD, probably in the modification and extension parts.
Schedule
We would like to have drafts complete by 2003-12-06. CW would like to report on our progress (hopefully done) at the TEI Council call in 2004-01.
The editors, on behalf of the TEI, thank the WG as a whole for its productive efforts, and particularly CW for his leadership of this group.