TEI CE M 02
TEI Character Set Work Group Meeting Minutes, 2003-11-05/06
Nancy, France

Initials used for people present

SB Syd Bauman
MB Michael Beddow
DB David Birnbaum
LB Lou Burnard
PD Patrick Durusau
CW Christian Wittern

The meeting took place on the afternoon of Wed 05 Nov 03 and all day Thu 06 Nov 03, at the Centre National de la Recherche Scientifique, sponsored by Analyse et Traitement Informatique de la Langue Française (ATILF).

Editor's Report
Languages
Discussion of CE W 06
CE W 01
Schedule

Editor's Report

SB gave a very brief report on the status of P5 and, at the WG's request, a somewhat more detailed report on the progress of SO with respect to ID/IDREF.

Languages

The WG discussed what, if anything, should happen to the lang attribute. It was quickly and unanimously agreed that it has to be possible to specify a language. A few options were discussed, one of which was the idea that the content of lang should be an XPointer, which could point to the <langUsage>, to a project language description file (which might be an <ihs>?), or might (of course) not point anywhere (although the Guidelines may deliberately choose not to mention this).

While this idea has some merit, the WG eventually decided on the use of xml:lang (details below), whose value (we believe) by definition cannot be an XPointer.

Currently, there are two separate but very similar mechanisms for indicating the language of elements in TEI XML documents: the TEI lang attribute and the xml:lang attribute. This is obviously not a desirable situation, mostly because many users get confused, but also for other reasons including the potential for conflict.

While the TEI has more control over the semantics of lang, the xml:lang attribute is an integral part of XML, and thus a) cannot be removed from the picture, and b) is more likely to be supported by software.

There are 2 problems with using xml:lang:

it refers to attribute values in addition to content (the ‘scoping’ problem)
you can refer to only one authority

However, Council has accepted in principle this WG's previous recommendation that those TEI features that are, in P4, expressed via CDATA attributes (other than those that are essentially open-ended lists of tokens or other specific data-types like dates) be moved to elements in P5. Thus the scoping problem disappears.

So our final recommendation is to get rid of lang completely and to use xml:lang instead. The value of xml:lang will (per XML spec per RFC 3066) be either an ISO 639 2- or 3-letter code, an ‘i-’ prepended to an IANA code, or an ‘x-’ prepended to a user-specified code. When an ‘x-’ user-specified code is used, a corresponding <language> element in <teiHeader> with an n attribute whose value matches the value of the xml:lang attribute's ‘x-’ value is required. (For ISO and IANA codes it is optional.)
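As a rough illustration of the ‘x-’ rule above, the following Python sketch lists the user-specified codes that appear in xml:lang without a matching <language> element. The function name, the un-namespaced element names, the <TEI> root, and the sample codes are all assumptions made for the example; nothing here was specified by the WG.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def undeclared_private_codes(root):
    """Return the sorted 'x-' codes used in xml:lang with no <language n="...">."""
    declared = {lang.get("n") for lang in root.iter("language")}
    used = {el.get(XML_LANG) for el in root.iter() if el.get(XML_LANG)}
    return sorted(code for code in used
                  if code.lower().startswith("x-") and code not in declared)


# Hypothetical sample: 'x-oldslav' is declared in the header, 'x-mystery' is not.
SAMPLE = """
<TEI>
  <teiHeader>
    <profileDesc>
      <langUsage>
        <language n="x-oldslav">A project-specific language code</language>
      </langUsage>
    </profileDesc>
  </teiHeader>
  <text xml:lang="en">
    <p xml:lang="x-oldslav">...</p>
    <p xml:lang="x-mystery">...</p>
  </text>
</TEI>
"""

missing = undeclared_private_codes(ET.fromstring(SAMPLE))
print(missing)
```

ISO and IANA codes are deliberately ignored by this check, since a <language> element is optional for them.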

After some discussion, the WG decided to recommend that xml:lang be required on <text>. Failing that, at least it should appear on <text> in all the examples.

It may be best to move the discussion of xml:lang and <langUsage> from HD to CH, with a reference in HD. Eventually the WG rejected this idea though, and agreed to keep discussion in CH.

Discussion of CE W 06

Discussion of whether the properties in <char> and <glyph> should be structured into normative and non-normative (as with original DTD frag), or generic (as per LB's suggestion).

Agreed that value of ucs attribute must be hex number, not binary representation. Decided to use the ‘U+hhhh’ notation, defined by the Unicode standard. ¹
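For illustration only, a value in this notation could be checked and converted along the following lines. The regular expression is the one given in the note to these minutes; the function name and the extra upper-bound check are assumptions added for the sketch.

```python
import re

# Pattern from the note: four to six hex digits, with leading zeros
# allowed only as padding up to four digits.
UCS_PATTERN = re.compile(r"U\+([1-9A-F][0-9A-F]?)?[0-9A-F]{4}")


def parse_ucs(value):
    """Return the code point as an int, or None if value is not in U+hhhh form."""
    if UCS_PATTERN.fullmatch(value) is None:
        return None
    code_point = int(value[2:], 16)
    # The pattern alone would accept e.g. U+FFFFFF; this bound is an added check.
    if code_point > 0x10FFFF:
        return None
    return code_point
```

Under this sketch ‘U+0041’ and ‘U+10FFFF’ are accepted, while ‘U+41’, ‘U+00041’, and lowercase ‘u+0041’ are rejected.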

It is important that CE W 06 discuss the difference between Unicode assigned, not (yet) assigned, and PUA, and forbid use of not assigned.
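The three-way distinction can be sketched with Python's unicodedata module, which reports the general category of a code point. This is only an approximation: the answer depends on the Unicode version bundled with the Python interpreter, and the function names and the policy check are hypothetical, not something the WG defined.

```python
import unicodedata


def classify_code_point(cp):
    """Classify a code point as 'assigned', 'unassigned', or 'pua'.

    Relies on the general category: 'Co' is private use and 'Cn' is
    unassigned (which also covers noncharacters). Everything else,
    including surrogates ('Cs'), is treated as assigned in this sketch.
    """
    category = unicodedata.category(chr(cp))
    if category == "Co":
        return "pua"
    if category == "Cn":
        return "unassigned"
    return "assigned"


def ucs_allowed(cp):
    """Hypothetical policy check: anything except an unassigned code point."""
    return classify_code_point(cp) != "unassigned"
```

A code point that is unassigned today may be assigned in a later Unicode version, so a check like this is only meaningful relative to a stated version of the standard.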

We discussed whether a <char> must have a code-point (PUA or otherwise) or not. Eventually decided, based on CW's ‘broken software’ example (?) to strongly recommend the use of ucs, but not require it.

Agreed, now that ucs is not in a normative property, we no longer need to have a section for normative properties at all.

Agreed that CE W 06 needs to discuss scenario of a locally defined non-Unicode character becoming an approved Unicode character.

The WG first agreed to use the generic <property> element with a name attribute which holds the property name, and a unicode attribute with possible values ‘yes’ and ‘no’, but eventually decided on

<!ELEMENT property ( ( unicodeName | localName ), value ) >
<!ATTLIST property %a.global; type CDATA #IMPLIED >

Several WG members felt uneasy about the necessity of the type attribute, however, as it is hard to imagine a case in which it would provide information not available via the name or other properties.
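To make the content model concrete, here is a small Python sketch that checks a <property> element against ( ( unicodeName | localName ), value ). The function name and the sample property name and value are invented for the example; a validating parser working from the DTD would normally do this job.

```python
import xml.etree.ElementTree as ET


def property_is_valid(prop):
    """Check <property> against ( ( unicodeName | localName ), value )."""
    children = [child.tag for child in prop]
    return (prop.tag == "property"
            and len(children) == 2
            and children[0] in ("unicodeName", "localName")
            and children[1] == "value")


# Hypothetical instances; the property name and value are made up.
good = ET.fromstring(
    "<property><localName>daikanwa</localName><value>36</value></property>")
bad = ET.fromstring(
    "<property><value>36</value><localName>daikanwa</localName></property>")

print(property_is_valid(good), property_is_valid(bad))
```

The second instance fails because the name element must come before <value>.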

Decided to use <equiv> (the semantics of which will need to be expanded a bit) instead of <mapping>.

Discussed <note> v. <remarks> at end of <char> and <glyph> without resolution.

Decided to add an optional <gloss> after the <name> of the <char> or <glyph>.

Removed target of <char> and <glyph>.

Agreed that the naming element child of <char> and <glyph> should continue to be called <name>.

Agreed to put <charDesc> inside <encodingDesc>, editors to determine exact details.

Agreed that we cannot foresee uses of type on <desc> or <figure>, and, since this attribute is not present in the current TEI declaration for these elements, we removed it from the WSD version.

CE W 01

We agree that there is no longer any need to discuss how to encode which writing system you're using, as Unicode does it for you. However, we do need a paragraph explaining the differences between language, writing system, and character encoding. A footnote should explain that in the SGML world we needed a WSD to make up for the fact that character code-points were overloaded.

Action 1: DB write para on xml:lang for MB to put into CE W 01. 2003-11-06

[Note: PD draft a list of topics in P4:2002 Ch 4 and CE W 01 (using current draft on the web, so as not to have to wait for the minor revisions MB is currently working on) for comparison. 2003-11-30]

Part of our discussion revealed the importance of including a discussion of (and warning about) the consequences of using default attributes in your DTD, probably in the modification and extension parts.

Schedule

We would like to have drafts complete by 2003-12-06. CW would like to report on our progress (hopefully done) at the TEI Council call in 2004-01.

The editors, on behalf of the TEI, thank the WG as a whole for its productive efforts, and particularly CW for his leadership of this group.

Notes

¹ ‘... an individual Unicode code point can be expressed as U+n, where n is four to six hexadecimal digits, using the digits 0-9 and uppercase letters A-F (for 10 through 15, respectively). There should be no leading zeros, unless the code point would have fewer than four hexadecimal digits’. Unicode standard 4.0, Preface, section 0.3. I.e., the notation matches the Perlese regexp U\+([1-9A-F][0-9A-F]?)?[0-9A-F]{4}

Last recorded change to this page: 2007-09-16 • For corrections or updates, contact webmaster AT tei-c DOT org