Author: Espen S. Ore
National Library of Norway, Oslo Division
Date: 2002-08-27
Language identification in TEI P4 is now defined in the teiHeader and, further, in Chapter 4 under code switching.
Chapter 4 recommends that language codes be taken from the ISO 639 two- or three-letter code sets, and from the SIL Ethnologue where a language is missing from ISO 639.
Chapter 4 of TEI P4 suggests that finer divisions within a language than those available can be made by adding a suffix to the ISO 639 code. It is not clear whether this is meant as an alternative to the SIL Ethnologue or whether the recommendations are meant to be applied in a fixed order of preference.
Language identification in Chapter 25 (the chapter describing the WSD, the Writing System Declaration) uses an iso639 attribute, which is supposed to hold only values from ISO 639-1 or 639-2. There is no direct link here to a language value that would be a natural candidate for the id attribute of one of the <language> elements in the teiHeader's <langUsage> element, since no formalism is suggested for languages that are not in ISO 639.
If we leave the WSD aside for the time being and look at the <language> elements in the teiHeader's <langUsage> element, the following would be both clearer and more flexible than the current suggestions:
<langUsage>
  <language id="OBG-CYR">
    <langAuth name="ISO 639-1" code="cu"/>
    <langAuth name="ISO 639-2" code="chu"/>
    <langAuth name="SIL" code="SLN"/>
    <p>Old Bulgarian, written in Cyrillic script.</p>
  </language>
</langUsage>
(It is not clear where the suggested language code "OBG" in the P4 comes from since it is neither valid SIL nor ISO 639-2.)
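To illustrate how such a declaration could be consumed, here is a minimal Python sketch that reads the proposed <langUsage> structure and builds a table from each project-defined id to its authority codes. The function names authority_codes and ids_for_code are hypothetical, and the sketch assumes the <langAuth> element proposed here, which is not part of P4.

```python
import xml.etree.ElementTree as ET

# Sample following the <language>/<langAuth> structure proposed in this note.
SAMPLE = """
<langUsage>
  <language id="OBG-CYR">
    <langAuth name="ISO 639-1" code="cu"/>
    <langAuth name="ISO 639-2" code="chu"/>
    <langAuth name="SIL" code="SLN"/>
    <p>Old Bulgarian, written in Cyrillic script.</p>
  </language>
</langUsage>
"""


def authority_codes(lang_usage_xml):
    """Map each project-defined language id to its {authority: code} pairs."""
    root = ET.fromstring(lang_usage_xml)
    table = {}
    for language in root.iter("language"):
        table[language.get("id")] = {
            auth.get("name"): auth.get("code")
            for auth in language.findall("langAuth")
        }
    return table


def ids_for_code(table, authority, code):
    """Return the project ids whose <langAuth> entries match authority/code."""
    return [lang_id for lang_id, codes in table.items()
            if codes.get(authority) == code]


codes = authority_codes(SAMPLE)
print(codes["OBG-CYR"]["ISO 639-2"])  # chu
```

Because the lookup goes through the authority name, a project could switch between ISO 639 and SIL codes without changing the ids used in the text itself.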
Below is a full DTD example of how <language> and <langAuth> could be defined. I am basing this example on a DTD generated by the Pizza Chef on Aug. 27, 2002, using a mixed base with everything checked, so I have not tried to analyze which content and attribute entities this includes:
<!ELEMENT language (#PCDATA | abbr | address | date | dateRange | dateStruct
    | expan | geogName | lang | measure | name | num | orgName | persName
    | placeName | rs | time | timeRange | timeStruct | add | app | corr
    | damage | del | orig | reg | restore | sic | space | supplied | unclear
    | oRef | oVar | pRef | pVar | formula | fw | handShift | distinct | emph
    | foreign | gloss | hi | mentioned | soCalled | term | title | ptr | ref
    | xptr | xref | caesura | c | cl | m | phr | s | seg | w | anchor
    | addSpan | delSpan | gap | alt | altGrp | certainty | fLib | fs | fsLib
    | fvLib | index | interp | interpGrp | join | joinGrp | link | linkGrp
    | respons | span | spanGrp | timeline | cb | lb | milestone | pb
    | langAuth)* >
<!ATTLIST language
    group    CDATA   #IMPLIED
    grpPtr   IDREF   #IMPLIED
    depend   CDATA   #IMPLIED
    depPtr   IDREF   #IMPLIED
    corresp  IDREFS  #IMPLIED
    synch    IDREFS  #IMPLIED
    sameAs   IDREF   #IMPLIED
    copyOf   IDREF   #IMPLIED
    next     IDREF   #IMPLIED
    prev     IDREF   #IMPLIED
    exclude  IDREFS  #IMPLIED
    select   IDREFS  #IMPLIED
    ana      IDREFS  #IMPLIED
    id       ID      #IMPLIED
    n        CDATA   #IMPLIED
    lang     IDREF   #IMPLIED
    rend     CDATA   #IMPLIED
    usage    CDATA   #IMPLIED
    TEIform  CDATA   "language" >
<!ELEMENT langAuth EMPTY>
<!ATTLIST langAuth
    group    CDATA   #IMPLIED
    grpPtr   IDREF   #IMPLIED
    depend   CDATA   #IMPLIED
    depPtr   IDREF   #IMPLIED
    corresp  IDREFS  #IMPLIED
    synch    IDREFS  #IMPLIED
    sameAs   IDREF   #IMPLIED
    copyOf   IDREF   #IMPLIED
    next     IDREF   #IMPLIED
    prev     IDREF   #IMPLIED
    exclude  IDREFS  #IMPLIED
    select   IDREFS  #IMPLIED
    ana      IDREFS  #IMPLIED
    id       ID      #IMPLIED
    n        CDATA   #IMPLIED
    lang     IDREF   #IMPLIED
    rend     CDATA   #IMPLIED
    name     CDATA   #IMPLIED
    code     CDATA   #IMPLIED
    TEIform  CDATA   'langAuth' >
This means that one can always add as many language names from authority lists as one wishes, while the id attribute of the <language> element can be chosen freely according to what is meaningful for a project.
The wsd attribute has been removed from the <language> element, since I have come back to my basic view that this kind of information belongs in the header. It could be given either as free text or in some simple structure or formalism that we would have to define. (The original WSD gave an impression of formalism, and so looked as if it could be used for machine processing, but I do not think it ever really was.)
The lang attribute within various TEI text elements is useful not only for marking language switching but also for identifying text-parts as being in any given (defined) language. One very simple use for this is to generate separate search/index functions for different languages.
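As an illustration of the indexing use mentioned above, here is a minimal Python sketch that collects the text of a document into one bucket per language, treating the lang attribute as inherited from the nearest ancestor that carries one. The sample text, the ids "ENG" and "NOR", and the function name text_by_language are hypothetical and only serve the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment; the lang values stand for ids declared in <langUsage>.
SAMPLE_TEXT = """
<text lang="ENG">
  <p>The manuscript opens with <foreign lang="OBG-CYR">slovo</foreign> and continues.</p>
  <p lang="NOR">En kort merknad.</p>
</text>
"""


def text_by_language(xml_source, default=None):
    """Group character data by the lang value in effect where it occurs."""
    root = ET.fromstring(xml_source)
    buckets = defaultdict(list)

    def walk(element, inherited):
        # An element without its own lang attribute inherits the parent's.
        current = element.get("lang", inherited)
        if element.text and element.text.strip():
            buckets[current].append(element.text.strip())
        for child in element:
            walk(child, current)
            # Tail text follows the child but belongs to this element,
            # so it is filed under this element's language.
            if child.tail and child.tail.strip():
                buckets[current].append(child.tail.strip())

    walk(root, default)
    return dict(buckets)


for lang_id, chunks in text_by_language(SAMPLE_TEXT).items():
    print(lang_id, chunks)
```

Each bucket could then be passed to a language-specific tokenizer or index builder; that step is left out here because it depends on the search software a project uses.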
The current P4, chapter 4 says:
“Any XML document may use an additional attribute xml:lang, the value of which is the identifier of a language from ISO 639 or registered with IANA. According to the XML Recommendation, the scope of this attribute is ‘considered to apply to all attributes and contents of the element where it is specified, unless overriden with an instance of xml:lang on another element within that content.’ (XML Recommendation, 2.12). Since the TEI DTD defines a great number of CDATA attributes with predeclared content in English, xml:lang cannot be used by TEI documents as intended in the XML recommendation. The current version of these Guidelines does not recommend use of the xml:lang attribute as a means of indicating language shifts; the TEI global lang attribute should instead be used for this purpose. This recommendation will be reviewed at the next revision of these Guidelines.”
I suggest that this is changed to:
“These Guidelines do not recommend use of the xml:lang attribute as a means of indicating language shifts; the TEI global lang attribute should instead be used for this purpose.”