17. Technical Documentation

Although the focus of this document is on the use of the TEI scheme for the encoding of existing 'pre-electronic' documents, the same scheme may also be used for the encoding of new documents. In the preparation of new documents (such as this one), XML has much to recommend it: the document's structure can be clearly represented, and the same electronic text can be re-used for many purposes, for example to provide both online hypertext or browsable versions and well-formatted typeset versions from a common source.

To facilitate this, a small number of additional elements are included in TEI Lite as extensions of the main TEI DTD, for use in marking particular features of technical documents in general, and of XML-related documents in particular.

17.1. Additional Elements for Technical Documents

The following elements may be used to mark particular features of technical documents:

<eg>

contains a single short example of some technical topic being discussed, e.g. a code fragment or a sample of XML encoding.

<code>

contains a short fragment of code in some formal language (often a programming language).

<ident>

contains an identifier of some kind, e.g. a variable name or the name of an XML element or attribute.

<gi>

contains a special type of identifier: an XML generic identifier, or element name.

<kw>

contains a keyword in some formal language.

<formula>

contains a mathematical or chemical formula, optionally presented in some non-XML notation. Attributes include:

notation: specifies the notation used to represent the body of the formula. Default value is tex, meaning the formula is represented using the TeX typesetting system.

The following example shows how these elements might be used to encode a passage from a tutorial introducing the Fortran programming language:

<p>It is traditional to introduce a language with a program like the
following:
<eg>
   CHAR*12 GRTG
   GRTG = 'HELLO WORLD'
   PRINT *, GRTG
   END
</eg></p>
<p>This simple example first declares a variable <ident>GRTG</ident>, in
the line <code>CHAR*12 GRTG</code>, which identifies <ident>GRTG</ident>
as consisting of 12 bytes of type <kw>CHAR</kw>.  To this variable,
the value <mentioned>HELLO WORLD</mentioned>
is then assigned. This is followed by a <kw>PRINT</kw> statement and an
<kw>END</kw> statement.</p>

A formatting application, given a text like that above, can be instructed to format examples appropriately (e.g. to preserve line breaks, or to use a distinctive font). Similarly, the use of tags such as <ident> and <kw> greatly facilitates the construction of a useful index.

The <formula> element should be used to enclose a mathematical or chemical formula presented within the text as a distinct item. Since formulae generally include a large variety of special typographic features not otherwise present in ordinary text, it will usually be necessary to present the body of the formula in a specialized notation. The notation used should be specified by the notation attribute, as in the following example:

<formula notation="tex">
  \(E = mc^{2}\)
</formula>

The tex notation (TeX) is not predefined in the TEI Lite DTD, and must therefore be declared by a notation declaration within the DTD subset.
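
As a rough sketch of what such a declaration might look like, the document type declaration below adds a notation named tex in its internal DTD subset. It assumes the TEI.2 root element; the system identifiers shown (teixlite.dtd as a local copy of the DTD, and texprocessor as the identifier of the notation) are hypothetical values chosen for illustration, not values prescribed by TEI Lite.

```xml
<!DOCTYPE TEI.2 SYSTEM "teixlite.dtd" [
  <!-- Declares the notation referred to by notation="tex".
       The system identifier is a hypothetical placeholder. -->
  <!NOTATION tex SYSTEM "texprocessor">
]>
```

With a declaration of this kind in place, the name tex used as the value of the notation attribute refers to a declared notation.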

A particular problem arises when XML encoding is the subject of discussion within a technical document which is itself encoded in XML. In such a document, it is essential to distinguish the markup occurring within examples from the markup of the document itself, particularly since tags (including end-tags) are highly likely to occur within the examples. One simple solution is to use the predefined entity reference lt to represent each < character which marks the start of an XML tag within the examples. A more general solution is to mark off the whole body of each example as containing data which is not to be scanned for XML markup by the parser. This is achieved by enclosing it within a special XML construct called a CDATA marked section, as in the following example:

<p>A list should be encoded as follows:
<eg><![CDATA[
   <list>
   <item>First item in the list</item>
   <item>Second item</item>
   </list>
]]>
</eg>
The <gi>list</gi> element consists of a series of <gi>item</gi>
elements.</p>

The <list> element used within the example above will not be regarded as forming part of the document proper, because it is embedded within a marked section (beginning with the special markup declaration <![CDATA[ and ending with ]]>).

Note also the use of the <gi> element to tag references to element names (or generic identifiers) within the body of the text.
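
For comparison, the simpler solution mentioned earlier, using the predefined entity reference lt, might look as follows for the same list example. This is only an illustrative sketch: each < that would start a tag inside the example is written as &lt;, so the parser treats the example as ordinary character data.

```xml
<p>A list should be encoded as follows:
<eg>
   &lt;list>
   &lt;item>First item in the list&lt;/item>
   &lt;item>Second item&lt;/item>
   &lt;/list>
</eg></p>
```

This works well for short examples, but quickly becomes hard to read and maintain, which is why the CDATA marked section is the more general choice.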

17.2. Generated Divisions

Most modern document production systems have the ability to generate automatically whole sections such as a table of contents or an index. The TEI Lite scheme provides an element to mark the location at which such a generated section should be placed.

<divGen>

indicates the location at which a textual division generated automatically by a text-processing application is to appear. Attributes include:

type: specifies what type of generated text division (e.g. index, table of contents, etc.) is to appear. Sample values include: index (an index is to be generated and inserted at this point), toc (a table of contents), figlist (a list of figures), and tablist (a list of tables).

The <divGen> element can be placed anywhere that a division element would be legal, as in the following example:

<front>
<titlePage> ... </titlePage>
<divGen type="toc"/>
<div type="Preface"><head>Preface</head> ... </div>
</front>
<body> ... </body>
<back>
<div1><head>Appendix</head> ... </div1>
<divGen type="index" n="Index"/>
</back>

This example also demonstrates the use of the type attribute to distinguish the different kinds of division to be generated: in the first case a table of contents (a toc) and in the second an index.

When, for some reason, an existing index or table of contents is to be encoded (rather than generated), the <list> element discussed in section 12. Lists should be used.
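
For instance, a table of contents transcribed from a printed source might be encoded along the following lines. The division type value contents, the heading, and the entries are all invented for illustration.

```xml
<front>
<div type="contents"><head>Contents</head>
<list>
<item>Preface</item>
<item>Chapter 1: The Voyage Out</item>
<item>Chapter 2: The Return</item>
</list>
</div>
</front>
```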

17.3. Index Generation

While production of a table of contents from a properly tagged document is generally unproblematic for an automatic processor, the production of a good quality index will often require more careful tagging. It may not be enough simply to produce a list of all parts tagged in some particular way, although extracting (for example) all occurrences of elements such as <term> or <name> will often be a good starting point for an index.

The TEI DTD provides a special-purpose <index> tag which may be used both to mark the parts of the document which should be indexed and to specify how the indexing should be done.

<index>

marks a location to be indexed for some purpose. Attributes include:

level1: gives the main form of the index entry.
level2: gives the second-level form, if any.
level3: gives the third-level form, if any.
level4: gives the fourth-level form, if any.
index: indicates which index (of several) the index entry belongs to.

For example, the second paragraph of this section might include the following:

...
TEI Lite also provides a special-purpose <gi>index</gi> tag
<index level1="indexing"/>
<index level1="index (tag)" level2="use in index generation"/>
which may be used ...

The <index> element can also be used to provide a form of interpretive or analytic information. For example, in a study of Ovid, it might be desired to record all the poet's references to different figures, for comparative stylistic study. In the following lines of the Metamorphoses, such a study would record the poet's references to Jupiter (as deus, se, and as the subject of confiteor [in inflectional form number 227]), to Jupiter-in-the-guise-of-a-bull (as imago tauri fallacis and the subject of teneo), and so on.

<l n="3.001">iamque deus posita fallacis imagine tauri</l>
<l n="3.002">se confessus erat Dictaeaque rura tenebat</l>

This need might be met using the <note> element discussed in section 7. Notes, or with the <interp> element discussed in section 16. Interpretation and Analysis. Here we demonstrate how it might also be satisfied by using the <index> element.

We assume that the object is to generate more than one index: one for names of deities (called dn), another for onomastic references (called on), a third for pronominal references (called pr) and so forth. One way of achieving this might be as follows:

<l n="3.001">iamque deus posita fallacis imagine tauri
     <index index="dn" level1="Iuppiter" level2="deus"/>
     <index index="on" level1="Iuppiter (taurus)"
                       level2="imago tauri fallacis"/></l>
<l n="3.002">se confessus erat Dictaeaque rura tenebat
     <index index="pr"    level1="Iuppiter" level2="se"/>
     <index index="v"     level1="Iuppiter" level2="confiteor (v227)"/>
     <index index="mons"  level1="Dicte" level2="rura Dictaea"/>
     <index index="regio" level1="Creta" level2="rura Dictaea"/>
     <index index="v"     level1="Iuppiter (taurus)"
                          level2="teneo (v9)"/></l>

For each <index> element above, an entry will be generated in the appropriate index, using as headword the value of the level1 attribute, and as secondary keyword that of the level2 attribute, which contains the word cited in nominative form. The actual reference will be taken from the context in which the <index> element appears, i.e. in this case the identifier of the <l> element containing it.
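
Given the entries above, a processor might assemble the index called v into something like the following. This is only a sketch of one possible output: the division type, the heading, the use of nested <list> elements, and the layout of the references are assumptions, and a real processor is free to present the result differently.

```xml
<div type="index"><head>Index v</head>
<list>
<item>Iuppiter
  <list>
  <item>confiteor (v227): 3.002</item>
  </list></item>
<item>Iuppiter (taurus)
  <list>
  <item>teneo (v9): 3.002</item>
  </list></item>
</list>
</div>
```

Each headword comes from a level1 value, each sub-entry from the corresponding level2 value, and the reference 3.002 from the n attribute of the enclosing <l> element.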

Date: (revised October 2004) Author: Lou Burnard (revised SPQR).
Copyright TEI 1995