Rules for Use of TEI Lite in CIC E-Text Projects

C. M. Sperberg-McQueen

Mark Olsen

John Price-Wilkin

Perry Willett

For the CIC Working Group on Electronic Texts

31 October 1996

This unpublished document is distributed privately for comment by friends and colleagues; it is not now a formal publication and should not be quoted in published material.

1 Introduction
2 Tag Sets
3 Encoding Practice
4 Quality Assurance
- 4.1 Validation
- 4.2 Proofreading
5 Technical Details
6 Other Documents

Status of This Document

This document is a draft specification of the encoding scheme to be used in electronic texts created by cooperative projects of the Committee on Institutional Cooperation (CIC). It has not yet been approved by the CIC or anyone else. It has been drafted by the authors named on the title page for consideration by the appropriate bodies, and should not be taken as a final product until those bodies have revised and approved it, and this note is removed.

On some topics, specific proposals are made in this document. Like the document as a whole, these proposals have not yet been approved formally and are subject to discussion and change. They are not final. Other topics are identified explicitly as open issues, on which discussion and decisions are needed; no proposals are made for dealing with open issues. A list of open issues appears in the appendix for convenient consultation.

The current partial draft of this document was prepared by C. M. Sperberg-McQueen on the basis of discussions with the other named authors. In completing and finishing the document, the following steps are expected:

this partial draft was placed on the network for consideration by the authors as a group
the draft was revised based on the comments of the authors
on 5 February 1996, the completed draft will be made available to the CIC working group on e-texts for their consideration (oops -- we're late)
based on their comments, the document will be revised
the revised document will be circulated to those involved in the CIC Initiative on Learning Technologies for their consideration
the document will be revised again to reflect the requirements of the Initiative on Learning Technologies as well as those of the Working Group on E-Texts
further review and revisions may be needed as the proposal for the American Corpus proceeds through the CIC

The following things remain to be done before this document is approved by the CIC working group on e-texts:

read TEI header chapter carefully to ensure we address at least all the issues of encoding practice discussed there
read TEI Lite documentation looking for topics which need to be addressed
fill out the explanatory sections
add examples of TEI headers with encoding levels declared (following the pattern now shown in Font Shifts)
settle the open issues (mostly indicated here by text in square brackets)

1 Introduction

[Discuss background: CIC American Corpus project, CIC, TEI, audience to be served by the e-text collection, expected work processes...]

The CICTEI encoding scheme is based on the TEI Lite subset of TEI, documented in Lou Burnard and C. M. Sperberg-McQueen, "TEI Lite: An Introduction to Text Encoding for Interchange" (document TEI U5), which can be found at http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei or http://www-tei.uic.edu/orgs/tei/intros/teiu5.html. Familiarity with the basics of SGML encoding and the TEI encoding scheme is assumed; readers who wish to learn the basics are referred to the TEI home page, especially to the list of tutorials and introductions.

2 Tag Sets

The following TEI tag sets are selected:

base tag set for prose
additional tag set for linking and alignment
additional tag set for simple analysis
additional tag set for tables, formulae, and graphics

Full documentation of these tag sets may be found in the TEI guidelines: Association for Computers and the Humanities (ACH), Association for Computational Linguistics (ACL), and Association for Literary and Linguistic Computing (ALLC). Guidelines for Electronic Text Encoding and Interchange, ed. C. M. Sperberg-McQueen and Lou Burnard. Chicago, Oxford: Text Encoding Initiative, 1994.

3 Encoding Practice

This section describes CIC encoding practice in particular areas. In many areas, several different levels of practice are defined; unless otherwise stated, all electronic texts created by CIC projects will adhere at least to the lowest defined level. In some cases, different minimum standards may apply to electronic texts acquired by the CIC from other sources and edited into SGML form. Such differences are always noted; if nothing is said, no distinction is made between work created by the CIC and work acquired from other sources and mounted by the CIC on CIC-wide servers. Documents acquired from other sources in SGML form may or may not be modified to meet minimal CIC standards; if they are not so modified, they may be described as being at `encoding level 0'.

No global encoding levels are defined; a text may be at level 1 with respect to notes, and level 4 with respect to quotations. A full description of the encoding of a given text thus requires that its level be specified for each area defined here.

If documents need to be characterized in terms of single numbers, then the following overall characterizations may be used:

Level 1 is a minimal encoding, intended to provide as much information as is feasible with minimal cost for data entry, manual intervention, and quality control
Level 2 adds some information not present in level 1, but requires only minimal manual intervention
Level 3 includes further information not present in level 2 and may in some cases require manual intervention by intelligent readers of the text

If the text is at different levels in different respects, its overall level is the same as its lowest score in any individual aspect: a text which is at level 1 with respect to any area will be classified as level 1, etc.

In general, higher levels either are more exhaustive in identifying occurrences of specific textual features (e.g. quotation), or more complete in describing them, or both. The definition of the different levels thus always specifies both the recognition criteria for the element in question and the analytic detail to be supplied at each level.

The CICTEI encoding scheme requires that the encoding practice of each electronic text be documented in its TEI header. This document describes CIC encoding practice in prose, and also gives examples of elements in the TEI header which are to be used to document CIC practice. Since the encoding practice will be consistent for most texts produced by the CIC, the <encodingDesc> elements of most CIC texts can be substantially the same; the same boilerplate text can be used for all texts at a particular level. This document defines SGML entities for use in headers of CIC documents, which will make it easier to create the TEI headers. See examples below.

This document describes two versions of the CICTEI DTD: a `level-1' version which includes only the elements required by level-1 tagging, and a full version which includes all the elements selected. The level-1 DTD is intended only for training purposes.

3.1 TEI Header

In CIC electronic documents, the TEI header will have somewhat less flexibility than in the full TEI scheme. The primary goal is to enforce a higher minimum standard of documentation, and to make it easier for unintelligent software to identify an author, title, date, and source edition for any text. Headers which conform to the CICTEI specification will automatically also conform to the TEI Lite specification and the base TEI specification.

It would also be possible to rewrite the header and provide new tags with unique names for what we regard as the critical bits of information. This is not done here because it seems unnecessary to depart from the basic TEI DTD in this way.

At the top level, all four parts of the header will be required, rather than optional. (Since much of the content will be constant across CIC documents, this will not impose as much burden on encoders as might be supposed.) Formally:

< 1 CIC TEI Header > =

 
<!ELEMENT teiHeader  - - (fileDesc, encodingDesc+, profileDesc+,
                          revisionDesc) > 
<!ATTLIST teiHeader          %a.global;
          type               CDATA               text
          creator            CDATA               #IMPLIED
          status             (new | update)      new
          date.created       CDATA               #IMPLIED
          date.updated       CDATA               #IMPLIED
          TEIform            CDATA               'teiHeader'    >

The <encodingDesc> and <profileDesc> elements will repeat only in the case of corpora and collections, and probably not often then; in normal cases, only one of each will appear.

3.1.1 The File Description

The title statement, extent statement, publication statement, and source description are all required. For normal texts, a single source edition must be identified. CICTEI documents will not reflect collation of multiple sources; at least, not without revision of this document. An edition statement and series statement will be supplied whenever appropriate (reissues or revisions of CIC texts, CIC texts created as part of a series). A notes statement will be provided only in exceptional cases.

Within the title statement, CIC texts will invariably give at least one title and at least one author (unknown and anonymous authors will be given as "Unknown" and "Anonymous", respectively). Where multiple titles are given, the first <title> element will be suitable for use by software as a short title or main title for the work. Other statements of responsibility, for editors, for sponsors, funders, and principal investigators of the various projects which create CIC texts, and for other miscellaneous forms of responsibility, will be given in a fixed sequence. [Is this necessary?]

Edition statements will use the <edition> and <respStmt> elements, not the unstructured series of <p> elements; series statements will similarly use the structured, not the unstructured, encoding defined in TEI Lite.

Publication statements will give the publishing information in a rigid sequence; the place of publication, terms of availability, and the date of publication will always be given. CIC texts will normally be given identification numbers by the CIC, and the DTD requires at least one <idno> element (if no identification number applies, the element should be left empty).

The source description will normally take the form of a <biblFull> element; in the case of texts created in electronic form and therefore without a non-electronic source, the source description will take the following form:

 
<sourceDesc>
<p>Created in electronic form.</p>
</sourceDesc>

A standard CIC entity may be declared, to allow this to be abbreviated:

 
<sourceDesc>
&newtext;
</sourceDesc>
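
Such an entity might be declared roughly as follows. This is only a sketch: the name newtext is taken from the example above, but the wording of the declaration and the file in which it would live are assumptions, not settled CIC practice.

```sgml
<!-- General entity expanding to the standard source description
     for texts created in electronic form (assumed declaration) -->
<!ENTITY newtext
"<p>Created in electronic form.</p>" >
```

Because the replacement text is parsed when the entity is referenced, the reference &newtext; inside <sourceDesc> yields an ordinary <p> element.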

Formally:

< 2 Header Contents > =

 
<!ELEMENT fileDesc         - -  (titleStmt, editionStmt?, extent, 
                                publicationStmt, seriesStmt?, 
                                notesStmt?, sourceDesc) >
<!ATTLIST fileDesc           %a.global;
          TEIform            CDATA               'fileDesc'     >
<!ELEMENT titleStmt        - O  (title+, author+, 
                                (editor | sponsor | funder 
                                | principal | respStmt)*) >
<!ATTLIST titleStmt          %a.global;
          TEIform            CDATA               'titleStmt'    >
<!ELEMENT editionStmt      - O  (edition, respStmt*) >
<!ATTLIST editionStmt        %a.global;
          TEIform            CDATA               'editionStmt'  >
<!ELEMENT publicationStmt  - O  ((publisher | distributor | authority), 
                                 (pubPlace, address?, idno+, 
                                 availability, date)+)+>
<!ATTLIST publicationStmt    %a.global;
          TEIform            CDATA               'publicationStmt' >
<!ELEMENT seriesStmt       - O  (title, (idno | respStmt)*)>
<!ATTLIST seriesStmt         %a.global;
          TEIform            CDATA               'seriesStmt'   >
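
As an illustration of these content models, a file description for a text created in electronic form might look like the sketch below. The title, extent, place, and availability wording are all hypothetical placeholders, the type value on <idno> is an assumption, and the <idno> is left empty on the assumption that no CIC number has yet been assigned.

```sgml
<fileDesc>
<titleStmt>
<title>A Sample Text: An Electronic Edition</title>
<author>Anonymous</author>
</titleStmt>
<extent>ca. 450 kilobytes</extent>
<publicationStmt>
<publisher>Committee on Institutional Cooperation</publisher>
<pubPlace>Ann Arbor</pubPlace>
<idno type=CIC></idno>
<availability><p>Terms of availability to be supplied.</p></availability>
<date>1996</date>
</publicationStmt>
<sourceDesc>
<p>Created in electronic form.</p>
</sourceDesc>
</fileDesc>
```

The elements appear in the order the declarations require: title before author, and within the publication statement the publisher, place, identifier, availability, and date in that sequence.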

3.1.2 Encoding Description

The encoding description will be as full as possible, to ensure that scholarly users of CIC texts can readily discover the editorial principles which have governed their creation. In practice, the encoding description in most texts will consist of a series of entity references which expand to standard descriptions of the encoding levels defined elsewhere in this document.

Formally:

< 3 Encoding Description > =

 
<!ELEMENT encodingDesc  - - (projectDesc+, samplingDecl+, 
                            editorialDecl+, tagsDecl?, refsDecl*, 
                            classDecl*, p*) >
<!ATTLIST encodingDesc      %a.global;
          TEIform           CDATA                   'encodingDesc' >

[Need sections in this document, and definitions of levels if applicable, for

sampling declaration
class declaration (or omit entirely?)
various parts of the editorial declaration:
- correction of apparent errors, and retention of correction info in the source
- normalization of spelling, and signaling of normalization in the source
- hyphenation
- segmentation of the text into orthographic sentences or other standard units
- provision of standard values for dates, numbers, etc.

]

3.1.3 Profile Description

3.1.4 Revision Description

The contents of this section will invariably be a <list> containing one <item> for each change or set of changes. The items should be formatted as in the following example:

 
<revisionDesc>
<list>
<item>1996-02-08 : JPW : proofread and revise</item>
<item>1996-02-01 : CMSMcQ : resume drafting, add list of tags
in appendix</item>
<item>1996-01-18 : CMSMcQ : draft on basis of group conversation
of yesterday</item>
</list>
</revisionDesc>

That is:

first the date in ISO form (yyyy-mm-dd)
then a space-colon-space sequence
then the initials of the person(s) making the change
then a space-colon-space sequence
then a description of the change

3.2 Font Shifts

All font shifts will be recorded, either using the <hi> element or using analytic tags like <emph>, <foreign>, etc.

Level hi-1 texts use <hi> in all cases, and do not analyse font shifts within paragraphs (a font shift may still serve as a recognition criterion for chapter breaks, etc.)
Level hi-2 texts analyse the font shifts and tag them as <emph>, <foreign>, <title>, etc., or as <hi> if no reason can be assigned with reasonable certainty.

Emphasis, foreign words, etc. not marked by font shifts in the source are not normally marked in CICTEI texts. If, in an exceptional case or in a text received from external sources, such elements are marked, then all <emph>, <foreign>, <distinct>, <term>, <gloss>, or <mentioned> elements which are not marked in the source in the conventional way (by font shift or by quotation marks) will bear the attribute specification rend=unmarked.
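
The difference between the two levels can be sketched with a single invented sentence. The sample text is hypothetical, and the choice of <title> and <foreign> at level hi-2 simply illustrates the kind of analysis described above.

```sgml
<!-- Level hi-1: every font shift is recorded with <hi> -->
<p>She read <hi rend=italics>Middlemarch</hi> with a certain
<hi rend=italics>Schadenfreude</hi>.</p>

<!-- Level hi-2: font shifts are analysed where a reason can be assigned -->
<p>She read <title rend=italics>Middlemarch</title> with a certain
<foreign rend=italics>Schadenfreude</foreign>.</p>

<!-- Exceptional case: a foreign phrase not set off in the source -->
<p>The committee was formed <foreign rend=unmarked>ad hoc</foreign>.</p>
```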

At level hi-1, the following special types of rendition will be distinguished and identified [list needs checking]:

italics
bold
bold italics
single quotes
double quotes
large type
very large type
small type
unmarked (i.e. same as context)

This list will be revised periodically on the basis of experience, but additions to the list will not be carried back into already-encoded materials: as a result, the list of distinctions current in any document should be described in that document's header.

At level hi-2, further styles may be identified on a document by document basis; they will be documented in the <rendition> elements in the document's TEI header.

(See also the section on paragraphs and paragraph shapes.)

Highlighting at various levels is declared in the TEI header using the entities hi-1 and hi-2, which are declared thus:

< 4 Highlighting levels > =

 
<!ENTITY hi-1 
"Font shifts have uniformly been tagged as highlighting,
not as emphasis, foreign words, etc." >

<!ENTITY hi-2 
"Font shifts have been tagged as highlighting only when it
is not possible to identify them reliably 
as emphasis, foreign words, etc." >

A TEI header for a CICTEI document at level hi-1 might therefore read, in part:

 
<tagUsage gi=hi render=italics>&hi-1;</tagUsage>
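
In fuller context, the render attribute would point at a <rendition> element declared in the same tagging declaration. The following is a sketch under that assumption; the identifier italics and the wording of the rendition description are illustrative only.

```sgml
<tagsDecl>
<!-- The id here is the target of the render attribute below -->
<rendition id=italics>Italic type in the source</rendition>
<tagUsage gi=hi render=italics>&hi-1;</tagUsage>
</tagsDecl>
```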

3.3 Quotations

Quotations may or may not be identified.

Level q-1 texts identify no quotations at all: inline quotations have their quotation marks transcribed and block quotations are tagged as paragraphs with rend='block indent' (or something ...).
Level q-2 texts identify only block quotations.
Level q-3 texts identify quotations using the <q> element, but do not attribute them. Punctuation marks of the original are not retained; they are documented in the header's <rendition> element and the rend attribute.
Level q-4 texts identify the speaker or source using the who attribute.
Level q-5 texts identify the speakers and include a list of speaker codes in the header.

Quotations not marked by punctuation (quotation marks, guillemets, paragraph-initial dashes, etc.) are not recorded in CICTEI texts. If an ambitious analyst has tagged them, their rend attribute will have the value unmarked.
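
The levels for inline quotations might be sketched as follows. The sentence is invented, the rend value follows the list of renditions given under Font Shifts, and the speaker code stranger is a hypothetical value that would need to be documented in the header at level q-5.

```sgml
<!-- q-1: quotation marks transcribed, no <q> element -->
<p>"Come in," said the stranger.</p>

<!-- q-3: quotation tagged; marks dropped and recorded in rend -->
<p><q rend='double quotes'>Come in,</q> said the stranger.</p>

<!-- q-4: as q-3, with the speaker identified -->
<p><q who=stranger rend='double quotes'>Come in,</q> said the stranger.</p>
```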

3.4 Typography

As noted in the section on font shifts, all font shifts will be recorded.

Indentation and paragraph shape may be ignored, partially recorded, or recorded in some detail:

ps-1 ignores paragraph shape; it tags paragraphs and other component-level elements, but says nothing about layout.
ps-2 records paragraph shape with one or more of the following keywords in the rend attribute:
- indented (first line indented, left and right justified)
- block (first line not indented, left and right justified)
- ragright (first line indented, left smooth, ragged right margin)
- ragrightblock (first line not indented, left smooth, ragged right margin)
- ragleft (first line indented, smooth right, ragged left margin)
- ragleftblock (first line not indented, smooth right, ragged left margin)
- centered (each line of the paragraph is centered)
- other for anything too complex to be described with these keywords
The term display is added as a prefix if the element has indented left and right margins, as for a display quote.
ps-3 provides a new keyword for any shape represented in the list above by other; these are described in the TEI header.
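
At level ps-2, the keywords above would appear on individual paragraphs, as in this sketch. The paragraph text is invented, and the display prefix is not shown because its exact written form is not fixed here.

```sgml
<!-- First line indented, both margins justified -->
<p rend=indented>An ordinary paragraph of running prose.</p>

<!-- First line not indented, ragged right margin -->
<p rend=ragrightblock>A paragraph set flush left and ragged right.</p>

<!-- Each line centered -->
<p rend=centered>A centered paragraph, as on a title page.</p>
```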

3.5 Notes

[Notes are an open question; Mark said he would look at TEI Lite with an eye to the concerns Catherine expressed. I don't understand what those concerns were, so I don't know whether they pose a challenge or not. I'd propose the following:]

In CICTEI texts, all <note> elements bear a place attribute. Three encoding levels are distinguished:

n-1 transcribes notes where they appear (as they might come from a scanner) and does not attach them to their point of reference.
n-2 attaches notes to their point of reference with pointers, but does not move the notes.
n-3 transcribes notes at their point of reference.

N.B. These levels do not apply to inline block notes, which are always transcribed, at all levels, as <note place=inline>.
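
One possible reading of the three levels is sketched below for a single footnote. The text is invented, the identifier fn1 is hypothetical, and the use of <ptr> at level n-2 is an assumption, since the level is defined only as using pointers.

```sgml
<!-- n-1: note left where it stands in the source, unattached -->
<p>The treaty was signed in 1648.</p>
<note place=foot n=1>Some authorities give 1649.</note>

<!-- n-2: note left in place; a pointer marks its point of reference -->
<p>The treaty was signed in 1648.<ptr target=fn1></p>
<note place=foot n=1 id=fn1>Some authorities give 1649.</note>

<!-- n-3: note transcribed at its point of reference -->
<p>The treaty was signed in 1648.<note place=foot n=1>Some
authorities give 1649.</note></p>
```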

3.6 Cross References

Cross references may or may not be identified. Three levels:

xr-1 leaves them unrecognized -- no <ptr>, <ref>, <xptr>, or <xref> elements should appear in level xr-1 texts.
xr-2 identifies some (but not necessarily all -- see the <tagUsage> element of the header) cross references, but does not necessarily resolve them.
xr-3 resolves the targets of identified cross references and supplies appropriate target attributes if the reference is in-file, and appropriate doc / from / to attributes for other cross references. If mechanical means have been used to identify cross references, claiming level xr-3 says only that the cross references identified have been resolved; it says nothing about how completely they have been identified.
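
For an in-file reference, the difference between levels xr-2 and xr-3 might look like this sketch. The text and the identifier ch3 are hypothetical; references to other documents, which use the doc / from / to attributes, are not shown.

```sgml
<!-- xr-2: reference recognized but not resolved -->
<p>This point is taken up again (<ref>see Chapter 3</ref>).</p>

<!-- xr-3: reference resolved to the identifier of its target -->
<p>This point is taken up again (<ref target=ch3>see Chapter 3</ref>).</p>

<!-- The target of the resolved reference -->
<div1 id=ch3 n=3><head>Chapter 3</head>
<p>Text of the chapter.</p>
</div1>
```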

3.7 Language Shifts and Foreign-Language Material

Language shifts may or may not be identified. Four encoding levels are distinguished:

lang-1 makes no effort to identify any but the base language. (All CICTEI documents identify their base language and the language of the header.)
lang-2 identifies some language shifts, but only those marked by font shifts, typography, or other detectable element boundaries. Language shifts at these boundaries are detected mechanically, and no claim is made that all language shifts have been recognized successfully.
lang-3 identifies the language of all elements. Manual inspection is used to check the language of all elements; lang=unknown is used if necessary.
lang-4 identifies language shifts even if not marked typographically; <foreign> elements are inserted to bear the lang attribute.
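
A sketch of how these levels might appear in practice follows. In the TEI scheme the lang attribute refers to a <language> element declared in the header, so one is shown here; the identifiers eng and lat and the sample sentence are hypothetical.

```sgml
<!-- In the header: declare the languages used -->
<langUsage>
<language id=eng>English</language>
<language id=lat>Latin</language>
</langUsage>

<!-- lang-2: shift detected mechanically at a font shift -->
<p>He acted <hi rend=italics lang=lat>bona fide</hi> throughout.</p>

<!-- lang-4: shift not marked typographically; <foreign> inserted -->
<p>He acted <foreign rend=unmarked lang=lat>bona fide</foreign>
throughout.</p>
```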

3.8 Page Breaks

Page breaks in the source edition will invariably be marked. If none are present in material acquired from elsewhere, the source edition will be located and pagination will be added to the text on the basis of the source. Page breaks in other editions will not normally be marked; when in exceptional cases a CICTEI text records the pagination of multiple editions, all of those editions will be included in the <sourceDesc> element, with SGML identifiers; these SGML identifiers will be used as the values of the ed attribute of the <pb> element. The edition actually used as the source of the transcription must appear first.

When the <pb> element reflects a page break in the source, its ed attribute may be omitted -- i.e. the default ed value is the id of the first item in the source description.
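
In the text, the two cases might be tagged as in this sketch. The page numbers are invented, and ed1860 is a hypothetical identifier assumed to be attached to the second edition listed in <sourceDesc>.

```sgml
<!-- Page break in the source edition: ed may be omitted -->
<pb n=42>

<!-- Page break in a second edition recorded in the source description -->
<pb n=57 ed=ed1860>
```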

3.9 Canonical References

Canonical reference schemes may or may not be provided, depending on encoding level:

cr-1 gives no canonical reference scheme
cr-2 gives one based on the manifest structure (act/scene, book/chapter/para, etc.)
cr-3 gives one or more canonical reference schemes common in the literature (e.g. Stephanus numbers for Plato). Page numbers of current scholarly editions will not be used as a canonical reference scheme.

3.10 Text Profile

The TEI profile description will always be present to give the main language of the text, and the language of the header; it may or may not give further information:

profile-1 gives only the information required by other rules of CICTEI tagging
profile-2 gives a description of the text in standard terms (to be specified more fully).

3.11 Correction and Normalization

Correction and normalization of spelling will not be performed by CIC encoders. If we acquire and clean data from others, we will spot check their transcriptions and indicate in the header whether we found corrections and normalizations, whether they were marked as such, and how much of the text we spot checked.

3.12 Provision of SGML Identifiers

CIC e-texts may or may not systematically provide IDs for elements to make hypertext linking easier. Several levels are distinguished:

id-1 may or may not include SGML identifiers
id-2 will systematically include SGML identifiers for all component-level elements within the text (paragraphs, lists, notes, etc.) and text divisions. Identifiers for elements in the header may or may not be provided.
id-3 provides SGML identifiers for all elements within the text and header, without exception.
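
The two higher levels might be contrasted as in the sketch below, which shows the same passage encoded twice as alternatives. The identifier scheme (c1, c1p1, and so on) is purely illustrative; no naming convention is prescribed here.

```sgml
<!-- id-2: identifiers on text divisions and component-level elements -->
<div1 id=c1><head>Chapter 1</head>
<p id=c1p1>First paragraph, with <hi rend=italics>emphasis</hi>.</p>
<p id=c1p2>Second paragraph.</p>
</div1>

<!-- id-3: identifiers on every element, phrase-level ones included -->
<div1 id=c1><head id=c1h>Chapter 1</head>
<p id=c1p1>First paragraph, with
<hi id=c1p1h1 rend=italics>emphasis</hi>.</p>
<p id=c1p2>Second paragraph.</p>
</div1>
```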

4 Quality Assurance

This section describes standard levels of quality assurance which may be performed on CIC electronic texts. They may include:

validation of the SGML documents with two parsers (sgmls or nsgmls, plus yasp, are recommended)
completeness checks
pagination checks
proofreading (spot checking and full proofreading)
mechanical and statistical checks on tag usage
checks on the validity of elements marked
checks on the completeness of element recognition

[More to be supplied.]

4.1 Validation

All CICTEI texts will be validated with SGML parsers before being made publicly accessible.

4.2 Proofreading

We will invariably spot-check all transcriptions (proofreading 0.5, 1.0, or 2.0 per cent of pages, and performing other checks to be specified) and the header will record the observed rate of typographic or other errors detected.

If we have other quality assurance checks, the header will also record the rate of errors found in spot checks. (Checks of the full text, which lead to correction of all errors, need not be recorded: the point is to provide some means of guessing the rate of errors still present in the text.)

5 Technical Details

The DTD fragments in this document are part of the extension files needed to modify the TEI main DTD in accordance with the policies described here.

There are two files we need to define. The file cicteix.ent includes modifications to the TEI's SGML entities:

< 5 CIC TEI Entities Modification File >(cicteix.ent) =

 
< Preliminaries for TEI.extensions.ent file 7 > 
< Linking attributes 17 > 
< Highlighting levels 4 > 
< Select TEI elements 9 >

The file cicteix.dtd includes declarations for new and modified SGML elements:

< 6 CIC TEI Elements Modification File >(cicteix.dtd) =

 
< CIC TEI Header 1 > 
< Header Contents 2 > 
< Encoding Description 3 >
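
A document would then invoke the two files through the standard TEI extension mechanism, selecting the tag sets listed in section 2 in its document type declaration subset. The sketch below follows the usual TEI P3 pattern; the system identifiers assume that the DTD and extension files sit in the current directory, which is an assumption about local setup.

```sgml
<!DOCTYPE TEI.2 PUBLIC "-//TEI P3//DTD Main Document Type//EN" "tei2.dtd" [
<!-- Base tag set for prose -->
<!ENTITY % TEI.prose    'INCLUDE' >
<!-- Additional tag sets: linking, simple analysis, tables and figures -->
<!ENTITY % TEI.linking  'INCLUDE' >
<!ENTITY % TEI.analysis 'INCLUDE' >
<!ENTITY % TEI.figures  'INCLUDE' >
<!-- CIC extension files -->
<!ENTITY % TEI.extensions.ent SYSTEM 'cicteix.ent' >
<!ENTITY % TEI.extensions.dtd SYSTEM 'cicteix.dtd' >
]>
```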

The main task of the file cicteix.ent is to specify which TEI elements are to be suppressed.

< 7 Preliminaries for TEI.extensions.ent file > =

 
<!ENTITY % REDEFINE 'IGNORE' >
<!ENTITY % LEVELTWO 'INCLUDE' >
<!ENTITY % x.common 'text |' >

The entities file for the training version is the same; we'll edit it manually to suppress level-two items:

< 8 CIC TEI Entities file (simplified) >(cictei1x.ent) =

 
<!ENTITY % LEVELTWO 'IGNORE' >
< CIC TEI Entities Modification File 5 >

The actual selections are the same in each case; the difference is handled by the differing declarations of LEVELTWO as INCLUDE or IGNORE.
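
The mechanism relies on two SGML rules: the first declaration of a parameter entity is the one that takes effect, and a marked section whose keyword expands to IGNORE is skipped. A simplified sketch follows; the content model shown for <emph> is a stand-in, not the actual TEI declaration.

```sgml
<!-- Extensions file, read first: this declaration of LEVELTWO wins
     over any later one -->
<!ENTITY % LEVELTWO 'IGNORE' >
<!ENTITY % emph     '%LEVELTWO;' >

<!-- Main DTD (simplified): the element declaration is skipped
     when emph expands to IGNORE -->
<![ %emph; [
<!ELEMENT emph - - (#PCDATA) >
]]>
```

This is why the training file can declare LEVELTWO as IGNORE and then reuse the full set of selections unchanged.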

< 9 Select TEI elements > =

 
< Select tags from TEI driver file 10 >

<!-- ******************************************************** -->
<!-- I.  Core tag sets.                                       -->
<!-- ******************************************************** -->

<!-- Chapter 5:  TEI Header ********************************* -->
< Select tags from TEI header 11 >
<!-- Chapter 6:  Elements Available in All TEI Documents **** -->
< Select tags from TEI core tag set 12 >
<!-- Chapter 7:  Default Text Structure ********************* -->
< Select tags from default text structure 13 >

<!-- ******************************************************** -->
<!-- II.  Base tag sets.                                      -->
<!-- II.A.  DTD files                                         -->
<!-- ******************************************************** -->

<!-- Chapter 8:  Prose * (included) ************************* -->
<!-- File:  TEIPROS2.DTD (no tags) ************************** -->
<!-- Chapter 9:  Verse * (excluded) ************************* -->
<!-- Chapter 10:  Drama * (excluded) ************************ -->
<!-- Chapter 11:  Transcriptions of Speech * (excluded) ***** -->
<!-- Chapter 12:  Print Dictionaries * (excluded) *********** -->
<!-- Chapter 13:  Terminological Data * (excluded) ********** -->
<!-- * Mixed Bases * (excluded) ***************************** -->

<!-- ******************************************************** -->
<!-- III.  Additional tag sets.                               -->
<!-- ******************************************************** -->

<!-- Chapter 14:  Linking, Segmentation, and Alignment ****** -->
< Select tags from tag set for linking and alignment 14 >
<!-- Chapter 15:  Simple Analytic Mechanisms **************** -->
< Select tags from tag set for simple analysis 15 >
<!-- Chapter 16:  Feature Structures * (excluded) *********** -->
<!-- Chapter 17:  Certainty and Responsibility * (excluded) * -->
<!-- Chapter 18:  Transcription of Primary Sources * (excl) * -->
<!-- Chapter 19:  Critical Apparatus * (excluded) *********** -->
<!-- Chapter 20:  Names and Dates * (excluded) ************** -->
<!-- Chapter 21:  Graphs, Networks, and Trees * (excluded) ** -->
<!-- Chapter 22:  Tables, Formulae, and Graphics ************ -->
< Select tags from tag set for tables and figures 16 >
<!-- Chapter 23:  Language Corpora * (excluded) ************* -->
<!-- Chapter 27:  Tag Set Documentation ********************* -->

In the main TEI driver file, we select only the <tei.2> element, suppressing <teiCorpus.2>:

< 10 Select tags from TEI driver file > =

 
<!-- FILE:  TEI2.DTD -->
<!ENTITY % TEI.2        'INCLUDE' >
<!ENTITY % teiCorpus.2  'IGNORE' >

In the header, the selections are as follows:

< 11 Select tags from TEI header > =

 
<!-- File:  TEIHDR2.DTD -->
<!ENTITY % teiHeader    '%REDEFINE;' >
<!ENTITY % fileDesc     '%REDEFINE;' >
<!ENTITY % titleStmt    '%REDEFINE;' >
<!ENTITY % sponsor      'INCLUDE' -- ? -- >
<!ENTITY % funder       'INCLUDE' -- ? -- >
<!ENTITY % principal    'INCLUDE' -- ? -- >
<!ENTITY % editionStmt  '%REDEFINE;' -- ? -- >
<!ENTITY % edition      'INCLUDE' -- ? -- >
<!ENTITY % extent       'INCLUDE' -- ? -- >
<!ENTITY % publicationStmt '%REDEFINE;' >
<!ENTITY % distributor  'INCLUDE' >
<!ENTITY % authority    'INCLUDE' >
<!ENTITY % idno         'INCLUDE' >
<!ENTITY % availability 'INCLUDE' -- ? -- >
<!ENTITY % seriesStmt   '%REDEFINE;' >
<!ENTITY % notesStmt    'INCLUDE' >
<!ENTITY % sourceDesc   'INCLUDE' >
<!ENTITY % scriptStmt                  'IGNORE' >
<!ENTITY % recordingStmt               'IGNORE' >
<!ENTITY % recording                   'IGNORE' >
<!ENTITY % equipment                   'IGNORE' >
<!ENTITY % broadcast                   'IGNORE' >
<!ENTITY % encodingDesc  'INCLUDE' >
<!ENTITY % projectDesc   'INCLUDE' >
<!ENTITY % samplingDecl  'INCLUDE' >
<!ENTITY % editorialDecl 'INCLUDE' >
<!ENTITY % correction                  'IGNORE' -- ? -- >
<!ENTITY % normalization               'IGNORE' -- ? -- >
<!ENTITY % quotation                   'IGNORE' -- ? -- >
<!ENTITY % hyphenation                 'IGNORE' -- ? -- >
<!ENTITY % segmentation                'IGNORE' -- ? -- >
<!ENTITY % stdVals                     'IGNORE' -- ? -- >
<!ENTITY % interpretation              'IGNORE' -- ? -- >
<!ENTITY % tagsDecl      '%LEVELTWO;' >
<!ENTITY % tagUsage      '%LEVELTWO;' >
<!ENTITY % rendition     '%LEVELTWO;' >
<!ENTITY % refsDecl      '%LEVELTWO;' >
<!ENTITY % step                        'IGNORE' -- ? -- >
<!ENTITY % state                       'IGNORE' >
<!ENTITY % classDecl     '%LEVELTWO;' >
<!ENTITY % taxonomy      '%LEVELTWO;' >
<!ENTITY % category      '%LEVELTWO;' >
<!ENTITY % catDesc       '%LEVELTWO;' >
<!ENTITY % fsdDecl                     'IGNORE' >
<!ENTITY % metDecl                     'IGNORE' >
<!ENTITY % symbol                      'IGNORE' >
<!ENTITY % variantEncoding             'IGNORE' >
<!ENTITY % profileDesc  'INCLUDE' >
<!ENTITY % creation     '%LEVELTWO;' >
<!ENTITY % langUsage    'INCLUDE' >
<!ENTITY % language     'INCLUDE' >
<!ENTITY % textClass    '%LEVELTWO;' >
<!ENTITY % keywords     '%LEVELTWO;' >
<!ENTITY % classCode    '%LEVELTWO;' >
<!ENTITY % catRef       '%LEVELTWO;' >
<!ENTITY % revisionDesc 'INCLUDE' >
<!ENTITY % change       '%LEVELTWO;' >

In the TEI core, the selections are as follows:

< 12 Select tags from TEI core tag set > =

 
<!-- File:  TEICORE2.DTD -->
<!ENTITY % p            'INCLUDE' >
<!ENTITY % foreign      '%LEVELTWO;' >
<!ENTITY % emph         '%LEVELTWO;' >
<!ENTITY % hi           'INCLUDE' >
<!ENTITY % distinct     '%LEVELTWO;' >
<!ENTITY % q            'INCLUDE' >
<!ENTITY % quote        '%LEVELTWO;' >
<!ENTITY % cit          '%LEVELTWO;' >
<!ENTITY % soCalled     '%LEVELTWO;' >
<!ENTITY % term         '%LEVELTWO;' >
<!ENTITY % mentioned    '%LEVELTWO;' >
<!ENTITY % gloss        '%LEVELTWO;' >
<!ENTITY % name         'INCLUDE' >
<!ENTITY % rs           '%LEVELTWO;' >
<!ENTITY % num          '%LEVELTWO;' >
<!ENTITY % measure                     'IGNORE' >
<!ENTITY % date         'INCLUDE' >
<!ENTITY % dateRange                   'IGNORE' >
<!ENTITY % time         '%LEVELTWO;' >
<!ENTITY % timeRange                   'IGNORE' >
<!ENTITY % abbr         '%LEVELTWO;' >
<!ENTITY % expan                       'IGNORE' -- ? -- >
<!ENTITY % sic          '%LEVELTWO;' >
<!ENTITY % corr         '%LEVELTWO;' >
<!ENTITY % reg          '%LEVELTWO;' -- ? -- >
<!ENTITY % orig         '%LEVELTWO;' -- ? -- >
<!ENTITY % gap          'INCLUDE' >
<!ENTITY % add          '%LEVELTWO;' -- ? -- >
<!ENTITY % del          '%LEVELTWO;' -- ? -- >
<!ENTITY % unclear      '%LEVELTWO;' >
<!ENTITY % address      'INCLUDE' >
<!ENTITY % addrLine     'INCLUDE' >
<!ENTITY % street                      'IGNORE' >
<!ENTITY % postCode                    'IGNORE' >
<!ENTITY % postBox                     'IGNORE' >
<!ENTITY % ptr          'INCLUDE' >
<!ENTITY % ref          'INCLUDE' >
<!ENTITY % list         'INCLUDE' >
<!ENTITY % item         'INCLUDE' >
<!ENTITY % label        'INCLUDE' >
<!ENTITY % head         'INCLUDE' >
<!ENTITY % headLabel                   'IGNORE' >
<!ENTITY % headItem                    'IGNORE' >
<!ENTITY % note         'INCLUDE' >
<!ENTITY % index        '%LEVELTWO;' >
<!ENTITY % divGen       '%LEVELTWO;' >
<!ENTITY % milestone    '%LEVELTWO;' >
<!ENTITY % pb           'INCLUDE' >
<!ENTITY % lb           '%LEVELTWO;' >
<!ENTITY % cb                          'IGNORE' >
<!ENTITY % bibl                                    '%REDEFINE'>
<!ENTITY % biblStruct                  'IGNORE' >
<!ENTITY % biblFull     'INCLUDE' >
<!ENTITY % listBibl     'INCLUDE' >
<!ENTITY % analytic                    'IGNORE' >
<!ENTITY % monogr                      'IGNORE' >
<!ENTITY % series                      'IGNORE' >
<!ENTITY % author       'INCLUDE' >
<!ENTITY % editor       'INCLUDE' >
<!ENTITY % respStmt     'INCLUDE' >
<!ENTITY % resp         'INCLUDE' >
<!ENTITY % title        'INCLUDE' >
<!ENTITY % meeting                     'IGNORE' -- ? -- >
<!ENTITY % imprint      'INCLUDE' >
<!ENTITY % publisher    'INCLUDE' >
<!ENTITY % biblScope    'INCLUDE' >
<!ENTITY % pubPlace     'INCLUDE' >
<!ENTITY % l            'INCLUDE' >
<!ENTITY % lg           'INCLUDE' >
<!ENTITY % sp           'INCLUDE' >
<!ENTITY % speaker      'INCLUDE' >
<!ENTITY % stage        'INCLUDE' >

< 13 Select tags from default text structure > =

 
<!-- File:  TEISTR2.DTD -->
<!ENTITY % text         'INCLUDE' >
<!ENTITY % body         'INCLUDE' >
<!ENTITY % group        '%LEVELTWO;' >
<!ENTITY % div          '%LEVELTWO;' >
<!ENTITY % div0         'INCLUDE' >
<!ENTITY % div1         'INCLUDE' >
<!ENTITY % div2         'INCLUDE' >
<!ENTITY % div3         'INCLUDE' >
<!ENTITY % div4         'INCLUDE' >
<!ENTITY % div5         'INCLUDE' >
<!ENTITY % div6         'INCLUDE' >
<!ENTITY % div7         'INCLUDE' >
<!ENTITY % trailer      'INCLUDE' >
<!ENTITY % byline       'INCLUDE' >
<!ENTITY % dateline                                '%REDEFINE' >
<!ENTITY % argument     '%LEVELTWO;' >
<!ENTITY % epigraph     '%LEVELTWO;' >
<!ENTITY % opener       'INCLUDE' >
<!ENTITY % closer       'INCLUDE' >
<!ENTITY % salute       'INCLUDE' >
<!ENTITY % signed       'INCLUDE' >

<!-- File:  TEIFRON2.DTD -->
<!ENTITY % front        'INCLUDE' >
<!ENTITY % titlePage    'INCLUDE' >
<!ENTITY % docTitle     'INCLUDE' >
<!ENTITY % titlePart    'INCLUDE' >
<!ENTITY % docAuthor    'INCLUDE' >
<!ENTITY % imprimatur                  'IGNORE' -- ? -- >
<!ENTITY % docEdition   'INCLUDE' >
<!ENTITY % docImprint   'INCLUDE' >
<!ENTITY % docDate      'INCLUDE' >

<!-- File:  TEIBACK2.DTD -->
<!ENTITY % back         'INCLUDE' >

< 14 Select tags from tag set for linking and alignment > =

 
<!-- File:  TEILINK2.ENT -->
<!-- File:  TEILINK2.DTD -->
<!ENTITY % link                        'IGNORE' -- ? -- >
<!ENTITY % linkGrp                     'IGNORE' -- ? -- >
<!ENTITY % xref         '%LEVELTWO;' >
<!ENTITY % xptr         '%LEVELTWO;' >
<!ENTITY % seg          'INCLUDE' >
<!ENTITY % anchor       'INCLUDE' >
<!ENTITY % when                        'IGNORE' >
<!ENTITY % timeline                    'IGNORE' >
<!ENTITY % join                        'IGNORE' >
<!ENTITY % joinGrp                     'IGNORE' >
<!ENTITY % alt                         'IGNORE' >
<!ENTITY % altGrp                      'IGNORE' >

< 15 Select tags from tag set for simple analysis > =

 
<!-- File:  TEIANA2.ENT -->
<!-- File:  TEIANA2.DTD -->
<!ENTITY % span                        'IGNORE' -- ? -- >
<!ENTITY % spanGrp                     'IGNORE' >
<!ENTITY % interp       '%LEVELTWO;' >
<!ENTITY % interpGrp    '%LEVELTWO;' >
<!ENTITY % s            '%LEVELTWO;' >
<!ENTITY % cl                          'IGNORE' >
<!ENTITY % phr                         'IGNORE' >
<!ENTITY % w                           'IGNORE' >
<!ENTITY % m                           'IGNORE' >
<!ENTITY % c                           'IGNORE' >

< 16 Select tags from tag set for tables and figures > =

 
<!-- File:  TEIFIG2.ENT -->
<!ENTITY % formulaNotations 'CDATA'                             >
<!ENTITY % formulaContent 'CDATA'                               >

<!-- File:  TEIFIG2.DTD -->
<!ENTITY % table        'INCLUDE' >
<!ENTITY % row          'INCLUDE' >
<!ENTITY % cell         'INCLUDE' >
<!ENTITY % formula      'INCLUDE' >
<!ENTITY % figure       'INCLUDE' >
<!ENTITY % figDesc      'INCLUDE' >
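
The selections above work through SGML marked sections: as in the TEI P3 DTD files, each element declaration is wrapped in a marked section whose keyword is the parameter entity named after the element, so a value of INCLUDE keeps the declaration and IGNORE suppresses it. The fragment below is a minimal sketch of that mechanism, not part of the CICTEI DTD. The declaration of LEVELTWO and the simplified content models are assumptions made for illustration; only the entity names p and emph are taken from the scraps above.

```sgml
<!-- Assumed: the encoding level is chosen once, before the selections are read. -->
<!-- 'IGNORE' gives a DTD without the level-two tags; 'INCLUDE' adds them.       -->
<!ENTITY % LEVELTWO 'IGNORE' >

<!-- Two selections in the style of the scraps above. -->
<!ENTITY % p        'INCLUDE' >
<!ENTITY % emph     '%LEVELTWO;' >

<!-- Each declaration sits in a marked section keyed on its entity. -->
<!-- The content models are simplified, not the real TEI ones.      -->
<![ %p; [
<!ELEMENT p    - O (#PCDATA) >
]]>
<![ %emph; [
<!ELEMENT emph - - (#PCDATA) >
]]>
```

With LEVELTWO set to IGNORE, the second marked section is skipped and emph is not declared; changing the single LEVELTWO declaration to INCLUDE restores every level-two element at once. Because SGML uses the first declaration it reads for a given entity, selections like these must be read before the default declarations in the TEI DTD files.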

6 Other Documents

We need a full, formal specification of the rules just stated, and whatever else we agree on. It will be easiest for me if this is basically a set of rules for using TEI Lite -- i.e. if it doesn't repeat anything in the TEI Lite documentation. Working title: "Rules for Use of TEI Lite in CIC E-Text Projects".

We could use, but don't absolutely require, a modified version of the TEI Lite specification which incorporates these rules. To make such a document, we'll need copyright permission from the TEI, which I'll ask for if people are eager to have a single unified document describing CICTEI.

We may need SGML and CICTEI tutorials for training staff. I believe this is not part of our current task.

Summary of Tag Usage

This appendix specifies for each tag set of the TEI, and for each tag in selected tag sets, whether it is:

not used (i.e. we don't knowingly include it in material we produce; on the other hand, we don't promise to take it out of material we rework)
used in all cases which meet certain recognition criteria (which will be specified)
used if the encoding is labeled as being at a certain level and certain recognition criteria are met

Tag Sets

The following tag sets are included in TEI Lite and therefore in the CICTEI tag set:

the core tag set
the TEI header
the base tag set for prose
the

The following tag sets, therefore, are not selected:

verse
drama
speech
printed dictionaries
terminological data
feature structures
certainty and responsibility
transcription of primary sources (manuscripts)
critical apparatus
names and dates
graphs, networks, and trees
language corpora

Tag Sets and their Tags

Document elements

Used in all cases:

<tei.2>

Not used (omitted from TEI Lite). [Restore for CIC?]

<teiCorpus.2>

Header

Used in all CICTEI texts:

<teiHeader>
<fileDesc>
<titleStmt>
<publicationStmt>
<sourceDesc>
<encodingDesc>
<projectDesc>
<samplingDecl>
<editorialDecl>
<tagsDecl>
<tagUsage>
<rendition>
<refsDecl>
<creation>
<langUsage>
<language>
<revisionDesc>
<change>

N.B. many of these will be standard elements which describe the markup practices in various levels of encoding. They will not all require much manual intervention.
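
As a rough illustration of how the always-used header elements fit together, here is a skeleton header. It is a sketch only: every piece of text in it is a placeholder, the id values are invented, and the element order follows the TEI P3 header as I understand it rather than a validation against the CICTEI DTD. Note that creation and langUsage are children of profileDesc, so that wrapper appears here even though the lists classify it as used when applicable.

```sgml
<teiHeader>
 <fileDesc>
  <titleStmt>
   <title>Placeholder Title: an electronic edition</title>
  </titleStmt>
  <publicationStmt><p>Placeholder publication information.</p></publicationStmt>
  <sourceDesc><p>Placeholder description of the source text.</p></sourceDesc>
 </fileDesc>
 <encodingDesc>
  <projectDesc><p>Placeholder statement of the project's aims.</p></projectDesc>
  <samplingDecl><p>Placeholder statement of what was transcribed.</p></samplingDecl>
  <editorialDecl><p>Placeholder description of editorial practice.</p></editorialDecl>
  <tagsDecl>
   <rendition id="it">Placeholder: italic type</rendition>
   <tagUsage gi="hi" render="it">Placeholder note on the use of hi.</tagUsage>
  </tagsDecl>
  <refsDecl><p>Placeholder description of the reference system.</p></refsDecl>
 </encodingDesc>
 <profileDesc>
  <creation>Placeholder note on the creation of the text.</creation>
  <langUsage><language id="en">English</language></langUsage>
 </profileDesc>
 <revisionDesc>
  <change>
   <date>Placeholder date</date>
   <respStmt><name>Placeholder name</name><resp>encoder</resp></respStmt>
   <item>Placeholder description of the change.</item>
  </change>
 </revisionDesc>
</teiHeader>
```

Most of the encodingDesc content would presumably be boilerplate shared by all texts at a given encoding level, which is why little manual work is expected there.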

Used when applicable [needs further specification for each element type]:

<sponsor>
<funder>
<principal>
<editionStmt>
<edition>
<extent> (? always?)
<distributor>
<authority>
<idno>
<availability>
<seriesStmt>
<notesStmt>
<classDecl>
<taxonomy>
<category>
<catDesc>
<profileDesc>
<textClass>
<keywords>
<classCode>
<catRef>

Not used (omitted from TEI Lite):

<scriptStmt>
<recordingStmt>
<recording>
<equipment>
<broadcast>
<correction>
<normalization>
<quotation>
<hyphenation>
<segmentation>
<stdVals>
<interpretation>
<step>
<state>
<fsdDecl>
<metDecl>
<symbol>
<variantEncoding>

Core Tag Set

Used:[1]

<p>
<foreign>
<emph>
<hi>
<q>
<cit>
<soCalled>
<term>
<mentioned>
<gloss>
<name>
<rs>
<num>
<date>
<time>
<abbr>
<sic>
<corr>
<reg>
<orig>
<gap>
<add>
<del>
<unclear>
<address>
<addrLine>
<ptr>
<ref>
<list>
<item>
<label>
<head>
<note>
<index>
<divGen>
<milestone>
<pb>
<lb>
<bibl>
<biblFull>
<listBibl>
<author>
<editor>
<respStmt>
<resp>
<title>
<imprint>
<publisher>
<biblScope>
<pubPlace>
<l>
<lg>
<sp>
<speaker>
<stage>

Not used (omitted from TEI Lite):

<distinct>
<quote>
<measure>
<dateRange>
<timeRange>
<expan>
<street>
<postCode>
<postBox>
<headLabel>
<headItem>
<cb>
<biblStruct>
<analytic>
<monogr>
<series>
<meeting>

Text Structure

Used:

<text>
<body>
<group>
<div>
<div0>
<div1>
<div2>
<div3>
<div4>
<div5>
<div6>
<div7>
<trailer>
<byline>
<dateline>
<argument>
<epigraph>
<opener>
<closer>
<salute>
<signed>

Front and Back Matter

<front>
<titlePage>
<docTitle>
<titlePart>
<docAuthor>
<docEdition>
<docImprint>
<docDate>
<back>

Not used (omitted from TEI Lite):

<imprimatur>

Linking and Alignment

<xref>
<xptr>
<seg>
<anchor>

Not used (omitted from TEI Lite):

<link>
<linkGrp>
<when>
<timeline>
<join>
<joinGrp>
<alt>
<altGrp>

Simple Analytic Mechanisms

<interp>
<interpGrp>
<s>

Not used (omitted from TEI Lite):

<span>
<spanGrp>
<cl>
<phr>
<w>
<m>
<c>

Tables, Formulae, and Graphics

<table>
<row>
<cell>
<formula>
<figure>
<figDesc>

Tag Set Documentation

TEI Lite integrates several elements from this auxiliary tag set, so they can be used in prose:

<eg>
<code>
<ident>
<kw>
<gi>

Additional Elements

The CICTEI tag set extends the TEI element-class system as follows:

Elements <ident>, <code>, and <kw> are added to class data
Element <eg> is added to classes inter and common
Element <divGen> is added to class front (to fix an oversight in TEI P3)

All of these changes are the same as in TEI Lite.
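
The class changes listed above could be expressed with the TEI P3 extension entities, in which a class's member list is prefixed by an entity whose name is x. plus the class name. The fragment below is a sketch under that assumption; the entity names are inferred from the class names given above and have not been checked against the CICTEI or TEI Lite DTD files.

```sgml
<!-- Assumed entity names: "x." plus the class name, following TEI P3.      -->
<!-- The trailing vertical bar lets each value be prefixed to the existing  -->
<!-- member list of its class.                                              -->
<!ENTITY % x.data   'ident | code | kw |' >
<!ENTITY % x.inter  'eg |' >
<!ENTITY % x.common 'eg |' >
<!ENTITY % x.front  'divGen |' >
```

Declarations of this kind only add the elements to content models; the elements themselves still need their own element and attribute-list declarations.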

The global attribute class linking is modified; it uses the following declaration instead of the standard one:

< 17 Linking attributes > =

 
<!ENTITY % a.linking '
          corresp            IDREFS              #IMPLIED
          next               IDREF               #IMPLIED
          prev               IDREF               #IMPLIED'      >
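
A small, invented instance fragment may help show how the three attributes in this declaration would appear in a text. corresp takes a list of identifiers (IDREFS), while next and prev each take a single identifier (IDREF). The choice of elements (seg, note), the id values, and the text are all assumptions for illustration.

```sgml
<!-- Illustrative instance fragment; all id values and text are invented. -->
<p>The argument <seg id="s1" next="s2">begins here</seg>, is interrupted,
and <seg id="s2" prev="s1">resumes here</seg>.
<note id="n1" corresp="s1 s2">A note on both halves of the passage.</note></p>
```

Here next and prev chain the two segments into one discontinuous passage, and corresp on the note points at both of them.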

Open Issues

The following issues need to be decided before this document is final:

whether encoding levels should be global or topic-by-topic
whether the list of highlighting styles is complete enough and simple enough
whether to recognize unusual paragraph formatting in more detail
how to distinguish source-text page breaks from page breaks in other editions also noted in the electronic text
what information to put into the profile description
what percentage of a text to proofread in spot checking

Notes

[1] Recognition criteria and so on need to be specified for each of these. Probably the best approach is to group them into classes of elements always used, never used, used under certain (specified) conditions, ...