To be amenable to computer processing, texts must be properly encoded. The user must know 1) what the text is, 2) how the various textual features are encoded, and 3) whether there are extra-textual, interpretive features and, if so, how they are encoded. Standards or, at least, some basic guidelines will simplify the task both for producers and users of machine-readable texts. The three points above correspond to three of the working committees of the Text Encoding Initiative. This committee addresses the problem of text representation, i.e. the second point.
Conventional printed texts use standard alphabets and typographical conventions to structure the text. In devising guidelines for machine-readable texts, it is natural to take conventions from printed texts as a starting-point and suggest ways of expressing typographical distinctions in machine-readable texts. The committee on text representation will handle features for which there are accepted typographical conventions. Topics within the field of this committee include the marking or encoding of:
Although the encoding will in some respects lose information compared with the printed text (e.g. as regards physical characteristics), in others it may well go beyond it. For example, provision may be made for disambiguating features such as: capitalisation marking names vs sentence openings, full stop marking abbreviations vs ends of sentences, apostrophe vs end-of-quote, and italics marking emphasis vs foreign words or expressions.
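The kind of disambiguation described above can be illustrated with a minimal sketch. It assumes an SGML-style notation with hypothetical tag names (name, emph, foreign, abbr); these are chosen for illustration only and are not a tag set proposed by the committee. Print would show both the emphasised word and the foreign phrase in italics, and would not distinguish the full stop in the abbreviation from the one ending the sentence, whereas the encoded form keeps them apart.

```python
import re

# Hypothetical SGML-style tags, invented for this illustration.
ENCODED = (
    "<name>Smith</name> said it was <emph>very</emph> much "
    "<foreign>de rigueur</foreign> to cite <abbr>Dr.</abbr> Jones."
)

# Matches a simple, non-nested element such as <emph>very</emph>.
TAG = re.compile(r"<(\w+)>(.*?)</\1>")


def features(text):
    """Return (tag, content) pairs in document order."""
    return [(m.group(1), m.group(2)) for m in TAG.finditer(text)]


def plain(text):
    """Strip the tags, leaving the wording as a printed page would give it."""
    return re.sub(r"</?\w+>", "", text)


def italic_spans(text):
    """Spans that print would typically set in italics, kept apart by their tags."""
    return [(tag, content) for tag, content in features(text)
            if tag in ("emph", "foreign")]


if __name__ == "__main__":
    print(plain(ENCODED))
    for tag, content in features(ENCODED):
        print(f"{tag}: {content}")
```

The sketch handles only flat, non-nested elements; a real encoding scheme would need a proper parser rather than a regular expression.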
The suggested guidelines for text representation should ultimately be able to handle texts originally produced in machine-readable form as well as machine-readable versions of printed texts and unprinted texts (such as letters and diaries). The special problems of spoken texts (with the exception of the International Phonetic Alphabet, IPA, which will be treated as a character set) and of dictionaries will for pragmatic reasons be taken up in the Committee on Text Analysis and Interpretation.
In devising coding conventions it is essential to study existing schemes and attempt to discover the consensus of the textual computing community. Existing standards will be honoured wherever possible. It is expected that the suggested guidelines will conform to the Standard Generalized Markup Language (SGML) defined by the international standard ISO 8879, unless the needs of textual research make it impossible to conform strictly to SGML.
In a second document I will come back to practical matters connected with the work of the committee (division of work, meetings, timetable, financial arrangements).
Oslo