=========================================================================
Date: 29 January 1991 15:21:26 CST
From: "Michael Sperberg-McQueen 312 996-2477 -2981" <U35395@UICVM>
Comment: "ACH / ACL / ALLC Text Encoding Initiative"
To:   "Paul A. Fortier 204 474-9841" <FORTIER@UOFMCC>
cc:   "Lou Burnard +44 (865) 273 238" <LOU@VAX.OX.AC.UK>
Subject: preliminary notes on your critique

First of all, many many thanks for your paper.  It gives me great
pleasure to see someone reading the guidelines with so much care and
concern, and I appreciate the time that has gone into it.  I believe we
have some apparent disagreements which will turn out to be without
substance, and some which will turn out to reflect basic differences of
opinion.  But I enjoyed your paper a lot all the same.

I'll discuss individual points below (and possibly in a following note
if I run out of time writing this today); before I do, some recurrent
themes should probably be taken care of once instead of many times.

1.  At various points, I agree with what you interpret the current draft
as saying, and disagree over whether that's what it should say.  At
others, I disagree with your paraphrase of the draft and will say below
'How can you say that the draft says this?'  I should say in advance and
globally that I believe each passage which occasions either such comment
is ipso facto in need of revision.  The first set, because the draft
should either be changed to say something else, or be made more
persuasive.  The second set, because if a passage has appeared unclear
or misleading to you, it clearly needs to be reformulated, even if the
reformulation does not change the substance of the utterance.

2.  You find many failings in the current draft, viewed as a guide to
the non-technical computer user.  I will not address all of these
individually, because what astonishes me is that you find so few, given
that the current draft is not intended at all as a guide for
non-technical users.  As the preface says, 'It is clear that the
Guidelines cannot achieve a wide audience without some kind of informal
introductory guide, but such a guide is not yet available.'  (It was
punchier in a draft:  'but this is emphatically not it.')  Suffice it to
say that I am concentrating for the moment on what the recommendations
should be, and not yet on finding formulations for them which will not
mislead the naive reader.  It is not too early to begin worrying about
the non-technical users, but I think it will be useful to distinguish
the substantive discussion over what practices should be recommended,
and the pedagogical discussion over what rhetorical strategies should be
marshalled in the exposition of those practices.

The draft in our hands now is intended as a first step towards one part
of what will require at least three parts.  (1) A reference manual which
lays things out in a form suitable for reference, without pedagogical
structure (parts of the appendices will go there; the body of the
reference manual was not ready for inclusion in the current draft).
(2) A discursive presentation of the material, with a certain rhetorical
coherence and topical organization, at a sometimes rather technical
level.  (3) An introductory guide which covers only the core issues and
assumes much less computer sophistication than the other two documents
(parts of the SGML tutorial would go there).

What we have is a rough draft of document 2.  It is not (and, I submit,
probably should not be) an introduction to computer use for the
non-computer-using humanist, nor an introduction to the use of word
processors and editors, nor a programming text, nor an introduction to
formal language theory, nor a guide to the current state of the SGML
marketplace, nor the larger text-handling software marketplace.  (Nor,
for that matter, an introduction to analytic bibliography, literary
study, or linguistics, for those in need of such introduction.) Such
texts need to be separate documents.  What I am learning is that they
are needed now, already, by people we had not thought would need them.
But they'll be done (for now, at least) as separate documents, not as
additions to this one.

3.  I think a lot of difference in our points of view reflects somewhat
different assumptions about how texts are or will be prepared and used.
Both of us are awake enough to know that text preparation and use vary
widely among the community we are trying to serve, but I think we make
different assumptions about the 'normal' case -- defined, if in no other
way, as what we think about when we try offhand to see how some tag or
other will work in practice.

You assume rather explicitly I think (correct me where I err):

    - an unskilled coder (typically an undergraduate) who prepares the
         text for processing
    - a skilled interpreter / scholar who processes the text but does
         not typically change its coding or tagging
    - a single process of 'encoding' the text which produces the
         'encoded text' and which has a distinct end point (or at
         most a very small number of revision processes, each pretty
         much discrete, finite, and clearly punctuated)
    - a single copy of the text, processed perhaps in various ways
         but (always?) using the same view of the text (same text,
         same tags)
    - (therefore) a strong case for keeping the tagging non-controversial
         and non-subjective

I assume (also rather explicitly, if you read my forthcoming paper in
L&LC):

    - a coder of any level of skill you like to imagine (low, high, or
         very low, e.g. a scanner)
    - an interpreter who interacts with the text and whose work on the
         text may (often does) lead to changes in the encoding
    - an ongoing (not a finite) process of encoding the text, leading
         incrementally from an initial encoding to more and more
         elaborate encodings recording not only the input but also the
         output of analysis and interpretation
    - a single copy of the text, processed in widely varying ways
         and providing (through the use of filters) multiple views
         of the text (same or different base text, different taggings)
    - (therefore) a strong case for making the tag set as expressive
         and complete as possible -- one's new work should be able to
         build on one's old work or that of others, and therefore the
         results of one's work must be expressible in the markup

Consider the comments of respondent 36, who envisages multiple levels of
analysis and commentary on the compound 'winter-cearu' in Anglo-Saxon.
Consider the requirements of someone who wants to encode a text so as to
be able to find, later, all the words of a certain semantic field and
who must therefore be able to gloss words in a systematic way (and who
may wish to remain agnostic as to the correct resolution of some
ambiguities in a text).  This is not work which an undergraduate
encoding the text will do.  But to be useful for the later processing
which is intended, such matter must be *encoded* as part of the encoded
text.  This is the most difficult problem for the scheme, from a
technical point of view:  ensuring that the scheme can carry as much
information as we are willing and able to give it, and that information
given for one purpose does not interfere with information added for
another (we may want text-critical variants, for whatever reason, but
wish for a while to ignore them and process the metrical characteristics
of the base text alone).
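
(A minimal sketch of the filtering idea, in Python.  The tag names
'var', 'gloss', and 'l', the attributes, and the sample line are all
invented for illustration and are not taken from the draft; a real
filter would need a proper SGML parser, and this one does not handle an
element nested inside another of the same name.)

```python
import re

def drop_elements(text, names):
    """Remove the named elements together with their content."""
    for name in names:
        text = re.sub(rf"<{name}\b[^>]*>.*?</{name}>", "", text,
                      flags=re.DOTALL)
    return text

def strip_tags(text, names):
    """Remove only the tags of the named elements, keeping their content."""
    for name in names:
        text = re.sub(rf"</?{name}\b[^>]*>", "", text)
    return text

# One stored copy of the text, carrying several layers of (hypothetical)
# markup: a metrical line, a semantic gloss, and a variant reading.
encoded = ('<l>first half <gloss field="cold">winter</gloss>'
           '<var wit="B">summer</var> second half</l>')

# A view for metrical work: variants dropped entirely, glosses unwrapped.
metrical_view = strip_tags(drop_elements(encoded, ["var"]), ["gloss"])
```

The stored text is never altered; each view is computed from it on
demand, so information added for one purpose need not get in the way of
processing done for another.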

The current draft clearly gives far too many people the impression (far
too strongly) that everything it mentions is supposed to be tagged by
everyone, all the time.  Since it does not actually say that, ever, and
actually says the reverse from time to time, we are clearly dealing with
a misconception which will require very vigorous action to dispel.

So:  I agree with you that the document should be clearer about the
inherent optionality of virtually all the information it makes it
possible to tag.  I hope you will agree with me that the variety of
approaches people will use (already use) in making and using electronic
texts requires us to cater for both models of how texts are made and
used.

4.  I note with some faint disappointment that you say very little in
your paper about specifically literary forms of analysis or about
specifically literary text forms and what must be added to deal with
them properly.

------

Now to individual items.

 ---------------
 Your section 1.
 ---------------

p. 1 (1.1.1).  You are quite right; this paragraph needs to distinguish
more sharply between the use of the guidelines in deciding what features
of a text to capture and their use in finding a representation for them.
It does not, however, promise guidance of any kind to a 'neophyte'!

p. 3 (1.1.4).  I believe you've misread this section.  It does *not*
recommend full SGML tags in data capture; it points out, indeed, that P1
makes *no* recommendations about the specific form to be used in data
capture.  (It also says why:  any such general recommendations must be
vacuous or else inapplicable in many circumstances.)

p. 4 (1.1.4).  Examples of how to construct keyboard macros in one or
more commonly used word processors would be appropriate for a tutorial;
this document is not in a position, however, to teach readers how to use
their word processor!

p. 15 (2.1.4) I do not understand your reading of P1 here.  I see no
recommendation at all about the embedding of interpretations.  The
inherent characteristics of DTDs which allow us to treat them as
interpretively significant are common to *all* markup and thus
inescapable.

p. 16 (2.1.4.2) Use of minimization is good or bad depending on what
kind of computing environment one has.  No working committee has yet
been willing to make global recommendations for data capture or local
processing, and several have fiercely resisted the suggestion that
anyone should or could.  If you can persuade the relevant committees
that minimization should be recommended for local use, more power to
you.  For now, it is simply a technique which should be illustrated,
with others, in tutorials and cookbook presentations of how to use the
TEI.  Me, I find it invaluable for local use, but I have seen enough
other people's computing environments to be wary of quick
generalization.

p. 16  Which is too wordy:  the text or the tagging?

p. 55 (4.1.4) Are you saying that scholars in some fields refer to books
with different pagination and lineation as the 'same edition'?!  This is
new to me; if the book is reset, I was taught to call it a new edition.
Publisher (or printer, in books to about 1700) and date would then
uniquely determine the item in question.  Certainly the usual scholarly
and bibliothecal practice as encoded in the MLA style book and the
relevant library standards appear to find city, publisher, and date
sufficient.  Can you expand on your remark?

(Also, N.B. this brief discussion includes exemplary items of
information only; the recommendations for what to include and what to
regard as optional are in section 4.3).

p. 65 (4.5) If local conventions survive into the interchange format of
a text, you are right.  But I did not imagine that they were supposed to
do so.

p. 77 (5.2.5) Colophon.  Yes, some puzzled readers have asked what this
is.  I don't know any alternative term, however, so perhaps we just need
to define it at greater length.

p. 77 (5.3.1) and passim.  I am puzzled by your insistence on the need
not merely to provide notation for, but to *recommend* the encoding of
line breaks.  They are seldom the object of literary research (at most,
of research in analytic bibliography); they are typically ignored
silently in critical editions (except in diplomatic editions of
manuscript texts, which are the exception and not the rule); they are
very seldom used in citing texts in critical studies (the MLA style
sheet does not require them; the Chicago Manual of Style does not
mention them, let alone recommend them or require them).  If line breaks
are *required* for literary work, why do the manuals of scholarly
practice not mention them?  Why do the articles in PMLA and JEGP and
MLN and Romance Philology not use them?

I am willing to agree that it is useful to be able to tag them, but the
evidence of current scholarly practice (not to mention the practice of
most encoders, as I have seen it, -- I make the assumption that they
encoded what they felt they needed to encode) suggests that they are
merely useful, no more.

(All this of course applies only to prose; in verse texts the importance
of line marking is obvious and explicitly stated.)

p. 125 (5.11.2) 'downplayed' how?  I am not sure what kinds of layout
information you are arguing for, nor how you believe they are being
downplayed.  I cheerfully confess that I think rather less highly of the
importance of capturing the physical page in the usual case (and I think
the practice of critical editions, which typically record no information
on the layout of the copy text, suggests that the consensus of
scholarship doesn't regard detailed layout information as so important,
either).  Here, however, I see no downplaying at all.  Not everyone is
interested in physical layout; for those who are, there are tags.

p. 178 (7.3.1.2) on rhyme.  Since the description of rhyme annotation is
offered as an example only, I miss the prescriptive tone you disapprove
of.  Your point about French is quite important however; what is the
usual notation for rhyme schemes among Romance metrists?

p. 200 (8.4.1) This entire chapter needs thorough reworking in any case.

pp. 207-209 (A.1) I maintain my view that the lineation of the Signet
paperback of Jane Eyre has no intrinsic or scholarly value.  I don't
know how you got the notion that copyright was an issue here; we
transcribed this chapter from the volume I had on my shelf because it
was handy.  Any prescriptive tutorial on tagging would do well, of
course, to use a standard edition.

 ---------------
 Your section 2.
 ---------------

General.  You are very right about the lack of precision in defining and
distinguishing types and levels of TEI conformance, requirements,
recommendations, etc.  This was originally left vague partly because it
was vague in the deliberations of the committees and partly in the
belief that defining a set of T.E.I. Recommendations with capital R
would reify the guidelines too much and obscure the essential
flexibility.  There was much fear that funding agencies might apply any
too-clearly defined Recommendations as Procrustean requirements for all
projects whether appropriate for them or not.  A certain mistiness of
definition seemed one way to counter this danger.

Discussion since the appearance of the draft has taught me that more
clarity, not less clarity, is needed, and funding agencies need to be
told very explicitly what are and what are *not* appropriate ways to
measure the use a project may make of the TEI scheme.

Your suggestions for grouping are worth considering; we should talk
about them viva voce.

p. 1 (1.1.2) I don't see where you see the 'grudging' acceptance that
interchange and local processing are two different things.  What one
does in the privacy of one's own CPU is one's own business and no
concern of this draft.  That is why no recommendations are made about
data capture!

Why *recommend* the use of shorter tags, though?  If people want them,
they can have them; why need we recommend them?

pp. 45-52 (3.2) character sets.  Clearly we do need more clarity on the
interchange/local processing distinction throughout this chapter (though
this is where the concepts have been made most precise!) -- but how is
one to SHOW the character set code used in a Mac or PC?

pp. 82-83 (proper names, abbreviations).  That these are intended to be
optional is shown by the use of 'may'.  Here and passim, obviously,
allowance must be made for people's tendency to fear the worst.

passim.  Your suggestions for examples are good ones.  Thank you.

 ---------------
 Your section 3.
 ---------------

I believe your distinction between encoding and interpretation, while a
useful point of departure for individual cases, breaks down in practice
when one attempts to apply it globally.  The distinction between capital
and lowercase letters, and spelling in general, is interpretive in many
manuscripts (is that a capital letter in the manuscript, or only a
slightly malformed uncial?).  The line breaks (in the metrical sense) of
Beowulf are the topic of scholarly interpretation and research (Robert
Creed found it a publishable result that he was able to formulate rules
for deciding where to place the verse break in controversial cases).

You argue, if I understand you, that scholars don't want their encoders
deciding that something is in quotes because it's ironic or because it's
really quoted, because they want to do that themselves.  How are they to
record their decisions, however, if we provide no tags for expressing
such distinctions?  Don't tell me that no one wishes to build further
work on such distinctions:  there *are* concordances which distinguish
between the vocabulary of the author and that of the authorities quoted
by the author -- one need only consider the Index Thomisticus.

It is inescapable that some people wish to tag such items one way, and
others another way, and that the TEI must therefore provide tags for
both approaches.  Since the more interpretive tags may be reduced to
their less-interpretive equivalents, the provision of the interpretive
information does *not* bar the reinterpretation of the data by others.
And since the richer the text, the more sensitive can be our processing,
I believe the current draft is right to say that the richer encoding is
preferable where it is practicable and appropriate.  (And n.b. more than
that, the current draft does not say.  I have taken heat on this subject
from readers who consider us to have caved in to the
presentation-oriented taggers.)
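
(A minimal sketch of such a reduction, in Python.  The tags 'ironic',
'quoted', and 'q' and the sample sentence are hypothetical, chosen only
to illustrate the point; the sketch assumes tags without attributes.)

```python
import re

# Hypothetical mapping from interpretive tags to a neutral equivalent
# that records only the fact of quotation marks.
REDUCTIONS = {"ironic": "q", "quoted": "q"}

def reduce_tags(text, reductions=REDUCTIONS):
    """Rewrite each interpretive tag as its less-interpretive equivalent."""
    def swap(match):
        slash, name = match.group(1), match.group(2)
        # Tags with no entry in the mapping pass through unchanged.
        return f"<{slash}{reductions.get(name, name)}>"
    return re.sub(r"<(/?)(\w+)>", swap, text)

rich = ("He called it a <ironic>triumph</ironic>, "
        "citing <quoted>veni, vidi, vici</quoted>.")
plain = reduce_tags(rich)
```

The reduction runs in one direction only:  the neutral form can always
be derived from the richer one, while the reverse requires an act of
interpretation.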

p. 105 (5.8.1) explicit tagging of sentences.  No such tagging is
proposed or given here.  The tag given here is for marking arbitrary
segments of text (s = segment); the discussion explains how it may be
used to tag *orthographic* sentences.  In your dichotomy between the
objective and the interpretive, I believe orthographic sentences (those
explicitly marked by end punctuation in the copy text) ought to fall on
the former side.

p. 105 (5.8.2) on removal of quotation marks.  If you show me a single
text encoded by anyone before 1985 which retains the rendition of
quotation marks (distinguishing opening from closing marks, and making
clear from the file itself which form they took, of the 20 or so forms
that quotation marks take in European publishing), then I will buy you a
beer in Tempe.  Two beers.  Bear in mind that before the advent of
microcomputers, almost no vendor-provided character set possessed
distinct characters for open and close quotation marks.  No standard
character set does, that I know of, to this day.  The text is quite
explicit, I thought, that such 'removal' of data is to be contemplated
only where the removal is reversible (hence the use of the word
'redundant').

p. 124 (5.11.1) Clarissa example.  To clutter an example of treatment of
italics by inclusion of page and line boundaries would be a bit much, I
should have thought.

Why would 'Anglicè' be Italian?  Would the Italian not be '[per]
inglese'?

You are right that it is difficult to decide with certainty why the
italics are used.  But that is why this example is used to show how to
encode the mere fact of italics when a further distinction is not
appropriate!

p. 214 (Hamlet).  The problem you note is a real one, though I would
have said it's a problem with the taxonomy of stage directions proposed
(a rather silly one) rather than with the notion of analytic markup.

 ---------------
 Your section 4.
 ---------------

p. 76 (5.2.4) touché.

p. 105 (5.8.2) I understand the committee's recommendation to mean 'Do
with hyphens what you would do were you editing the text or typesetting
it:  respect or suppress.'  I agree that a bit more clarity and further
discussion would be welcome.

p. 110 (5.10.3) this is 'humorous'?  Remind me to tell you some other
good ones sometime!  It is difficult to find a real example with the
requisite features of brevity, simplicity, and a range of manuscript
configurations permitting exposition of nested and non-nested,
overlapping and discontiguous variants.  If anyone finds a suitable real
example, I shall be very happy.  (And buy them a beer.)

p. 129 What is a mug's game?  (Other than poetry, I mean.)

pp. 140-44 (6.2.4) You seem to be assuming that the only way to look at
a machine-readable text is with the tags in it.  The analytic tags of
this chapter are intended for the case of a rich encoding for which in
the usual case some special filtering software will either be made
available or can be written locally.

It is of the greatest importance that the literary committee consider
carefully not only the number of bytes in the encoded linguistic
analyses but the structural characteristics of the markup, because the
structural flexibility of the tags of this chapter allows them to be
used for non-linguistic analysis of many types, and they may provide an
adequate basis for many types of specifically literary and historical
markup (e.g. identification of thematic groups, segmentation of a text
according to sources or according to narrative function, prosodic
analysis of poetry).  The definition of structures for such open-ended
problems is very difficult and very important, and I personally believe
that the linguists have done a superlative job in providing a very
powerful and flexible tool that can be exploited for literary and other
ends.  Please examine it carefully in that light!

p. 169 'narrative used in the sense of prose'.  What leads you to this
-- rather eccentric, may I say? -- interpretation?  Narrative is
certainly not always prose, prose is certainly not always narrative.
You seem to be attributing to us (or relying on yourself?) a Crocean
tripartite genre scheme -- but none such is intended here!  (And were it
appropriate for this document to take sides in genre theory, none would
be tolerated here!) Nothing of substance in 7.3.3 is specific to prose,
and nothing here is relevant to non-narrative prose (except the
observation that a great many uses of many literary texts may be taken
care of without any tags beyond those in chapter 5).  The opening two
paragraphs of 7.3.3 rather carelessly speak only of prose examples,
which you would be / are right to fault.

p. 181 On cast lists.  This is also true of English and American plays
published by Dramatists Play Service.  It is, I think, the task of your
work group, or one that you recommend forming, to provide for this.

----------------

In conclusion, let me thank you again for your lengthy and productive
lucubrations.  I have enjoyed crossing paragraphs with you, and hope
that you will accept my forthright disagreements with you in the same
spirit of mutual cooperation in which they are meant.  I look forward
to hearing what happens at your meeting; I wish it were possible for
me to attend, but I'm not able to.

Best regards to the others in the work group.

Michael