=========================================================================
Date: 29 January 1991 15:21:26 CST
From: "Michael Sperberg-McQueen 312 996-2477 -2981"
Comment: "ACH / ACL / ALLC Text Encoding Initiative"
To: "Paul A. Fortier 204 474-9841"
cc: "Lou Burnard +44 (865) 273 238"
Subject: preliminary notes on your critique

First of all, many many thanks for your paper. It gives me great pleasure to see someone reading the guidelines with so much care and concern, and I appreciate the time that has gone into it. I believe we have some apparent disagreements which will turn out to be without substance, and some which will turn out to reflect basic differences of opinion. But I enjoyed your paper a lot all the same.

I'll discuss individual points below (and possibly in a following note if I run out of time writing this today); before I do, some recurrent themes should probably be taken care of once instead of many times.

1. At various points, I agree with what you interpret the current draft as saying, and disagree over whether that's what they should say. At others, I disagree with your paraphrase of the draft and will say below 'How can you say that the draft says this?' I should say in advance and globally that I believe each passage which occasions either such comment is ipso facto in need of revision. The first set, because the draft should either be changed to say something else, or be made more persuasive. The second set, because if a passage has appeared unclear or misleading to you, it clearly needs to be reformulated, even if the reformulation does not change the substance of the utterance.

2. You find many failings in the current draft, viewed as a guide to the non-technical computer user. I will not address all of these individually, because what astonishes me is that you find so few, given that the current draft is not intended at all as a guide for non-technical users.
As the preface says, 'It is clear that the Guidelines cannot achieve a wide audience without some kind of informal introductory guide, but such a guide is not yet available.' (It was punchier in a draft: 'but this is emphatically not it.') Suffice it to say that I am concentrating for the moment on what the recommendations should be, and not yet on finding formulations for them which will not mislead the naive reader.

It is not too early to begin worrying about the non-technical users, but I think it will be useful to distinguish the substantive discussion over what practices should be recommended, and the pedagogical discussion over what rhetorical strategies should be marshalled in the exposition of those practices. The draft in our hands now is intended as a first step towards one part of what will require at least three parts.

  1 A reference manual which lays things out in a form suitable for reference, without pedagogical structure (parts of the appendices will go there; the body of the reference manual was not ready for inclusion in the current draft).

  2 A discursive presentation of the material, with a certain rhetorical coherence and topical organization, at a sometimes rather technical level.

  3 An introductory guide which covers only the core issues and assumes much less computer sophistication than the other two documents (parts of the SGML tutorial would go there).

What we have is a rough draft of document 2. It is not (and, I submit, probably should not be) an introduction to computer use for the non-computer-using humanist, nor an introduction to the use of word processors and editors, nor a programming text, nor an introduction to formal language theory, nor a guide to the current state of the SGML marketplace, nor the larger text-handling software marketplace. (Nor, for that matter, an introduction to analytic bibliography, literary study, or linguistics, for those in need of such introduction.) Such texts need to be separate documents.
What I am learning is that they are needed now, already, by people we had not thought would need them. But they'll be done (for now, at least) as separate documents, not as additions to this one.

3. I think a lot of difference in our points of view reflects somewhat different assumptions about how texts are or will be prepared and used. Both of us are awake enough to know that text preparation and use vary widely among the community we are trying to serve, but I think we make different assumptions about the 'normal' case -- defined, if in no other way, as what we think about when we try offhand to see how some tag or other will work in practice.

You assume rather explicitly I think (correct me where I err):

- an unskilled coder (typically an undergraduate) who prepares the text for processing
- a skilled interpreter / scholar who processes the text but does not typically change its coding or tagging
- a single process of 'encoding' the text which produces the 'encoded text' and which has a distinct end point (or at most a very small number of revision processes, each pretty much discrete, finite, and clearly punctuated)
- a single copy of the text, processed perhaps in various ways but (always?) using the same view of the text (same text, same tags)
- (therefore) a strong case for keeping the tagging non-controversial and non-subjective

I assume (also rather explicitly, if you read my forthcoming paper in L&LC):

- a coder of any level of skill you like to imagine (low, high, or very low, e.g.
  a scanner)
- an interpreter who interacts with the text and whose work on the text may (often does) lead to changes in the encoding
- an ongoing (not a finite) process of encoding the text, leading incrementally from an initial encoding to more and more elaborate encodings recording not only the input but also the output of analysis and interpretation
- a single copy of the text, processed in widely varying ways and providing (through the use of filters) multiple views of the text (same or different base text, different taggings)
- (therefore) a strong case for making the tag set as expressive and complete as possible -- one's new work should be able to build on one's old work or that of others, and therefore the results of one's work must be expressible in the markup

Consider the comments of respondent 36, who envisages multiple levels of analysis and commentary on the compound 'winter-cearu' in Anglo-Saxon. Consider the requirements of someone who wants to encode a text so as to be able to find, later, all the words of a certain semantic field and who must therefore be able to gloss words in a systematic way (and who may wish to remain agnostic as to the correct resolution of some ambiguities in a text). This is not work which an undergraduate encoding the text will do. But to be useful for the later processing which is intended, such matter must be *encoded* as part of the encoded text.

This is the most difficult problem for the scheme, from a technical point of view: ensuring that the scheme can carry as much information as we are willing and able to give it, and that information given for one purpose does not interfere with information added for another (we may want text-critical variants, for whatever reason, but wish for a while to ignore them and process the metrical characteristics of the base text alone).
The current draft clearly gives far too many people the impression (far too strongly) that everything it mentions is supposed to be tagged by everyone, all the time. Since it does not actually say that, ever, and actually says the reverse from time to time, we are clearly dealing with a misconception which requires very vigorous action to prevent.

So: I agree with you that the document should be clearer about the inherent optionality of virtually all the information it makes it possible to tag. I hope you will agree with me that the variety of approaches people will use (already use) in making and using electronic texts requires us to cater for both models of how texts are made and used.

4. I note with some faint disappointment that you say very little in your paper about specifically literary forms of analysis or about specifically literary text forms and what must be added to deal with them properly.

------

Now to individual items.

---------------
Your section 1.
---------------

p. 1 (1.1.1). You are quite right; this paragraph needs to distinguish more sharply between the use of the guidelines in deciding what features of a text to capture and their use in finding a representation for them. It does not, however, promise guidance of any kind to a 'neophyte'!

p. 3 (1.1.4). I believe you've misread this section. It does *not* recommend full SGML tags in data capture; it points out, indeed, that P1 makes *no* recommendations about the specific form to be used in data capture. (It also says why: any such general recommendations must be vacuous or else inapplicable in many circumstances.)

p. 4 (1.1.4). Examples of how to construct keyboard macros in one or more commonly used word processors would be appropriate for a tutorial; this document is not in a position, however, to teach readers how to use their word processor!

p. 15 (2.1.4) I do not understand your reading of P1 here. I see no recommendation at all about the embedding of interpretations.
The inherent characteristics of DTDs which allow us to treat them as interpretively significant are common to *all* markup and thus inescapable.

p. 16 (2.1.4.2) Use of minimization is good or bad depending on what kind of computing environment one has. No working committee has yet been willing to make global recommendations for data capture or local processing, and several have fiercely resisted the suggestion that anyone should or could. If you can persuade the relevant committees that minimization should be recommended for local use, more power to you. For now, it is simply a technique which should be illustrated, with others, in tutorials and cookbook presentations of how to use the TEI. Me, I find it invaluable for local use, but I have seen enough other people's computing environments to be wary of quick generalization.

p. 16 which is too wordy? the text or the tagging?

p. 55 (4.1.4) Are you saying that scholars in some fields refer to books with different pagination and lineation as the 'same edition'?! This is new to me; if the book is reset, I was taught to call it a new edition. Publisher (or printer, in books to about 1700) and date would then uniquely determine the item in question. Certainly the usual scholarly and bibliothecal practice as encoded in the MLA style book and the relevant library standards appears to find city, publisher, and date sufficient. Can you expand on your remark? (Also, N.B. this brief discussion includes exemplary items of information only; the recommendations for what to include and what to regard as optional are in section 4.3).

p. 65 (4.5) If local conventions survive into the interchange format of a text, you are right. But I did not imagine that they were supposed to do so.

p. 77 (5.2.5) Colophon. Yes, some puzzled readers have asked what this is. I don't know any alternative term, however, so perhaps we just need to define it at greater length.

p. 77 (5.3.1) and passim.
I am puzzled by your insistence on the need not merely to provide notation for, but to *recommend* the encoding of line breaks. They are seldom the object of literary research (at most, of research in analytic bibliography); they are typically ignored silently in critical editions (except in diplomatic editions of manuscript texts, which are the exception and not the rule); they are very seldom used in citing texts in critical studies (the MLA style sheet does not require them; the Chicago Manual of Style does not mention them, let alone recommend them or require them). If line breaks are *required* for literary work, why do the manuals of scholarly practice not mention them? Why do the articles in MLA and JEGP and MLN and Romance Philology not use them? I am willing to agree that it is useful to be able to tag them, but the evidence of current scholarly practice (not to mention the practice of most encoders, as I have seen it, -- I make the assumption that they encoded what they felt they needed to encode) suggests that they are merely useful, no more. (All this of course applies only to prose; in verse texts the importance of line marking is obvious and explicitly stated.)

p. 125 (5.11.2) 'downplayed' how? I am not sure what kinds of layout information you are arguing for, nor how you believe they are being downplayed. I cheerfully confess that I think rather less highly of the importance of capturing the physical page in the usual case (and I think the practice of critical editions, which typically record no information on the layout of the copy text, suggests that the consensus of scholarship doesn't regard detailed layout information as so important, either). Here, however, I see no downplaying at all. Not everyone is interested in physical layout; for those who are, there are tags.

p. 178 (7.3.1.2) on rhyme. Since the description of rhyme annotation is offered as an example only, I miss the prescriptive tone you disapprove of.
Your point about French is quite important however; what is the usual notation for rhyme schemes among Romance metrists?

p. 200 (8.4.1) This entire chapter needs thorough reworking in any case.

pp. 207-209 (A.1) I maintain my view that the lineation of the Signet paperback of Jane Eyre has no intrinsic or scholarly value. I don't know how you got the notion that copyright was an issue here; we transcribed this chapter from the volume I had on my shelf because it was handy. Any prescriptive tutorial on tagging would do well, of course, to use a standard edition.

---------------
Your section 2.
---------------

General. You are very right about the lack of precision in defining and distinguishing types and levels of TEI conformance, requirements, recommendations, etc. This was originally left vague partly because it was vague in the deliberations of the committees and partly in the belief that defining a set of T.E.I. Recommendations with capital R would reify the guidelines too much and obscure the essential flexibility. There was much fear that funding agencies might apply any too-clearly defined Recommendations as Procrustean requirements for all projects whether appropriate for them or not. A certain mistiness of definition seemed one way to counter this danger. Discussion since the appearance of the draft has taught me that more clarity, not less clarity, is needed, and funding agencies need to be told very explicitly what are and what are *not* appropriate ways to measure the use a project may make of the TEI scheme. Your suggestions for grouping are worth considering; we should talk about them viva voce.

p. 1 (1.1.2) I don't see where you see the 'grudging' acceptance that interchange and local processing are two different things. What one does in the privacy of one's own CPU is one's own business and no concern of this draft. That is why no recommendations are made about data capture! Why *recommend* the use of shorter tags, though?
If people want them, they can have them; why need we recommend them?

pp. 45-52 (3.2) character sets. Clearly we do need more clarity on the interchange/local processing distinction throughout this chapter (though this is where the concepts have been made most precise!) -- but how is one to SHOW the character set code used in a Mac or PC?

pp. 82-83 (proper names, abbreviations). That these are intended to be optional is shown by the use of 'may'. Here and passim, obviously, allowance must be made for people's tendency to fear the worst.

passim. Your suggestions for examples are good ones. Thank you.

---------------
Your section 3.
---------------

I believe your distinction between encoding and interpretation, while a useful point of departure for individual cases, breaks down in practice when one attempts to apply it globally. The distinction between capital and lowercase letters, and spelling in general, is interpretive in many manuscripts (is that a capital letter in the manuscript, or only a slightly malformed uncial?). The line breaks (in the metrical sense) of Beowulf are the topic of scholarly interpretation and research (Robert Creed found it a publishable result that he was able to formulate rules for deciding where to place the verse break in controversial cases).

You argue, if I understand you, that scholars don't want their encoders deciding that something is in quotes because it's ironic or because it's really quoted, because they want to do that themselves. How are they to record their decisions, however, if we provide no tags for expressing such distinctions? Don't tell me that no one wishes to build further work on such distinctions: there *are* concordances which distinguish between the vocabulary of the author and that of the authorities quoted by the author -- one need only consider the Index Thomisticus.
It is inescapable that some people wish to tag such items one way, and others another way, and that the TEI must therefore provide tags for both approaches. Since the more interpretive tags may be reduced to their less-interpretive equivalents, the provision of the interpretive information does *not* bar the reinterpretation of the data by others. And since the richer the text, the more sensitive can be our processing, I believe the current draft is right to say that the richer encoding is preferable where it is practicable and appropriate. (And n.b. more than that, the current draft does not say. I have taken heat on this subject from readers who consider us to have caved in to the presentation-oriented taggers.)

p. 105 (5.8.1) explicit tagging of sentences. No such tagging is proposed or given here. The tag given here is for marking arbitrary segments of text (s = segment); the discussion explains how it may be used to tag *orthographic* sentences. In your dichotomy between the objective and the interpretive, I believe orthographic sentences (those explicitly marked by end punctuation in the copy text) ought to fall on the former side.

p. 105 (5.8.2) on removal of quotation marks. If you show me a single text encoded by anyone before 1985 which retains the rendition of quotation marks (distinguishing opening from closing marks, and making clear from the file itself which form they took, of the 20 or so forms that quotation marks take in European publishing), then I will buy you a beer in Tempe. Two beers. Bear in mind that before the advent of microcomputers, almost no vendor-provided character set possessed distinct characters for open and close quotation marks. No standard character set does, that I know of, to this day. The text is quite explicit, I thought, that such 'removal' of data is to be contemplated only where the removal is reversible (hence the use of the word 'redundant').

p. 124 (5.11.1) Clarissa example.
To clutter an example of treatment of italics by inclusion of page and line boundaries would be a bit much, I should have thought. Why would 'Anglicè' be Italian? Would the Italian not be '[per] inglese'? You are right that it is difficult to decide with certainty why the italics are used. But that is why this example is used to show how to encode the mere fact of italics when a further distinction is not appropriate!

p. 214 (Hamlet). The problem you note is a real one, though I would have said it's a problem with the taxonomy of stage directions proposed (a rather silly one) rather than with the notion of analytic markup.

---------------
Your section 4.
---------------

p. 76 (5.2.4) touché.

p. 105 (5.8.2) I understand the committee's recommendation to mean 'Do with hyphens what you would do were you editing the text or typesetting it: respect or suppress.' I agree that a bit more clarity and further discussion would be welcome.

p. 110 (5.10.3) this is 'humorous'? Remind me to tell you some other good ones sometime! It is difficult to find a real example with the requisite features of brevity, simplicity, and a range of manuscript configurations permitting exposition of nested and non-nested, overlapping and discontiguous variants. If anyone finds a suitable real example, I shall be very happy. (And buy them a beer.)

p. 129 What is a mug's game? (Other than poetry, I mean.)

pp. 140-44 (6.2.4) You seem to be assuming that the only way to look at a machine-readable text is with the tags in it. The analytic tags of this chapter are intended for the case of a rich encoding for which in the usual case some special filtering software will either be made available or can be written locally.
It is of the greatest importance that the literary committee consider carefully not only the number of bytes in the encoded linguistic analyses but the structural characteristics of the markup, because the structural flexibility of the tags of this chapter allows them to be used for non-linguistic analysis of many types, and they may provide an adequate basis for many types of specifically literary and historical markup (e.g. identification of thematic groups, segmentation of a text according to sources or according to narrative function, prosodic analysis of poetry). The definition of structures for such open-ended problems is very difficult and very important, and I personally believe that the linguists have done a superlative job in providing a very powerful and flexible tool that can be exploited for literary and other ends. Please examine it carefully in that light!

p. 169 'narrative used in the sense of prose'. What leads you to this -- rather eccentric, may I say? -- interpretation? Narrative is certainly not always prose, prose is certainly not always narrative. You seem to be attributing to us (or relying on yourself?) a Crocean tripartite genre scheme -- but none such is intended here! (And were it appropriate for this document to take sides in genre theory, none would be tolerated here!) Nothing of substance in 7.3.3 is specific to prose, and nothing here is relevant to non-narrative prose (except the observation that a great many uses of many literary texts may be taken care of without any tags beyond those in chapter 5). The opening two paragraphs of 7.3.3 rather carelessly speak only of prose examples, which you would be / are right to fault.

p. 181 On cast lists. This is also true of English and American plays published by Dramatists Play Service. It is, I think, the task of your work group, or one that you recommend forming, to provide for this.

----------------

In conclusion, let me thank you again for your lengthy and productive lucubrations.
I have enjoyed crossing paragraphs with you, and hope that you will accept my forthright disagreements with you in the same spirit of mutual cooperation in which they are meant. I look forward to hearing what happens at your meeting; I wish it were possible for me to attend, but I'm not able to. Best regards to the others in the work group.

Michael