Date: Mon, 25 Nov 1991 09:40:59 CST
Sender: "TEI-L: Text Encoding Initiative public discussion list"
From: Gary Simons
Subject: Report of names, dates, measures subcommittee

REPORT OF SUBCOMMITTEE ON NAMES, DATES, AND MEASURES
17 November 1991, Myrdal, Norway
by Gary Simons

1. INTRODUCTION

One of the major unresolved issues that was identified during the initial round of presentations at the meeting was the mechanism for marking up names, places, dates, and measures. Daniel Greenstein, representing the historical research community, reported that the "crystal" approach sketched in P1 was much too limited (and limiting) for the needs of his colleagues. He further observed that the feature structure mechanism developed for linguistic analysis might offer a solution.

A subcommittee of Daniel Greenstein, Jacqueline Hamesse, and Gary Simons was appointed to work on this problem. Specifically, the subcommittee was asked to do the following:

1. Identify the constituent parts (or features) which must be nameable for names, dates, places, and measures.

2. Develop examples of the encoding of the interpretation of names, dates, places, and measures in feature structure notation. Can this formalism handle alternation and uncertainty?

3. If there is still time, develop a Feature System Declaration for the markup of names, dates, places, and measures.

In a nutshell, we concluded that the feature structure formalism (question 2) provides a good solution to the markup needs of historians. Inasmuch as this solution treats the names of structures and their parts (or features) as attribute values, rather than as the names of elements and attributes which must be declared in the SGML DTD, there is no need for the TEI to answer question 1; it can be left to the historians themselves.
Greenstein was not only satisfied that this solution would work, but was insistent that any solution that predefined the structures and their features would fall short of meeting the open-ended needs of the historical research community and thus not be accepted by them.

As for question 3, we did not develop an FSD, but concluded that there were no problems in principle. In fact, FSDs for historical markup would generally be simpler than the sample FSDs proposed in AI1W3 for linguistic markup, since they would be less likely to use complex statements of default values or co-occurrence constraints.

The substance of our report back to the full meeting consisted of the following more general points which bear on the work of other subcommittees of the TEI.

2. GENERALIZING FEATURE STRUCTURES

Our solution for the interpretive markup of names, places, dates, and measures is based on P1's feature structure mechanism. One change that the AI1 working group proposed last January was to simplify feature structure markup by removing the and elements and replacing them by "name" attributes in the and tags, respectively. (Another change which I think should be made, but I don't recall whether we proposed it earlier, is that the atomic value of a feature simply be text rather than being embedded in an that has no internal structure.)

At the Myrdal meeting, the AI chair (Terry Langendoen) proposed a further change, namely, renaming to and to . He suggested that such a change might have the effect of making these elements more accessible to non-linguists. While these names probably do have the widest familiarity, some participants cautioned that these names might have the disadvantage of raising expectations concerning TEI as a database system and bring the project into a whole new realm for consideration of compatibility with database standards and practices.
Our subcommittee did feel that Langendoen was right about the idea of renaming to make it more general and more accessible to a wide range of research communities. More neutral than would be simply (or for short) and that is the tag I will use in the remainder of this report. And what about the internal structure? A nicely generic term is , which I will also use throughout.

Another possible pair of tags is and . (These names follow the SGML nomenclature and thus might be confusing, but they do serve the purpose of highlighting the analogy to SGML content elements and attributes and of providing user-definable ones which allow embedded elements within attribute values.) And note, further, that the simplified definition of feature structures makes the and tags of P1 redundant.

This leaves us with the following candidates for the names of the general-purpose structure and its constituent parts:

Note, too, that some degree of mixing and matching is possible. That is, there are some other combinations of names from the above two lists that might work, like and , or and .

3. COLLECTIONS OF STRUCTURES (OR RECORDS OR UNITS OR WHATEVER)

Another of Langendoen's proposals was that we add to encode collections of records. We found that a mechanism like this is definitely needed for the application of historical research. As well as marking up text, historians also want to encode "data dictionaries" which contain all the names, places, dates, measures, and so on they have found and which collate all the information known about each.

Some questions remain to be answered:

(1) What do we call these? goes well with . What if we go for rather than record? ... ?

(2) Should a record collection be any arbitrary collection of records, or should it declare a type and be constrained to contain only records of that type?

(3) Where does a record collection that accompanies a marked up text go? Does it go somewhere in , or does it require a new tag or a change to the content model for ?
Note that such record collections could come in handy in fields other than history. For instance, a text critical application could append a record collection to describe the features of all the witnesses referred to. A dictionary could append a record collection to give the feature structures for all of the part-of-speech tags. Each witness or part of speech would be encoded as a record (a.k.a. feature structure) and would contain a unique ID which would be referred to from the text proper by IDREFs.

4. A GENERIC SOLUTION FOR INTERPRETIVE MARKUP

The approach proposed in P1 is to use tags like and in text markup, and to provide these with attributes to store interpretive information added by the analyst. Indeed, the brief given this subcommittee was to devise a list of the needed tags and the attributes needed by each. We concluded that this approach would not, in general, be satisfactory for the following reasons:

(1) Historians (as demonstrated in Greenstein's recent book on information modeling in historical research) have already identified dozens of "primitive data types" (including many different types of names and dates). To provide an inventory that they feel to be adequate would require quite a proliferation of special-purpose tags.

(2) Similarly, there are potentially a dozen or more attributes for each tag, further fueling the proliferation.

(3) Investigators will always think of new element types or new attributes that are essential for their research but are not included in the standard set.

(4) SGML attributes do not allow embedded structure, but that is essential for interpretive markup where the investigator must be able to record alternative hypotheses or attribute values which are themselves structures of a different type.

Our proposal, therefore, is to avoid the above problems by using the feature structure mechanism in whatever generalized form it ends up as.
Each different type of primitive data element would be encoded as a different type of feature structure. The eventual successor to the Feature System Declaration (see AI1W3) would then specify the allowed features (or fields) for each structure (or record) type, and the allowed ranges of values. In this way the details of the special-purpose record types and associated fields needed for a particular encoding task are left to the individual researcher. These details could be specified in terms of SGML markup in the FSD rather than requiring the researcher to also understand how to extend a DTD. Communities of domain experts could propose record types and field specifications in the TEI case books, without these having to be part of the standard.

Note that the adoption of the general structure/record mechanism suggests a general principle that could be used to wield Ockham's razor in the process of developing the TEI tag set: whenever the tag set seems to be proliferating beyond what seems appropriate, or whenever there is significant discomfort about the likelihood of acceptance of application-specific tags by specialists in that domain, the general structure/record mechanism provides a graceful alternative. For instance, at the Myrdal meeting concern was expressed about the proliferation of tags proposed by the text criticism working group and disquiet was expressed about the acceptability of the "situational context" tagging proposed by the spoken text working group.

5. EXTENDING THE "FEATURE SYSTEM DECLARATION"

The notion of a Feature System Declaration (proposed in AI1W3 as an auxiliary file which encodes the semantics behind the user's use of feature structures) was endorsed in the plenary session. However, as we move toward transforming feature structures into general-purpose record-like data structures it is necessary to extend the notion of the FSD.

The primary extension is that multiple structure (or record) types must be declared.
The previously proposed FSD was written as though all use of feature structures reflected a single feature system. With the shift toward record-like structures, we must introduce the notion of different types of records and each different type needs a separate declaration. Thus the notion of a Feature System Declaration is extended to that of Structure Type Declarations, or Record Type Declarations. For each type, the three kinds of information specified in an FSD are given: the range of allowed values for each feature (or field), the default value in the case that a structure (or record) does not specify one of its features (or fields), and co-occurrence constraints between the values of multiple features (or fields) in a single structure (or record).

For instance, if the and nomenclature is adopted:

    ... ... ... ... ... ... ... ... ...

If the and nomenclature is adopted, tags like the following might be appropriate:

    ... ... ... ... ... ... ... ... ...

Note that in taking the step of generalizing from linguistic feature structures to general-purpose record-like structures further extensions to the record type declaration are called for. For instance, we would probably want to be able to identify the key field (or combination of fields) for a record type in order to facilitate export/import of record collections to/from database systems. We would want to develop a richer set of range constraints than was proposed for linguistic feature structures, for instance, including integers and reals with minimum and maximum value constraints.

6. NAME, DATE, ETC. MARKUP WITH INTERPRETIVE RECORDS IN-LINE

One approach to the interpretive markup of names, dates, measures, and so on would be to place the record structures directly in the text. For instance, consider the text fragment "On the third of the month we travelled to ..."

    ... On the third of the month 3 February 1835 we travelled to ...
The includes the bit of text being interpreted as one of its parts, and then adds more parts to encode the investigator's interpretation (based on inference from context) as to the exact date being referred to.

If we want to allow this style of markup, it may require that be afforded the status of a "crystal" in the TEI DTD. At one point, was allowed only in the environment of . I'm not sure whether this is still the case.

7. NAME, DATE, ETC. MARKUP WITH POINTERS TO INTERPRETIVE RECORDS

The above approach injects a lot of extraneous material into the text. A cleaner approach might be to tag spans of text to be interpreted and then use a reference to the unique identifier of an interpretive record in a record collection. For instance,

    ... On the third of the month we travelled to ...
    -------
    3 February 1835

This example follows the strategy proposed in P1 of having some basic tags for name, date, and so on. As discussed above in section 4, a solution which relies solely on unique tags (including interpretive attributes) for different primitive data types does not appear to be acceptable to the historical research community. However, this hybrid solution may prove acceptable. The tag, for instance, is a generic one. It is used to tag all primitive data types pertaining to dates. The type attribute of the associated tells what specific kind of date reference it is. The tag uses an "interp" attribute to point to the interpretation. (Another possible name would be "analysis".)

The above example is simple enough that the values of the parts could have been encoded as attributes of the tag. However, in general this is not the case. The values of the parts could be embedded structures, or they could involve alternatives. For instance, if it were not clear whether the year were 1835 or 1836, the relevant might be encoded as:

    1835 1836

This example illustrates the problem of dealing with atomic values in lists of values.
In the original formulation, the atomic values of features were coded as containing nothing but the string which was the value. On further reflection this didn't seem right and in the meeting of the AI1 working group last January we may also have recommended the creation of the tag. That proposal would have required everywhere, not just within lists. Note, however, that the and did allow bare text strings as values (which was possible because it did not allow alternations). If we want to preserve the ability of part (or feature) values to be bare strings without the seeming inconsistency of tagging them as in some contexts, we could substitute a tag like , for "term in a list", which is used only within an , , or to show where a single value begins and ends.

Much more complicated examples of alternatives in an analysis can be devised, and we in fact sketched some in our subcommittee meeting at Myrdal. They involved enough specialty-specific details, however, that I feel uncomfortable trying to reconstruct them. Rather, Dan Greenstein has proposed to attempt the markup of some real examples.

One general principle we discovered in working these examples is that the markup scheme should allow us to distinguish between analytical hypotheses and interpretive conclusions. For instance, in dealing with a personal name in a text, one can initially associate it with an analytical structure of type "personal.name" (which could include a number of alternatives) without yet knowing who the name refers to. The ultimate conclusion would be to associate the name with a particular person, for whom there would also be a structure in the data dictionary. The pointer to the person could either be another attribute of the top-level tag, or another of the interpretive structure for the name (with an value).
Finally, we must note the similarity of this markup problem to that of marking arbitrary spans of texts in literary analysis, such as to mark metaphors, for instance, and provide some analysis of them. Another subcommittee at Myrdal proposed a tag for just this purpose. Providing an attribute in which uses an IDREF to point to a record structure that has the analysis sounds like a good general solution. In fact, tags like and could be replaced by a general tag and avoid altogether the problem of what exactly this set of tags should be. In this case, the type of thing being marked (e.g. personal name versus place name versus fixed date versus metaphor, etc.) would be encoded as the "type" for the structure referred to by the "interp" of the .

The generalized solution commends itself for another reason, namely, analytical alternatives can percolate up to the very top level of markup. For instance, in "I went to Rochester to get help," Rochester could be the name of a person or it could be the name of a place. If the markup scheme specifies separate tags for and , then how can this be marked up? If on the other hand, Rochester is simply enclosed in a , then its "interp" can include two IDREFs (one pointing to a and the other to a ), or it could point to a single which includes both structures.

Another thing we should probably be prepared for is the possibility that the content of tags like and might need to overlap, which would not be possible. The mechanism, however, can handle that sort of thing.
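
The pointer-based scheme discussed in section 7 and in the Rochester example can be sketched as a small data model. What follows is only an illustration under stated assumptions, written in Python rather than SGML. The class names (Record, Alt, Span, RecordCollection) and the part names "surname" and "settlement" are hypothetical and are not tag names proposed in this report. Only the "interp" pointer, the "personal.name" type, and the 1835/1836 alternation come from the discussion above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# All class and part names here are hypothetical illustrations,
# not tag names fixed by the TEI or by this report.


@dataclass
class Alt:
    """A set of alternative values for one part (e.g. year 1835 or 1836)."""
    options: List["Value"]


@dataclass
class Record:
    """A typed structure: a unique ID, a type, and named parts."""
    id: str
    type: str
    parts: Dict[str, "Value"] = field(default_factory=dict)


# A part value may be a bare string, a set of alternatives,
# or an embedded structure of another type.
Value = Union[str, Alt, Record]


@dataclass
class Span:
    """A stretch of text whose 'interp' holds one or more record IDs."""
    text: str
    interp: List[str]


class RecordCollection:
    """A 'data dictionary' of records, looked up by ID (like IDREF resolution)."""

    def __init__(self, records):
        self._by_id = {}
        for r in records:
            if r.id in self._by_id:
                raise ValueError(f"duplicate ID: {r.id}")
            self._by_id[r.id] = r

    def resolve(self, span):
        # A missing ID raises KeyError, much as a dangling IDREF would be an error.
        return [self._by_id[i] for i in span.interp]


collection = RecordCollection([
    Record("n1", "personal.name", {"surname": "Rochester"}),
    Record("p1", "place.name", {"settlement": "Rochester"}),
    Record("d1", "date", {"day": "3", "month": "February",
                          "year": Alt(["1835", "1436"[0:0] + "1836"])}),
])

# "Rochester" is ambiguous, so its interp carries two pointers.
rochester = Span("Rochester", interp=["n1", "p1"])
types = [r.type for r in collection.resolve(rochester)]
print(types)
```

In this sketch the type of thing being marked lives on the record rather than on the span, so a single generic span can carry competing interpretations, and an uncertain year is held as a list of alternatives inside one part.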