Tags of the Computational Lexica Work Group

by Robert Ingria

At the meeting of the Computational Lexica Work Group held in Berkeley on June 16, 1991, it was decided to survey a broad range of lexica in use at sites throughout the world, along three dimensions:

(1) syntactic information (including morphology)
(2) semantic information
(3) translation information (used in MT systems only)

The notes here relate the results of the survey of syntactic information. Because of contact with the DARPA Common Lexicon working group, which is seeking to develop a common interchange format for lexicons for Spoken Language Systems, some issues related to pronunciation are also included.

Many of the tags included here have counterparts in the tags proposed by the Print Dictionary Work Group. (Reconciliation of the names is a topic for later discussion.) However, there are some major differences between the types of tags needed for print dictionaries and computational lexicons:

(1) There is less need for grouping tags. As for print dictionaries, a general grouping tag will be provided, but it is used much less frequently in computational entries.
(2) Some print dictionary tags, such as etymology, are absent, while others, such as grammar code, have expanded into several different tags.
(3) Non-grouping tags from the lexicon tag set may contain tags from other TEI tag sets, most prominently feature structures.

Thus, the tags for lexicons may be divided into three types:

  ``atomic'' tags, which do not allow for the inclusion of any tags within them;
  ``pseudo-atomic'' tags, which only allow non-lexicon-specific tags; and
  ``grouping'' tags, which allow lexicon-specific tags.

One other major difference between the tag set for printed dictionaries and that for computational lexicons is that the printed dictionary tag set allows for an encoding that preserves the appearance of the printed text.
Since there is no original text in the case of computational lexicons, we have decided to adopt a tag set that preserves the information in a given computational tag set but which makes no effort to preserve the original arrangement of this data (record structure, code, etc.). This prevents the proliferation of ad hoc tags based on idiosyncratic features of particular lexicons.

The general structure of a lexical entry is as follows:

========================================================================

Definition of a lexical entry in BNF
-------------------------------------

entry := name spelling stem pronunciation category paradigm morphology
         lexical-features subcategorization semantics examples comments
         edit-history

name = an index unique to each lexical entry

spelling = orthography; this corresponds to the citation form in a print dictionary

stem = the actual stem operated on by the system's morphological component

pronunciation = phonetic representation

category = part of speech

paradigm = the general inflectional category of the lexical item
    e.g. S-*ED for a word like ``jam'' that forms its third person singular present by adding -s (``jams'') and its past tense by doubling the final consonant and adding -ed (``jammed'')

morphology = specification of irregular forms
    e.g. ``PAST = ran'' for ``run''
    [perhaps ``inflected forms'' would be a better name]

lexical-features = other syntactic information
    e.g. count or mass for nouns
         position of occurrence for adjectives in English

subcategorization = a specification of the complements a lexical item appears with; e.g. for a verb, whether it is transitive, intransitive, etc.

semantics = some specification of the semantics of the item. This seems to range from atoms to complex formulae.

score = a probability or other weight, specifying the likelihood of occurrence of this sense of the lexical item (this may be the raw frequency of occurrence of the entire lexical item, if there are no subentries)

examples = examples of uses of the lexical item

comments = miscellaneous comments by system developers and users

edit-history = a history of the creation and modification of the lexical entry

Specification of data type fillers for fields:
----------------------------------------------

name :=
spelling := + [note that this allows for collocational entries]
category := +
paradigm :=
morphology := morph-value-pairs
lexical-features := + | lex-feature-value-pairs
subcategorization := + | subcat-feature-list
semantics := +
pronunciation, examples, comments, edit-history :=

Definition of fillers:
---------------------

morph-value-pairs := ( )+
lex-feature-value-pairs := ( )+
subcat-feature-list := +
semantics-feature := |

Types to be defined by specific lexicons:
----------------------------------------

cat, paradigm-indicator, morph-feature, atomic-lexical-feature, lex-feature, lex-value, atomic-subcat-feature, atomic-semantic-feature

We assume as defined in the TEI guidelines. We assume the existence of a grouping tag akin to the se/sense/grp grouping tag of the printed dictionary group.

========================================================================

The work on the DARPA Common Lexicon has also produced the following potential additional tags:

speech-transcription = speech transcription; this may be different from the standard orthography; e.g. SFO, the abbreviation for San Francisco airport, might have a speech-transcription of ``S F O'' or S_F_O, depending on the system

language-model-category = category of the lexical item in the speech recognition system's language model. Typically, this is a more semantically based category; e.g.
SFO might be a proper noun in a syntactic lexicon but a in a speech language model

What is to be done:

  Survey the semantic and translational information to produce commonalities.
  Reconcile names with printed dictionary tags.
  Translate syntax and other tag sets into actual DTDs.
  Add examples of actual lexicon entries, in original and marked-up form.

Open Questions:

The version of this entry template specified ``spelling'' and ``category'' as the only obligatory tags, with all others optional. Should any obligatory/optional distinction be made?

The pronunciation field is there for compatibility with the (developing) DARPA Common Lexicon. Do any other systems require this? What constraints should be placed on its contents? (The DARPA group is probably going to use the Arpabet notation.)

At the Berkeley meeting we raised the problem that defaults pose for unification-based lexicons: their lexical entries can be very short, and such entries can be incomprehensible if the defaults are not documented. Should we require the spelling out of defaults in interchange mode?

Is there a need for the inheritance mechanism discussed for print dictionaries in TEI AI5 W9? This is not needed in what has been surveyed so far, but what about combined Natural Language/Speech Recognition dictionaries?

[30 November 1992]
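
As a rough illustration of the entry template, the following Python sketch models a lexical entry as a record. It is not part of the proposed tag set. The class name LexicalEntry, the helper filled_fields, the category label V, and the use of plain strings, lists, and dictionaries as fillers are all assumptions made here for illustration only; the actual filler types are left to the TEI guidelines and to specific lexicons. The sketch treats ``spelling'' and ``category'' as obligatory, which is one possible answer to the first open question rather than a settled decision. The two sample entries use only the ``jam'' and ``run'' examples from the field descriptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Field order as listed in the entry template. "score" is not in the entry
# production itself, so it is placed where the field descriptions list it.
TEMPLATE_ORDER = (
    "name", "spelling", "stem", "pronunciation", "category", "paradigm",
    "morphology", "lexical_features", "subcategorization", "semantics",
    "score", "examples", "comments", "edit_history",
)


@dataclass
class LexicalEntry:
    # Assumed obligatory; lists allow multi-word (collocational) spellings.
    spelling: List[str]
    category: List[str]
    # Assumed optional; every filler type below is a simplification.
    name: Optional[str] = None
    stem: Optional[str] = None
    pronunciation: Optional[str] = None
    paradigm: Optional[str] = None
    morphology: Dict[str, str] = field(default_factory=dict)
    lexical_features: List[str] = field(default_factory=list)
    subcategorization: List[str] = field(default_factory=list)
    semantics: List[str] = field(default_factory=list)
    score: Optional[float] = None
    examples: List[str] = field(default_factory=list)
    comments: List[str] = field(default_factory=list)
    edit_history: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject entries that lack either assumed-obligatory field.
        if not self.spelling or not self.category:
            raise ValueError("spelling and category must each have a value")

    def filled_fields(self) -> List[str]:
        """Names of the fields that carry a value, in template order."""
        return [
            f for f in TEMPLATE_ORDER
            if getattr(self, f) not in (None, [], {})
        ]


# "V" is a hypothetical category label; each lexicon defines its own.
jam = LexicalEntry(spelling=["jam"], category=["V"], paradigm="S-*ED")
run = LexicalEntry(spelling=["run"], category=["V"], morphology={"PAST": "ran"})

print(jam.filled_fields())
print(run.filled_fields())
```

A record of this kind keeps the information in an entry without preserving any particular lexicon's record layout, and the filled_fields helper shows how short an entry can be when most fields are left to defaults.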