Tags of the Computational Lexica Work Group
                               by
                          Robert Ingria
 
At the meeting of the Computational Lexica Work Group held in Berkeley
on June 16, 1991, it was decided to survey a broad range of lexica in
use at sites throughout the world, along three dimensions:
 
(1) syntactic information (including morphology)
(2) semantic information
(3) translation information (used in MT systems only)
 
The notes here report the results of the survey of syntactic
information.  Because of contact with the DARPA Common Lexicon working
group, which is seeking to develop a common interchange format for
lexicons for Spoken Language Systems, some issues related to
pronunciation are also included.
 
Many of the tags included here have counterparts in the tags proposed
by the Print Dictionary Work Group.  (Reconciliation of the names is a
topic for later discussion.)  However, there are some major
differences between the types of tags needed for print dictionaries
and computational lexicons:
 
(1) There is less need for grouping tags.  As for print dictionaries,
a general grouping tag will be provided, but it is used much less
frequently in computational entries.
 
(2) Some print dictionary tags, such as etymology, are absent, while
others, such as grammar code, have expanded into several different
tags.
 
(3) Non-grouping tags from the lexicon tag set may contain tags from
other TEI tag sets, most prominently feature structures.  Thus, the
tags for lexicons may be divided into three types: ``atomic''
tags---those tags which do not allow for the inclusion of any tags
within them; ``pseudo-atomic'' tags---which allow only
non-lexicon-specific tags; and ``grouping'' tags---which allow
lexicon-specific tags.
 
One other major difference between the tag set for printed
dictionaries and that for computational lexicons is that the printed
dictionary tag set allows for an encoding that preserves the
appearance of the printed text.  Since there is no original text in
the case of computational lexicons, we have decided to adopt a tag set
that preserves the information in a given computational lexicon but
makes no effort to preserve the original arrangement of this
data (record structure, code, etc.).  This prevents the proliferation
of ad hoc tags based on idiosyncratic features of particular lexicons.
 
The general structure of a lexical entry is as follows:
 
========================================================================
 
Definition of a lexical entry in BNF
-------------------------------------
 
entry := name spelling stem pronunciation category paradigm morphology
         lexical-features subcategorization semantics score examples
         comments edit-history
 
name = an index unique to each lexical entry
 
spelling = orthography; this corresponds to the citation form in a
           print dictionary
 
stem = the actual stem operated on by the system's morphological component
 
pronunciation = phonetic representation
 
category = part of speech
 
paradigm = the general inflectional category of the lexical item
           e.g. S-*ED for a word like ``jam'' that forms its third
           person singular present by adding -s (``jams'') and its
           past tense by doubling the final consonant and adding
           -ed (``jammed'')
 
morphology = specification of irregular forms
             e.g. ``PAST = ran'' for ``run''
             [perhaps ``inflected forms'' would be a better name]
 
lexical-features = other syntactic information
                   e.g. count or mass for nouns
                        position of occurrence for adjectives in English
 
subcategorization = a specification of the complements a lexical item
                    appears with; e.g. for a verb, whether it is
                    transitive, intransitive, etc.
 
semantics = some specification of the semantics of the item.
            This seems to range from atoms to complex formulae.
 
score = a probability or other weight, specifying the
        likelihood of occurrence of this sense of the lexical item
        (this may be the raw frequency of occurrence of the entire
         lexical item, if there are no subentries)
 
examples = examples of uses of the lexical item
 
comments = miscellaneous comments by system developers and users
 
edit-history = a history of the creation and modification of the
               lexical entry
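
As a minimal sketch of how the paradigm and morphology fields could
work together, the Python below reads S-*ED as described above: third
person singular in -s, past tense by doubling the final consonant and
adding -ed, with any irregular form listed under morphology taking
precedence.  The function name, the feature names 3SG and PAST, and
the dictionary layout are assumptions made for illustration; the
survey does not prescribe them.

```python
def inflect(stem, paradigm, feature, morphology=None):
    """Return the inflected form of `stem` for `feature`.

    `morphology` maps morph-features to irregular forms and takes
    precedence over the regular paradigm (an assumed convention).
    """
    morphology = morphology or {}
    if feature in morphology:
        return morphology[feature]
    if paradigm == "S-*ED":
        if feature == "3SG":
            return stem + "s"
        if feature == "PAST":
            # "*" is read here as: double the final consonant.
            return stem + stem[-1] + "ed"
    raise ValueError(f"no rule for {feature!r} in paradigm {paradigm!r}")


print(inflect("jam", "S-*ED", "3SG"))
print(inflect("jam", "S-*ED", "PAST"))
# Irregular form supplied by the morphology field ("PAST = ran").
print(inflect("run", None, "PAST", {"PAST": "ran"}))
```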
 
Specification of data type fillers for fields:
----------------------------------------------
 
name := <atom>
spelling := <atom>+ [note that this allows for collocational entries]
category := <cat>+
paradigm := <paradigm-indicator>
morphology := morph-value-pairs
lexical-features := <atomic-lexical-feature>+ | lex-feature-value-pairs
subcategorization := <atomic-subcat-feature>+ | subcat-feature-list
semantics := semantics-feature+
pronunciation, examples, comments, edit-history := <string>
 
Definition of fillers:
---------------------
 
morph-value-pairs := (<morph-feature> <atom>)+
lex-feature-value-pairs := (<lex-feature> <lex-value>)+
subcat-feature-list := <feature-structure>+
semantics-feature := <atomic-semantic-feature> | <feature-structure>
 
Types to be defined by specific lexicons:
----------------------------------------
 
cat, paradigm-indicator, morph-feature, atomic-lexical-feature,
lex-feature, lex-value, atomic-subcat-feature,
atomic-semantic-feature
 
We assume <feature-structure> as defined in the TEI guidelines.
 
We assume the existence of a grouping tag akin to the se/sense/grp
grouping tag of the printed dictionary group.
 
========================================================================
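
One way to read the template above is as a record type.  The Python
sketch below is an illustration only, not a proposed encoding: the
class name LexicalEntry, the choice of tuples, dicts and lists as
containers, and the sample values for ``jam'' are all assumptions.
It treats spelling and category as the only required fields, which is
itself one of the open questions listed later in these notes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LexicalEntry:
    # Treated as obligatory in this sketch (an assumption).
    spelling: tuple   # <atom>+ ; several atoms allow collocations
    category: tuple   # <cat>+
    # Everything else is optional here.
    name: Optional[str] = None
    stem: Optional[str] = None
    pronunciation: Optional[str] = None
    paradigm: Optional[str] = None
    morphology: dict = field(default_factory=dict)        # morph-feature -> atom
    lexical_features: dict = field(default_factory=dict)  # lex-feature -> lex-value
    subcategorization: list = field(default_factory=list)
    semantics: list = field(default_factory=list)
    score: Optional[float] = None
    examples: Optional[str] = None
    comments: Optional[str] = None
    edit_history: Optional[str] = None

    def __post_init__(self):
        # Both fields are "+" in the template: at least one value each.
        if not self.spelling or not self.category:
            raise ValueError("spelling and category need at least one value")


# Hypothetical sample values, for illustration only.
jam = LexicalEntry(
    name="jam-v-1",
    spelling=("jam",),
    category=("V",),
    paradigm="S-*ED",
    subcategorization=["transitive", "intransitive"],
)
print(jam.spelling, jam.category, jam.paradigm)
```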
 
The work on the DARPA Common Lexicon has also produced the following
potential additional tags:
 
speech-transcription = speech transcription; this may be different
        from the standard orthography; e.g. SFO, the abbreviation for
        San Francisco airport might have a speech-transcription of
        ``S F O'' or S_F_O, depending on the system
 
language-model-category = category of the lexical item in the speech
        recognition system's language model.  Typically, this is a
        more semantically based category; e.g. SFO might be a proper
        noun in a syntactic lexicon but an <airport> in a speech
        language model
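
To make the distinction concrete, the sketch below writes the SFO
example as a Python dictionary.  The key names mirror the tag names
above; the value "proper-noun" and the choice of the ``S F O'' variant
are assumptions for illustration.

```python
# Hypothetical entry for the SFO example; all values are illustrative.
sfo = {
    "spelling": ["SFO"],
    "category": ["proper-noun"],             # syntactic lexicon
    "speech-transcription": "S F O",         # "S_F_O" in some systems
    "language-model-category": "<airport>",  # speech language model
}

# The two category fields classify the same item for different components.
print(sfo["category"][0], "vs", sfo["language-model-category"])
```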
 
What is to be done:
 
Survey the semantic and translational information to identify
commonalities.
 
Reconcile names with printed dictionary tags.
 
Translate syntax and other tag sets into actual DTDs.
 
Add examples of actual lexicon entries, in original and marked up
form.
 
Open Questions:
 
An earlier version of this entry template specified ``spelling'' and
``category'' as the only obligatory tags, with all others optional.
Should any obligatory/optional distinction be made?
 
The pronunciation field is there for compatibility with the
(developing) DARPA Common Lexicon.  Do any other systems require this?
What constraints should be placed on its contents?  (The DARPA group is
probably going to use the Arpabet notation.)
 
At the Berkeley meeting we raised the problem that defaults pose for
unification-based lexicons: their lexical entries can be very short,
and can be incomprehensible if the defaults are not documented.
Should we require the spelling out of defaults in interchange mode?
 
Is there a need for the inheritance mechanism discussed for print
dictionaries in TEI AI5 W9?  This is not needed in what has been surveyed
so far, but what about combined Natural Language/Speech Recognition
dictionaries?
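
The question about defaults raised above can be illustrated with a
small sketch.  Assuming a lexicon keeps per-category default features,
spelling out defaults for interchange would amount to merging them
into each terse entry.  The default table, the feature names, and the
function expand_defaults are hypothetical; a real lexicon would define
its own.

```python
# Hypothetical per-category defaults; a real lexicon would define its own.
DEFAULTS = {
    "N": {"count": "yes", "number": "singular"},
    "V": {"aux": "no"},
}


def expand_defaults(entry, defaults=DEFAULTS):
    """Return a copy of `entry` with category defaults made explicit.

    Values stated in the entry override the defaults.
    """
    expanded = dict(entry)
    features = dict(defaults.get(entry["category"], {}))
    features.update(entry.get("lexical-features", {}))
    expanded["lexical-features"] = features
    return expanded


# A terse entry that states only what differs from the defaults.
terse = {"spelling": "water", "category": "N",
         "lexical-features": {"count": "no"}}
print(expand_defaults(terse))
```

Without the default table, the terse entry says nothing about number;
the expanded copy carries that information with it.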

[30 November 1992]