List of Common Morphological Features
For Inclusion in TEI Starter Set
Of Grammatical-Annotation Tags
Document Number: TEI AI1W2
June 11, 1991 (17:55:56)
Draft June 11, 1991 (17:55:56)
ABSTRACT
This document lists features intended for inclusion in the TEI "starter
set" of tags for grammatical annotation. It includes lists of word
classes and features which some modern European language expresses mor-
phologically (where "morphology" includes typographic or other presenta-
tional features like capitalization). It does not include all the fea-
tures found useful by those who have analyzed machine-readable texts in
the past; notably, purely syntactic and purely semantic features are not
included.
1 INTRODUCTION
This document lists features intended for inclusion in the TEI
"starter set" of tags for grammatical annotation. It includes lists of
word classes and features which some modern European language expresses
morphologically (where "morphology" includes typographic or other pres-
entational features like capitalization).(1) It does not include all
the features found useful by those who have analyzed machine-readable
texts in the past; notably, purely syntactic and purely semantic fea-
tures are not included. This indicates an attempt to focus on the core
of widely-agreed-upon features, and in no way indicates an opinion as to
the relative importance, for any purpose, of the features included or
excluded here. Most researchers providing grammatical annotation of
texts will add features to this list, and this practice should be under-
stood as a natural use of the list and the scheme it represents; it
should not be discouraged. The responsibility for the usefulness or
otherwise of the grammatical annotation provided for a given text
remains, as it should, with the annotator.
An attempt has been made to provide conventional widely-used termi-
nology. For sets of feature values, in particular, the set given is the
union of the sets commonly used for the languages in question.(2) For
this and other reasons (for example, in a particular language, not all
the features or values are relevant), an encoding for any given language
will use only a subset of the features and values given. Mechanisms for
specifying what values are to be available for a given encoding and for
specifying restrictions on well-formed feature structures are under
development and are described elsewhere (document AI 1 W3).
The next section, which is the main body of this work, is made up of
four parts.
1. "Values for Underspecification of Features" contains the values to
be used for features which the encoder wishes to underspecify.
2. "Word-Level Features and Values" defines the features that apply at
word level, which at present comprise an internally structured
class and a specification of word-form type.
3. "Recurrent Features and Values" contains a set of properties that
recur within the specification of multiple classes, though gener-
ally subject within any given language to some class-specific
restrictions; such restrictions are not enforceable by SGML, but
may be documented by the feature system declaration mentioned
above and enforced by any application which can read the feature
system declaration.
4. "Individual Word Classes" contains the idiosyncratic contents of
the individual word classes.
2 FEATURES AND VALUES
Values for Underspecification of Features
To allow various forms of underspecification and non-specification of
feature values, the following values are to be understood as possible
values for any feature, with the senses given:
any: compatible with all of the listed values (not including the others
in this list)
default: the feature has a specific value based on an analysis, but the
value is not stated
?: feature applies but value is unknown (may also be indicated by omis-
sion of the feature)
n-a: feature does not apply (may also be indicated by omission of the
feature)
no claim: analyst is not certain whether the feature applies or not; in
any case, no value is known (may also be indicated by omission of the
feature)
(omission): either the feature does not apply or its value is unknown
(ambiguous; to disambiguate or to specify that no disambiguation is
possible, use the values n-a, ?, or no claim)
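The senses listed above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a feature structure is held as a Python dict from feature names to values; the names READINGS and reading are hypothetical and are not part of the TEI scheme. Only the value names themselves are taken from the list.

```python
# Hypothetical representation: a feature structure is a plain dict
# mapping feature names to values. Only the value names come from the list.
READINGS = {
    "any": "compatible with all of the listed values",
    "default": "has a specific value, which is not stated",
    "?": "applies, but the value is unknown",
    "n-a": "does not apply",
    "no claim": "not certain whether the feature applies",
}

def reading(feature_structure, feature):
    """Describe how `feature` is specified in `feature_structure`."""
    if feature not in feature_structure:
        # Omission is ambiguous between n-a, ? and no claim.
        return "omitted: does not apply or value unknown"
    value = feature_structure[feature]
    return READINGS.get(value, "specified: " + value)

noun = {"category": "noun", "number": "plural", "case": "?", "animacy": "n-a"}
for name in ("number", "case", "animacy", "gender"):
    print(name, "->", reading(noun, name))
```

The point of the sketch is that omission is the only reading that cannot be told apart from the others, which is why the explicit values n-a, ? and no claim are provided.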
Word-Level Features and Values
Five features can be assigned to anything to be treated as a word:
* category = noun | pronoun | adjective | article | verb | adverb |
preposition | coordinator | subordinator | particle | interjection |
punctuation(3)
* form = phrase | compound | portmanteau | full | reduced | clitic |
proclitic | enclitic
* components = list of feature structures of the components of the
word(4)
* source-class = feature structure for the appropriate category(5)
* initial-capital = + | -
Recurrent Features and Values
This section lists features which appear in more than one category.
Their range of values is specified here; under each appropriate catego-
ry, they are listed by their feature-name only.
* person = 1 | 2 | 2.polite | 2.familiar | 3(6)
* number = singular | dual | plural
* case = nominative | genitive | dative | accusative | vocative |
ablative | locative | instrumental | subjective | objective | prepo-
sitional(7)
* gender = feminine | masculine | neuter | common | invariant(8)
* animacy = + | -
* degree = positive | comparative | superlative
* definiteness = definite | indefinite | specific | nonspecific | gen-
eric
* deixis = proximal | distal | remote | near-speaker | near-hearer(9)
* affect = diminutive | augmentative(10)
* polarity = positive | negative
* function = demonstrative | relative | interrogative
* directionality = static | dynamic(11)
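As a rough illustration of how an encoding for one language might use only a subset of these values, the following sketch transcribes the list into a lookup table. It is not the feature system declaration mechanism described in the companion document; the dict layout and the helper names declare and is_available are assumptions made for this example only.

```python
# Value sets transcribed from the list above; the dict layout is an assumption.
RECURRENT = {
    "person": {"1", "2", "2.polite", "2.familiar", "3"},
    "number": {"singular", "dual", "plural"},
    "case": {"nominative", "genitive", "dative", "accusative", "vocative",
             "ablative", "locative", "instrumental", "subjective",
             "objective", "prepositional"},
    "gender": {"feminine", "masculine", "neuter", "common", "invariant"},
    "animacy": {"+", "-"},
    "degree": {"positive", "comparative", "superlative"},
    "definiteness": {"definite", "indefinite", "specific", "nonspecific",
                     "generic"},
    "deixis": {"proximal", "distal", "remote", "near-speaker", "near-hearer"},
    "affect": {"diminutive", "augmentative"},
    "polarity": {"positive", "negative"},
    "function": {"demonstrative", "relative", "interrogative"},
    "directionality": {"static", "dynamic"},
}

# Values available for any feature (see the underspecification list).
UNDERSPECIFIED = {"any", "default", "?", "n-a", "no claim"}

def declare(feature, values):
    """Return the subset of listed values that one encoding makes available."""
    unknown = set(values) - RECURRENT[feature]
    if unknown:
        raise ValueError(f"{feature}: not in the list: {sorted(unknown)}")
    return set(values)

def is_available(value, declared):
    """True if `value` is declared or is one of the underspecification values."""
    return value in declared or value in UNDERSPECIFIED

# Hypothetical encoding that declares only two gender values.
two_genders = declare("gender", ["feminine", "masculine"])
print(is_available("feminine", two_genders))  # True
print(is_available("neuter", two_genders))    # False
print(is_available("any", two_genders))       # True
```

An annotator who adds values of their own would extend the table first; the check is meant to catch typing slips, not to discourage additions.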
Individual Word Classes
Nouns
If category = noun, then the following features may apply, with pos-
sible values as specified here or in section "Recurrent Features and
Values".
* number
* case
* gender
* animacy
* affect
* definiteness
* deixis(12)
* polarity
* proper = + | -(13)
* unit = + | -
Pronouns
If category = pronoun, then the following features may apply, with
possible values as specified.
* person
* number
* gender
* case
* animacy
* deixis
* polarity
* function
* possessive = + | -
* anaphora = reflexive | reciprocal
* type = personal | indefinite | expletive | partitive | locative |
propredicate | zero | referential | emphatic | demonstrative(14)
* pro-form = disjunctive | conjunctive
Adjectives
If category = adjective, then the following features may apply, with
possible values as specified.
* number
* case
* gender
* animacy
* degree
* polarity
* numeral = cardinal | ordinal
* pronominal = a pronoun feature structure
* declension = strong | weak | long | short(15)
Articles
If category = article, then the following features may apply, with
possible values as specified.
* number
* gender
* case
* definiteness
Verbs
If category = verb, then the following features may apply, with possible
values as specified.
* polarity
* agreement = a feature structure consisting of:
- person
- number
- gender
- case
* tense = present | past | future | imperfect | aorist | future-aorist
| perfect | pluperfect | future-perfect(16)
* aspect = perfective | imperfective | progressive | durative(17)
* voice = active | middle | passive | mediopassive
* mood = indicative | imperative | subjunctive | conditional(18)
* verb-form = finite | infinitive | gerund | supine | participle |
present-participle | past-participle | future-participle(19)
* p-incorporation = a feature structure consisting of:
- direct-object = feature structure for personal pronoun
- indirect-object = feature structure for personal pronoun
- object = a feature structure for personal pronoun(20)
- oblique = partitive | locative(21)
* verb-type = auxiliary | modal | lexical | copula
* reflexive = + | -(22)
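Notes 17 and 19 describe some complex values as equivalent to combinations of simpler assignments. A minimal sketch of that rewriting follows, again assuming a dict representation; the table name EQUIVALENTS and the function decompose are hypothetical. The future-aorist entry relies on treating aorist as a value of aspect, which note 17 offers only as a possibility.

```python
# Equivalences taken from notes 17 and 19; the dict representation is assumed.
EQUIVALENTS = {
    ("tense", "pluperfect"): {"tense": "past", "aspect": "perfective"},
    # Only valid if aorist is treated as an aspect value (see note 17).
    ("tense", "future-aorist"): {"tense": "future", "aspect": "aorist"},
    ("verb-form", "present-participle"): {"verb-form": "participle",
                                          "tense": "present"},
}

def decompose(verb):
    """Return a copy of `verb` with complex values split into simple ones."""
    result = dict(verb)
    for (feature, value), parts in EQUIVALENTS.items():
        if result.get(feature) != value:
            continue
        # Leave the structure alone if the rewrite would overwrite a
        # different value that the annotator has already supplied.
        conflict = any(
            name != feature and name in result and result[name] != new
            for name, new in parts.items()
        )
        if not conflict:
            result.update(parts)
    return result

print(decompose({"category": "verb", "tense": "pluperfect"}))
print(decompose({"category": "verb", "verb-form": "present-participle"}))
```

The conflict check is a design choice of the sketch: a structure that already carries, say, aspect = durative is returned unchanged rather than silently altered.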
Adverbs
If category = adverb, then the following features may apply, with
possible values as specified.(23)
* degree
* deixis
* polarity
* function
* directionality
* type = locative | temporal | manner | quantity | contrast(24)
Prepositions
If category = preposition, then the following features may apply,
with possible values as specified.
* directionality
* polarity
Portmanteau forms which include a prepositional component, such as
French du 'of the (masc.)', can be encoded using the portmanteau value
for the form feature. If the word as a whole is to be marked as a
preposition, then it should be specified category = preposition;
alternatively, no category need be assigned to the word as a whole. In
either case, one of its components should be specified category =
preposition.
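The portmanteau treatment can be sketched for French du (de + le) as follows. The dict-and-list representation and the helper has_prepositional_component are assumptions made for illustration; the feature names and values are those listed in this document.

```python
# French "du" encoded as a portmanteau with two components (de + le).
# The nested dict/list layout is an assumed, illustrative representation.
du = {
    "form": "portmanteau",
    "category": "preposition",  # optional: may be left off the whole word
    "components": [
        {"category": "preposition"},
        {"category": "article", "definiteness": "definite",
         "gender": "masculine", "number": "singular"},
    ],
}

def has_prepositional_component(word):
    """True if some component of `word` is specified category = preposition."""
    return any(
        part.get("category") == "preposition"
        for part in word.get("components", [])
    )

# The same word with no category assigned to the word as a whole.
du_uncategorized = {k: v for k, v in du.items() if k != "category"}

print(has_prepositional_component(du))                # True
print(has_prepositional_component(du_uncategorized))  # True
```

Both variants satisfy the requirement that a component be marked as a preposition, whether or not the word as a whole carries a category.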
Coordinators
If category = coordinator, then the following feature may apply:
* polarity
Subordinators
If category = subordinator, then the following feature may apply:
* polarity
Particles
If category = particle, then the following features may apply:
* polarity
* directionality
Interjections
No feature structure is specified for interjections.
Punctuation
If category = punctuation, then the following feature may apply, with
the values indicated.
* orientation = open | close | matched | unary
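The per-category feature lists above can be collected into a lookup table, so that a program can report which features in a given structure are additions to the starter set. The sketch below assumes a dict representation, and the names WORD_LEVEL, CLASS_FEATURES and unlisted_features are hypothetical. Since annotators are expected to add features, the function only reports unlisted features and does not reject them.

```python
# Feature names transcribed from the word-level and per-category lists.
WORD_LEVEL = {"category", "form", "components", "source-class",
              "initial-capital"}

CLASS_FEATURES = {
    "noun": {"number", "case", "gender", "animacy", "affect", "definiteness",
             "deixis", "polarity", "proper", "unit"},
    "pronoun": {"person", "number", "gender", "case", "animacy", "deixis",
                "polarity", "function", "possessive", "anaphora", "type",
                "pro-form"},
    "adjective": {"number", "case", "gender", "animacy", "degree", "polarity",
                  "numeral", "pronominal", "declension"},
    "article": {"number", "gender", "case", "definiteness"},
    "verb": {"polarity", "agreement", "tense", "aspect", "voice", "mood",
             "verb-form", "p-incorporation", "verb-type", "reflexive"},
    "adverb": {"degree", "deixis", "polarity", "function", "directionality",
               "type"},
    "preposition": {"directionality", "polarity"},
    "coordinator": {"polarity"},
    "subordinator": {"polarity"},
    "particle": {"polarity", "directionality"},
    "interjection": set(),
    "punctuation": {"orientation"},
}

def unlisted_features(word):
    """Return, sorted, the features of `word` not listed for its category."""
    listed = WORD_LEVEL | CLASS_FEATURES.get(word.get("category"), set())
    return sorted(set(word) - listed)

article = {"category": "article", "number": "singular", "tense": "past"}
print(unlisted_features(article))
```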
3 OTHER CATEGORIES
No separate treatment is given for other grammatical categories or
word classes, such as determiners or quantifiers, since the work group
reached no consensus on them and there appears to be little consensus in
the community. The standard exemplars of these classes can be assigned
to other classes (article, adjective, pronoun, adverb in some cases),
and the issues here tend to be sufficiently theory-dependent to make it
unclear how to define these classes and their internal structure in
general terms.
Absence of these classes as an articulated structure here should not
be taken to mean disapproval by the work group as a whole of the speci-
fication of structures for them and their treatment as separate classes.
4 OTHER TAGS
The existing TEI tags , , appear to be
usable for the objects they describe without extension. Accordingly, no
provision is made here for marking these as feature structures in the
manner practised by some grammatical annotation schemes.
-------------------------
(1) The languages explicitly considered in this first pass at the list
are Danish, Dutch, English, French, German, Greek, Italian, Portu-
guese, Russian, Spanish: that is, Russian and the languages of the
European Community. Other languages are expected to be included in
later revisions, refinements, or supplements. A listing of features
for Japanese and Korean is in preparation.
(2) For example, the values for the case feature include both nominative
and subjective, which may be thought of as alternate names for the
same value.
(3) The name category was felt by the working group to be more appropri-
ate than word-class.
(4) The feature components would normally be specified only if the fea-
ture form has the value phrase, compound or portmanteau. This meth-
od of marking portmanteaux was suggested by Geoffrey Sampson, and
supersedes the method described in the first draft of this working
paper.
(5) The source-class feature provides for the marking of feature struc-
tures for a second word class (e.g. for the verbal features of a
participle, when the outer feature structure is for an adjective)
and takes any feature structure as its value.
(6) The values 1 inclusive and 1 exclusive are not included, as they are
not relevant to the languages under consideration. Geoffrey Sampson
raised the question of the adequacy of these values for marking the
range of Portuguese second-person forms including tu, você, o senhor
and Vocência. One possibility would be to mark the first 2.familiar,
the second 2, and the third and fourth 2.polite. The markup being
developed for Japanese and Korean politeness distinctions would be
adequate to distinguish among all four of these cases.
(7) The value prepositional is needed for Russian, and may also be used
to mark the special pronominal forms in Portuguese that are governed
by prepositions, such as mim.
(8) The value neuter is to be used for distinctive neuter gender as in
German and Dutch. Common gender, which is sometimes referred to as
neuter gender in grammars of Romance languages, can be marked any,
if only the values feminine and masculine are explicitly declared;
or common, if that value is also explicitly declared.
(9) The values near-speaker and near-hearer are needed for representing
the distinctions among Portuguese demonstratives and locative
adverbs, such as aqui 'here by me', aí 'there by you' and ali 'there
away from us'. The first could be tagged near-speaker, the second
near-hearer and the third distal.
(10) The Romance languages, especially, have a large number of affixes
that can be classified as diminutive or augmentative. Portuguese,
for example, has at least four diminutive affixes and three augmen-
tative ones. Many words derived with these affixes are understood
as terms of endearment, i.e. as caritatives, or of insult, i.e.
pejoratives. We decided to include the feature affect as it is
formally marked in these and other European languages, but not with
values indicating endearment or insult, since the latter cannot be
said to be morphologically marked in any consistent way. A particularly
nice illustration is provided by Italian, in which the diminutive
suffix -uccia results in a word understood as a term of endearment
when it is added to names of persons, e.g. Mariuccia 'dear Mary', but
in a word understood as a term of insult when it is added to common
nouns, e.g. casuccia 'wretched house'.
(11) To mark the contrast between Danish ind versus inde, English in
versus into, etc.
(12) For cases such as French femme-ci, femme-là, etc.
(13) Correlates with the initial-capital feature in some languages.
(14) The referential value is for marking the Dutch form die 'he, she,
it, they', which perhaps can be better analyzed as an invariant (for
number, gender and case) demonstrative pronoun.
(15) The feature declension indicates which declensional pattern is
found in a given occurrence, where more than one is possible, for example
the strong versus weak patterns found in German prenominal adjec-
tives depending on the associated determiner elements, or the Rus-
sian long versus short declension of predicative adjectives. It is
not intended to be used to distinguish patterns of inflection to
which lexical items are idiosyncratically assigned, like the dif-
ference between first and second declensions in traditional Latin
grammar. Like other information fixed for the lemma, a feature
like declension-class can be supplied by the analyst as an addition
to this list.
(16) Other terms in common use for some of the values of the tense fea-
ture are: past-definite for past; past-aorist for aorist; present-
perfect for perfect; past-perfect and past-anterior for pluperfect;
and conditional for future-perfect.
(17) Certain "complex tenses" such as pluperfect can alternatively be
rendered as combinations of simple tense and aspect specifications.
For example, the effect of the assignment tense = pluperfect could
be rendered by the two assignments tense = past and aspect = per-
fective. Moreover, if aorist is considered as a value for aspect
instead of as a value for tense, then additional complex tenses
could similarly be analyzed as combinations. For example, the
assignment tense = future-aorist would be equivalent to the two
assignments tense = future and aspect = aorist.
(18) We limit the use of the value conditional to the feature mood, to
indicate possibility, obligation, etc. To indicate a future time
prior to some other time in the future, we use the value future-
perfect rather than conditional for the feature tense.
(19) The assignment verb-form = present-participle may alternatively be
represented by the two assignments verb-form = participle and tense
= present. The personal-infinitive in Portuguese may be distin-
guished from the impersonal-infinitive by the presence of the
agreement feature.
(20) Use the feature object if the function of the incorporated pronoun
(as direct or indirect object) is not known or not specified.
(21) The oblique feature is given simple, rather than complex, values on
the assumption that there is no need to encode more structure for
such incorporated elements as French y and en at the level of
sophistication involved in creating the grammatical feature starter
set.
(22) The feature reflexive is provided as a simple alternative to the
use of the p-incorporation feature for the marking of grammatically
reflexive verbs.
(23) No special treatment is proposed for compound adverbs made up of a
deictic adverb and a preposition, such as German dabei, darauf; Dutch
erin; English herein, thereby; etc.
(24) Some of these values have been added because of their use in stan-
dard grammars, e.g. quantity in Spanish and contrast in Danish.
However, not all of them are morphologically marked, so strictly
speaking those should not have been included in this set.