List of Common Morphological Features
For Inclusion in TEI Starter Set
Of Grammatical-Annotation Tags

Document Number: TEI AI1W2
June 11, 1991 (17:55:56)
Draft June 11, 1991 (17:55:56)

ABSTRACT

This document lists features intended for inclusion in the TEI "starter set" of tags for grammatical annotation. It includes lists of word classes and features which some modern European language expresses morphologically (where "morphology" includes typographic or other presentational features like capitalization). It does not include all the features found useful by those who have analyzed machine-readable texts in the past; notably, purely syntactic and purely semantic features are not included.

1 INTRODUCTION

This document lists features intended for inclusion in the TEI "starter set" of tags for grammatical annotation. It includes lists of word classes and features which some modern European language expresses morphologically (where "morphology" includes typographic or other presentational features like capitalization).(1) It does not include all the features found useful by those who have analyzed machine-readable texts in the past; notably, purely syntactic and purely semantic features are not included.

This indicates an attempt to focus on the core of widely-agreed-upon features, and in no way indicates an opinion as to the relative importance, for any purpose, of the features included or excluded here. Most researchers providing grammatical annotation of texts will add features to this list, and this practice should be understood as a natural use of the list and the scheme it represents; it should not be discouraged. The responsibility for the usefulness or otherwise of the grammatical annotation provided for a given text remains, as it should, with the annotator.

An attempt has been made to provide conventional widely-used terminology.
For sets of feature values, in particular, the set given is the union of the sets commonly used for the languages in question.(2) For this and other reasons (for example, in a particular language, not all the features or values are relevant), an encoding for any given language will use only a subset of the features and values given. Mechanisms for specifying what values are to be available for a given encoding and for specifying restrictions on well-formed feature structures are under development and are described elsewhere (document AI 1 W3).

The next section, which is the main body of this work, is made up of four parts.

1. "Values for Underspecification of Features" contains the values to be used for features which the encoder wishes to underspecify.

2. "Word-Level Features and Values" contains the definition of features at word-level, which at present contains an internally-structured class and a specification of word-form type.

3. "Recurrent Features and Values" contains a set of properties that recur within the specification of multiple classes, though generally subject within any given language to some class-specific restrictions; such restrictions are not enforceable by SGML, but may be documented by the feature system declaration mentioned above and enforced by any application which can read the feature system declaration.

4. "Individual Word Classes" contains the idiosyncratic contents of the individual word classes.
2 FEATURES AND VALUES

Values for Underspecification of Features

To allow various forms of underspecification and non-specification of feature values, the following values are to be understood as possible values for any feature, with the senses given:

any: compatible with all of the listed values (not including the others in this list)

default: the feature has a specific value based on an analysis, but the value is not stated

?: feature applies but value is unknown (may also be indicated by omission of the feature)

n-a: feature does not apply (may also be indicated by omission of the feature)

no claim: analyst is not certain whether the feature applies or not; in any case, no value is known (may also be indicated by omission of the feature)

(omission): either the feature does not apply or its value is unknown (ambiguous; to disambiguate or to specify that no disambiguation is possible, use the values n-a, ?, or no claim)

Word-Level Features and Values

Five features can be assigned to anything to be treated as a word:

* category = noun | pronoun | adjective | article | verb | adverb | preposition | coordinator | subordinator | particle | interjection | punctuation(3)
* form = phrase | compound | portmanteau | full | reduced | clitic | proclitic | enclitic
* components = list of feature structures of the components of the word(4)
* source-class = feature structure for the appropriate category(5)
* initial-capital = + | -

Recurrent Features and Values

This section lists features which appear in more than one category. Their range of values is specified here; under each appropriate category, they are listed by their feature-name only.
* person = 1 | 2 | 2.polite | 2.familiar | 3(6)
* number = singular | dual | plural
* case = nominative | genitive | dative | accusative | vocative | ablative | locative | instrumental | subjective | objective | prepositional(7)
* gender = feminine | masculine | neuter | common | invariant(8)
* animacy = + | -
* degree = positive | comparative | superlative
* definiteness = definite | indefinite | specific | nonspecific | generic
* deixis = proximal | distal | remote | near-speaker | near-hearer(9)
* affect = diminutive | augmentative(10)
* polarity = positive | negative
* function = demonstrative | relative | interrogative
* directionality = static | dynamic(11)

Individual Word Classes

Nouns

If category = noun, then the following features may apply, with possible values as specified here or in section "Recurrent Features and Values".

* number
* case
* gender
* animacy
* affect
* definiteness
* deixis(12)
* polarity
* proper = + | -(13)
* unit = + | -

Pronouns

If category = pronoun, then the following features may apply, with possible values as specified.

* person
* number
* gender
* case
* animacy
* deixis
* polarity
* function
* possessive = + | -
* anaphora = reflexive | reciprocal
* type = personal | indefinite | expletive | partitive | locative | propredicate | zero | referential | emphatic | demonstrative(14)
* pro-form = disjunctive | conjunctive

Adjectives

If category = adjective, then the following features may apply, with possible values as specified.

* number
* case
* gender
* animacy
* degree
* polarity
* numeral = cardinal | ordinal
* pronominal = a pronoun feature structure
* declension = strong | weak | long | short(15)

Articles

If category = article, then the following features may apply, with possible values as specified.

* number
* gender
* case
* definiteness

Verbs

If category = verb, then the following features may apply, with possible values as specified.
* polarity
* agreement = a feature structure consisting of:
  - person
  - number
  - gender
  - case
* tense = present | past | future | imperfect | aorist | future-aorist | perfect | pluperfect | future-perfect(16)
* aspect = perfective | imperfective | progressive | durative(17)
* voice = active | middle | passive | mediopassive
* mood = indicative | imperative | subjunctive | conditional(18)
* verb-form = finite | infinitive | gerund | supine | participle | present-participle | past-participle | future-participle(19)
* p-incorporation = a feature structure consisting of:
  - direct-object = feature structure for personal pronoun
  - indirect-object = feature structure for personal pronoun
  - object = a feature structure for personal pronoun(20)
  - oblique = partitive | locative(21)
* verb-type = auxiliary | modal | lexical | copula
* reflexive = + | -(22)

Adverbs

If category = adverb, then the following features may apply, with possible values as specified.(23)

* degree
* deixis
* polarity
* function
* directionality
* type = locative | temporal | manner | quantity | contrast(24)

Prepositions

If category = preposition, then the following features may apply, with possible values as specified.

* directionality
* polarity

Portmanteau forms which include a prepositional component, such as French du 'of the (masc.)', can be encoded using the portmanteau value for the form feature. If the word as a whole is to be marked as a preposition, then it should be specified category = preposition. Alternatively, no category need be assigned to the word as a whole. In either case, one of its components should be specified category = preposition.
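The portmanteau treatment just described can be illustrated with a small sketch. The plain-dictionary representation and the helper name has_prepositional_component are assumptions made for this example and are not the TEI encoding itself; the analysis of du as de plus le (masculine singular definite article) is the standard one.

```python
# A plain-dictionary rendering of the feature structure for French "du".
# This representation is an illustrative assumption, not TEI markup.
du = {
    "form": "portmanteau",
    "category": "preposition",  # optional: may be omitted for the whole word
    "components": [
        # de
        {"category": "preposition"},
        # le
        {"category": "article", "definiteness": "definite",
         "gender": "masculine", "number": "singular"},
    ],
}


def has_prepositional_component(word):
    """True if any listed component is specified category = preposition."""
    return any(component.get("category") == "preposition"
               for component in word.get("components", []))
```

Whether or not the word as a whole carries category = preposition, the check on the components gives the same answer, which mirrors the requirement stated above.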
Coordinators

If category = coordinator, then the following feature may apply:

* polarity

Subordinators

If category = subordinator, then the following feature may apply:

* polarity

Particles

If category = particle, then the following features may apply:

* polarity
* directionality

Interjections

No feature structure is specified for interjections.

Punctuation

If category = punctuation, then the following feature may apply, with the values indicated.

* orientation = open | close | matched | unary

3 OTHER CATEGORIES

No separate treatment is given for other grammatical categories or word classes, such as determiners or quantifiers, since no consensus was reached and there appears to be little consensus in the community. The standard exemplars of these classes can be assigned to other classes (article, adjective, pronoun, adverb in some cases) and the issues here tend to be sufficiently theory-dependent to make it unclear how to define these classes and their internal structure in general terms.

Absence of these classes as an articulated structure here should not be taken to mean disapproval by the work group as a whole of the specification of structures for them and their treatment as separate classes.

4 OTHER TAGS

The existing TEI tags , , appear to be usable for the objects they describe without extension. Accordingly, no provision is made here for marking these as feature structures in the manner practised by some grammatical annotation schemes.

-------------------------

(1) The languages explicitly considered in this first pass at the list are Danish, Dutch, English, French, German, Greek, Italian, Portuguese, Russian, Spanish: that is, Russian and the languages of the European Community. Other languages are expected to be included in later revisions, refinements, or supplements. A listing of features for Japanese and Korean is in preparation.
(2) For example, the values for the case feature include both nominative and subjective, which may be thought of as alternate names for the same value.

(3) The name category was felt by the working group to be more appropriate than word-class.

(4) The feature components would normally be specified only if the feature form has the value phrase, compound or portmanteau. This method of marking portmanteaux was suggested by Geoffrey Sampson, and supersedes the method described in the first draft of this working paper.

(5) The source-class feature provides for the marking of feature structures for a second word class (e.g. for the verbal features of a participle, when the outer feature structure is for an adjective) and takes any feature structure as its value.

(6) The values 1 inclusive and 1 exclusive are not included, as they are not relevant to the languages under consideration. Geoffrey Sampson raised the question of the adequacy of these values for marking the range of Portuguese second-person forms including tu, você, o senhor and Vocência. One possibility would be to mark the first 2.familiar, the second 2, and the third and fourth 2.polite. The markup being developed for Japanese and Korean politeness distinctions would be adequate to distinguish among all four of these cases.

(7) The value prepositional is needed for Russian, and may also be used to mark the special pronominal forms in Portuguese that are governed by prepositions, such as mim.

(8) The value neuter is to be used for distinctive neuter gender as in German and Dutch. Common gender, which is sometimes referred to as neuter gender in grammars of Romance languages, can be marked any, if only the values feminine and masculine are explicitly declared; or common, if that value is also explicitly declared.
(9) The values near-speaker and near-hearer are needed for representing the distinctions among Portuguese demonstratives and locative adverbs, such as aqui 'here by me', aí 'there by you' and ali 'there away from us'. The first could be tagged near-speaker, the second near-hearer and the third distal.

(10) The Romance languages, especially, have a large number of affixes that can be classified as diminutive or augmentative. Portuguese, for example, has at least four diminutive affixes and three augmentative ones. Many words derived with these affixes are understood as terms of endearment, i.e. as caritatives, or of insult, i.e. pejoratives. We decided to include the feature affect as it is formally marked in these and other European languages, but not with values indicating endearment or insult, since the latter cannot be said to be morphologically marked in any consistent way. A particularly nice illustration is provided by Italian, in which the diminutive suffix -uccia results in a word understood as a term of endearment when it is added to names of persons, e.g. Mariuccia 'dear Mary'; but in a word understood as a term of insult when it is added to common nouns, e.g. casuccia 'wretched house'.

(11) To mark the contrast between Danish ind versus inde, English in versus into, etc.

(12) For cases such as French femme-ci, femme-là, etc.

(13) Correlates with the initial-capital feature in some languages.

(14) The referential value is for marking the Dutch form die 'he, she, it, they', which perhaps can be better analyzed as an invariant (for number, gender and case) demonstrative pronoun.

(15) The feature declension indicates which declensional pattern is found in a given occurrence, where more than one is possible, for example the strong versus weak patterns found in German prenominal adjectives depending on the associated determiner elements, or the Russian long versus short declension of predicative adjectives.
It is not intended to be used to distinguish patterns of inflection to which lexical items are idiosyncratically assigned, like the difference between first and second declensions in traditional Latin grammar. Like other information fixed for the lemma, a feature like declension-class can be supplied by the analyst as an addition to this list.

(16) Other terms in common use for some of the values of the tense feature are: past-definite for past; past-aorist for aorist; present-perfect for perfect; past-perfect and past-anterior for pluperfect; and conditional for future-perfect.

(17) Certain "complex tenses" such as pluperfect can alternatively be rendered as combinations of simple tense and aspect specifications. For example, the effect of the assignment tense = pluperfect could be rendered by the two assignments tense = past and aspect = perfective. Moreover, if aorist is considered as a value for aspect instead of as a value for tense, then additional complex tenses could similarly be analyzed as combinations. For example, the assignment tense = future-aorist would be equivalent to the two assignments tense = future and aspect = aorist.

(18) We limit the use of the value conditional to the feature mood, to indicate possibility, obligation, etc. To indicate a future time prior to some other time in the future, we use the value future-perfect rather than conditional for the feature tense.

(19) The assignment verb-form = present-participle may alternatively be represented by the two assignments verb-form = participle and tense = present. The personal-infinitive in Portuguese may be distinguished from the impersonal-infinitive by the presence of the agreement feature.

(20) Use the feature object if the function of the incorporated pronoun (as direct or indirect object) is not known or not specified.
(21) The oblique feature is given simple, rather than complex, values on the assumption that there is no need to encode more structure for such incorporated elements as French y and en at the level of sophistication involved in creating the grammatical feature starter set.

(22) The feature reflexive is provided as a simple alternative to the use of the p-incorporation feature for the marking of grammatically reflexive verbs.

(23) No special treatment is proposed for compound adverbs made up of a deictic adverb and a preposition, such as German dabei, darauf; Dutch erin; English herein, thereby; etc.

(24) Some of these values have been added because of their use in standard grammars, e.g. quantity in Spanish and contrast in Danish. However, not all of them are morphologically marked, so strictly speaking those should not have been included in this set.
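Notes (17) and (19) state that certain complex values may be rendered as combinations of simpler assignments. The sketch below applies those stated equivalences to a feature structure held as a plain dictionary. The representation and the function name decompose are assumptions made for illustration, and the future-aorist entry holds only under the alternative analysis in which aorist is treated as a value of aspect.

```python
# Equivalences taken from notes (17) and (19). The second entry assumes the
# alternative analysis in which "aorist" is a value of aspect, not of tense.
TENSE_DECOMPOSITION = {
    "pluperfect": {"tense": "past", "aspect": "perfective"},
    "future-aorist": {"tense": "future", "aspect": "aorist"},
}


def decompose(verb):
    """Return a copy of a verb feature structure with complex values split."""
    result = dict(verb)
    tense = result.get("tense")
    if tense in TENSE_DECOMPOSITION:
        # Replaces the tense and overwrites any aspect already present.
        result.update(TENSE_DECOMPOSITION[tense])
    if result.get("verb-form") == "present-participle":
        result["verb-form"] = "participle"
        # Keep an explicitly stated tense if the annotator supplied one.
        result.setdefault("tense", "present")
    return result
```

A structure that uses none of the complex values is returned unchanged, so the function can be applied uniformly when comparing annotations that follow the two styles.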