conceptEntry: A TBX-based expansion of the TEI for the encoding of onomasiological and comparative lexical data (paper)
Jack Bowers, Stefan Pernes, and Laurent Romary

Jack Bowers is a research assistant at the Austrian Academy of Sciences (ÖAW), Austrian Center for Digital Humanities (ACDH), where he is a curator of the DBÖ (Datenbank der bairischen Mundarten in Österreich) TEI corpus. He is also a PhD student at the École Pratique des Hautes Études in collaboration with team ALMAnaCH at Inria (France). His PhD thesis concerns the documentation of the Mixtepec-Mixtec language variety (spoken in the Juxtlahuaca district, Oaxaca, Mexico) and the creation of multimedia resources and a corpus for it. His background is in cognitive and functional approaches to all levels of linguistics and their interfaces (i.e., semantics, morphosyntax, phonetics, phonology, etymology, etc.). A major focus of his work is interoperability between standards for lexical markup (TEI, LMF, ONTOLEX, TBX) and the emerging prospects offered by semantic web/ontological resources for the integration of human knowledge across academic and scientific fields. He is a member of ISO (Austrian Standards, TC 37) and of the team developing an etymology extension for the LMF (Lexical Markup Framework). Jack holds a B.A. in History and French from San Francisco State University (2009) and an M.A. in Linguistics and a certificate in Computational Linguistics from San Jose State University (2012).

Stefan Pernes (Inria, France) is a discourse linguist and sociolinguist who turned to digital methods. He is currently working on the recognition of figurative language and the representation of encyclopaedic knowledge.

Laurent Romary is Directeur de Recherche at Inria, France, director general of the European infrastructure DARIAH, and guest scientist at the Centre Marc Bloch and the Academy of Sciences in Berlin. He carries out research on the modeling of semi-structured documents, with a specific emphasis on texts and language resources. He is the chairman of ISO committee TC 37 and has been a member (2001–2007), then chair (2008–2011), of the TEI (Text Encoding Initiative) Council, and is now a member of the TEI Board (2017–2018). Beyond his research activities, he has long advocated for open science principles.
In this paper we present an expansion of the ideas discussed by Romary (2014) and
Bowers & Seltmann (2016). The primary goal is to re-introduce a native form of onomasiological data representation in TEI, leveraging the expressivity of the TEI
and finding an optimal re-use for elements which are equivalent to TBX constructs, as they
provide a more differentiated content model and are ultimately better suited to the variability
of use cases in the context of the TEI community. Additionally, our proposals aim to
accommodate a more comprehensive variety of scenarios that may be relevant for a number
of different linguistic, lexicographic and terminology management use cases for which no
sufficient encoding scheme currently exists in TEI. Also included are proposals for the
encoding of a more diverse array of comparative lexicographical entry configurations that
may not always be onomasiological, such as cognate sets and dialect data, whose basis of
comparison/organization may be motivated by a common form (via common etymology)
and/or meaning.
The core use cases upon which this work is based are:
- Lexicographic data which is better suited to an onomasiological representation, i.e., multilingual terms sharing a sense and/or conceptual domain
- Taxonomies and ontologies that require fine-grained concept relations, i.e., distinguishing between generic and partitive concept hierarchies
- Inventories, field notes and other “thing-ographies” that go beyond means of encoding places and names
- Historical encyclopaedias dating back to the eighteenth century, exhibiting large and convoluted glosses.
As a result, we present conceptEntry, an element based on the TEI dictionary entry,
which follows the structural requirements of the recently updated TBX standard (ISO/CD
30042, 2017) and extends aspects of its content model, thus supporting a broad range of
data sets and scenarios beyond termbank data exchange. The new element is intended to
be added natively to the TEI namespace and is submitted in the form of an ODD file
including extensive prose descriptions, which can also serve as its official documentation as
part of the TEI guidelines.
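To make the concept-oriented (onomasiological) organization more concrete, the following is a minimal sketch in Python that groups terms from several languages under a single concept. Only the element name conceptEntry and the TEI namespace come from the proposal and the TEI itself; the child structure used here (a `def` followed by one `form`/`orth` pair per language), the function name `build_concept_entry`, the identifier `concept.bread`, and the sample data are assumptions made for illustration and do not reproduce the content model defined in the ODD.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def tei(name):
    """Return a TEI-namespaced tag name in ElementTree's {uri}local form."""
    return f"{{{TEI_NS}}}{name}"


def build_concept_entry(concept_id, definition, terms):
    """Build one concept-oriented entry.

    The concept (identifier and definition) is the organizing unit, and the
    terms of each language are attached beneath it. The child elements used
    here are an illustrative assumption, not the proposal's content model.
    """
    entry = ET.Element(tei("conceptEntry"), {f"{{{XML_NS}}}id": concept_id})
    definition_el = ET.SubElement(entry, tei("def"))
    definition_el.text = definition
    for lang, term in terms:
        # One form per language, tagged with xml:lang.
        form = ET.SubElement(
            entry, tei("form"), {"type": "lemma", f"{{{XML_NS}}}lang": lang}
        )
        orth = ET.SubElement(form, tei("orth"))
        orth.text = term
    return entry


# Hypothetical sample data: three terms sharing one concept.
entry = build_concept_entry(
    "concept.bread",
    "baked food made from flour and water",
    [("en", "bread"), ("de", "Brot"), ("fr", "pain")],
)
print(ET.tostring(entry, encoding="unicode"))
```

The point of the sketch is the direction of the mapping: a semasiological dictionary entry starts from a form and lists its senses, whereas here the entry starts from the concept and lists the forms that express it.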
Bibliography
- Bowers, J., and M. Seltmann. 2016. “Exploring data models for heterogenous dialect data: the case of explore.bread.AT!” TEI Conference and Members’ Meeting. Vienna, Austria.
- ISO/CD 30042. 2017. “Systems to manage terminology, knowledge and content — TermBase eXchange (TBX).” International Organization for Standardization. Geneva, Switzerland.
- Romary, Laurent. 2014. “TBX goes TEI–Implementing a TBX basic extension for the Text Encoding Initiative guidelines.” arXiv Preprint arXiv:1403.0052. http://arxiv.org/abs/1403.0052.