The Linked TEI: Text Encoding in the Web

TEI Conference and Members Meeting 2013: October 2-5, Rome (Italy)

Abstracts of panels and round tables

Computer-mediated communication in TEI: What lies ahead

Introduction

The social web has brought forth various genres of interpersonal communication (computer-mediated communication, henceforth: cmc) such as chats, discussion forums, wiki talk pages, Twitter, comment and discussion threads on weblogs and social network sites. These genres display linguistic and structural peculiarities which differ both from speech and from written text. Projects that want to build and exchange cmc corpora would greatly benefit from a standard that allows the user to annotate these peculiarities in TEI.

From the perspective of several corpus projects which aim at building and annotating cmc corpora for several European languages, this panel will discuss how the models provided by the TEI encoding framework may be adapted to the special requirements of cmc genres.

The basis of the discussion is a customized TEI schema presented at the TEI conference held in Würzburg in 2011 (Beißwenger et al. 2012). The panel papers will elaborate on basic features that a TEI standard for cmc resources should include and outline open issues with which further work will have to deal.

The overall goal of the panel is to stimulate the discussion within the TEI community about what a standard for the representation of cmc in TEI should look like and what might be a practical and reasonable way to go about creating such a standard.

In order to push the development of a general standard for the representation of cmc genres and cmc discourse forward, the papers in the panel will present problem overviews for basic issues in representing cmc features in TEI P5 and outline perspectives as well as first suggestions for the treatment of these challenges through modifications and expansions of the encoding framework. Starting from these suggestions, the group is planning to work out feature requests and load them onto the TEI projects page on sourceforge.net.

After a general introduction, paper 1 asserts that solutions for the representation of cmc in TEI should be included in the official TEI guidelines and not remain a task that research and corpus projects have to solve using individual customizations. In addition, the paper formulates general requirements a framework for the representation of cmc (in TEI) should comply with as well as specific requirements from several projects which are currently building corpora of cmc discourse for four European languages (Dutch, German, French, and Italian).

Taking into account the requirements outlined in paper 1, paper 2 starts with an overview of existing suggestions for the representation of basic structural and linguistic features of cmc discourse in the TEI framework. It then presents considerations on the following open issues: (1) the modeling of different types of citations in cmc postings; (2) the modeling of hypermedia features (hyperlinks and linking structures, embedded media objects); (3) challenges related to the representation of discourse in multimodal cmc environments in which the participants in one interaction space combine a variety of modalities from written, spoken and non-verbal modes.

Paper 3 examines the issue of metadata. It discusses general requirements for representing metadata of cmc resources and outlines a proposal for representing cmc metadata in the TEI framework.

The panel will include 30 minutes of discussion time (15 minutes each after paper 2 and 3).

Paper 1: Modeling computer-mediated communication in TEI: requirements and perspectives

This paper reports on ongoing work in a network of corpus projects which aim at building and annotating corpora of computer-mediated communication (cmc) and asserts that a framework for the representation of cmc should become a part of the TEI guidelines. It gives an overview of research fields in the Humanities and Computer Sciences which would benefit from the availability of such a representation framework and outlines the basic requirements it will have to comply with:

  • The schema should provide a general model for the description of the structural and linguistic peculiarities of cmc discourse.
  • To be useful for a broad range of application contexts in the Humanities, it should not be designed with one single project in mind but it should take into account the specific requirements of several projects (and genre typologies) in which the creation of annotated cmc resources is of interest.
  • In order to be suitable for small data sets which are annotated manually and also for the annotation of big data (e.g., reference corpora in Linguistics, large web corpora in the field of Natural Language Processing), its basic structure should be defined in a way that favours or supports (at least partially) automatic annotation procedures.
  • The schema should build on a review of models which already exist in the TEI framework (currently TEI P5) and adapt them to the peculiarities of cmc genres in a reasonable and practical way.
  • It should reflect the fact that cmc shares characteristics with written text as well as with spoken conversation while at the same time it is significantly different from both in its textual form and in the mode of production and reception.
  • It should allow for an easy (and reversible) anonymization of cmc resources for purposes in which they shall be made available for other researchers (e.g., in the case of reference corpora).
  • It should allow for an easy referencing of random samples of the resource (e.g., for citation in scientific publications, didactic materials or dictionary articles).
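
The requirement of an easy and reversible anonymization can be illustrated with a minimal Python sketch. It assumes that postings are available as simple (author, text) pairs and that pseudonyms of the form `[_A01_]` do not occur in the data; both the data layout and the pseudonym format are illustrative assumptions, not part of any of the project schemas.

```python
import re

def pseudonymize(postings):
    """Mask author names (also inside the text) and return (masked, key).

    The key maps real names to pseudonyms and is meant to be stored
    separately from the released corpus, which keeps the step reversible.
    """
    key = {}
    for author, _ in postings:
        # Hypothetical pseudonym format, assumed not to occur in the data.
        key.setdefault(author, f"[_A{len(key) + 1:02d}_]")
    if not key:
        return [], {}
    names = sorted(key, key=len, reverse=True)  # longest names first
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    masked = [
        (key[author], pattern.sub(lambda m: key[m.group(1)], text))
        for author, text in postings
    ]
    return masked, key

def restore(masked, key):
    """Undo pseudonymize() using the separately stored key."""
    inverse = {pseudo: name for name, pseudo in key.items()}
    pattern = re.compile("|".join(map(re.escape, inverse)))
    return [
        (inverse[author], pattern.sub(lambda m: inverse[m.group(0)], text))
        for author, text in masked
    ]

postings = [("tina", "hi all"), ("bob", "hello tina"), ("tina", "bye bob")]
masked, key = pseudonymize(postings)
```

The sketch only covers exact, case-sensitive matches of known author names; real data would additionally need handling of name variants and of other personal details.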

Since papers 2 and 3 of the panel take into consideration the goals and needs of several projects which are currently dealing with the construction of corpora of cmc discourse in four European languages, paper 1 includes a brief presentation of the four projects and an outline of their project-specific requirements for an annotation schema:

  • DeRiK (“Deutsches Referenzkorpus zur internetbasierten Kommunikation”) is a joint project of TU Dortmund University and the Berlin-Brandenburg Academy of Sciences and the Humanities (BBAW) which is building a reference corpus of German cmc discourse including the most prominent cmc genres. The DeRiK corpus will form a new component of the reference corpora of contemporary written German collected in the BBAW project “Digitales Wörterbuch der deutschen Sprache” (DWDS). On the one hand, it is designed as a resource for corpus-based linguistic analyses of language use in German cmc as well as – in combination with the DWDS corpus – of the impact of cmc genres on contemporary written German. On the other hand, it will serve as a resource for the lexicographic description of “netspeak” vocabulary and cmc-specific processes of lexical-semantic change in the dictionary component of the DWDS online lexical information system (cf. Beißwenger et al. 2013). For annotation, DeRiK is currently using the customized TEI schema for cmc described in Beißwenger et al. (2012). The schema comprises, among others, an element for the description of user contributions to cmc conversations (the divLike element posting), a distinction of two major types of cmc macrostructures (the cmc-specific division types ‘thread’ and ‘logfile’), a component for modeling the authors of cmc postings as well as elements for the annotation of selected “netspeak” features in individual user postings (emoticons, interaction words, interaction templates, addressing terms).
  • The Dutch reference corpus SoNaR was intended to serve as a general reference for studies involving language and language use. The corpus should provide a balanced account of the standard language and the variation that occurs within it. In doing so, it allows researchers investigating language use in a particular domain (e.g. medicine) or register (e.g. academic writing) or by a specific group (e.g. professional translators) to relate their data and findings to the general reference corpus. The corpus was also intended to play a role in the benchmarking of tools and annotations. Collected in 2008-2012, the corpus contains 500 million words, including discussion lists, e-magazines, websites, Wikipedia, SMS, chats and tweets. SoNaR is delivered in the FoLiA format (van Gompel 2012). FoLiA aims to support a wide variety of linguistic annotations in a generic paradigm and has been successfully adopted by various projects in the Netherlands. To provide support for new media, a type of structure annotation called “event annotation” was added, which fits nicely in the paradigm. SoNaR incorporates support for tweets, chat logs and SMS. The former two have been encoded as events, in which each tweet or chat message constitutes an event. Within the event structure, further subdivisions can optionally be made, such as paragraphs, sentences, words (in case of tokenized data). Elements in FoLiA carry a class from a certain set. In this way flexibility is provided to the user. The sets can be formally defined. The events in SoNaR are assigned classes such as “tweet” or “chatmessage”. The actors of the set are also explicitly annotated, and further metadata on the annotation is also supported.
  • LETEC (“Learning & Teaching Corpora”): The Mulce repository is a databank of LETEC corpora built upon online learning situations (Reffay, Betbeder & Chanier, 2012). All interactions among participants have been collected and structured before their analysis. It assembles a large variety of cmc types: email, forums, chat, blogs, 3D environments with audio and text chats, etc. One of the main components of its XML structure (Mulce-struct) is the workspace. It includes descriptions of its members as references to the participants registered in the learning activity, starting and ending dates, the tools and the interaction tracks or acts that occurred using these tools. Each cmc tool has a detailed and specific structure. Large subparts of the LETEC databank will be integrated in 2013-14 into a nationwide cmc corpus in French where other cmc types, such as SMS, tweets, Wikipedia forums, will be added. The cmc SIG group leading the project belongs to the national consortium “IR corpus-écrits” in charge of building a reference corpus in French. The cmc SIG has designed a working package which will take care of the cmc TEI structure of the whole corpus and work jointly with the European colleagues gathered in this panel.
  • Web2Corpus_it (“Corpus italiano di comunicazione mediata dal computer”) is a project funded by Sapienza University of Rome in 2010 aimed at investigating meaning negotiation strategies in cmc. It focuses on conversational, interactive, public, written communication in order to build a genre-balanced cmc corpus of Italian language to be investigated both qualitatively and quantitatively. The genres included are: forum, blog, newsgroup, social network and chat (cf. Chiari and Canzonetti, in press). The collected corpus comprises one million words. It has been fully anonymized (by masking) to avoid disclosing personal details of participants, and XML-annotated for macro-structural properties (thread, post, sender details – avatar | signature | nickname | senderplace – subject, date, time, links and embedded media, web action elements) as well as for cmc-specific emoticons, tags and addressing terms. At present the corpus is being processed linguistically with a statistical POS tagger and lemmatizer, including a reference machine dictionary (Common Lexicon of Italian) developed in order to include cmc-specific lexical items; it will subsequently be checked manually and is planned to be released in late 2013.
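
To make the building blocks of the DeRiK schema more concrete, the following Python sketch parses a small hand-written fragment with a ‘thread’ division containing posting elements. The element name posting and the division type come from the description above; the attribute `who`, the use of `xml:id`, the element name `emoticon` and the sample content are assumptions made for illustration, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hand-written illustration, not taken from the DeRiK corpus.
SAMPLE = """
<div type="thread">
  <posting xml:id="p1" who="#A01">Has anyone tried the new schema?</posting>
  <posting xml:id="p2" who="#A02">Yes <emoticon>:-)</emoticon></posting>
</div>
"""

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def postings_by_author(xml_text):
    """Group (posting id, plain text) pairs by the value of @who."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for posting in root.iter("posting"):
        text = "".join(posting.itertext()).strip()
        result.setdefault(posting.get("who"), []).append((posting.get(XML_ID), text))
    return result

grouped = postings_by_author(SAMPLE)
```

Keeping each user contribution in one element with an identifier and an author reference is what makes later steps such as anonymization and sample referencing straightforward.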

These four corpus projects will provide the test bed for an evaluation of the models under construction with cmc discourse from different languages.

Paper 2: Expanding the TEI encoding framework to genres of computer-mediated communication: considerations and suggestions

The first section of this paper presents some basic suggestions for the expansion of the TEI encoding framework to the structural and linguistic particularities of cmc genres. It takes into account the general requirements as well as the project-specific requirements outlined in paper 1 and builds on the customized TEI schema for cmc which has been presented at the 2011 TEI members’ meeting (published in Beißwenger et al. 2012). The suggestions describe features for the modeling of corpus documents with stored discourse from cmc genres such as online forums, chats, wiki talk pages, Twitter, weblogs or social network sites and (amongst others) refer to the following basic issues in the description of cmc:

  • the representation of user postings in written cmc as units which share characteristics with both text and conversations: under aspects of planning and coherence, they are designed as moves in an ongoing conversation; under the aspect of production and reception they behave just like texts, which first have to be produced and then are presented to and received by the addressee(s) en bloc;
  • the need for models for the representation of cmc macrostructures (= the way in which series of user postings are grouped / presented to the users, e.g., in the form of logfiles, different types of threads, timelines etc.);
  • the need for elements for the annotation of cmc-specific structural and linguistic features on the microlevel of cmc discourse (= the content of the postings which comprises e.g. typical “netspeak” phenomena such as emoticons, action words, addressing terms; hashtags; speedwriting phenomena, phenomena of non-standardized writing; embedded hyperlinks and media objects etc.).
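
Some of the microlevel features listed above lend themselves to (partially) automatic annotation. The following Python sketch locates a few of them with regular expressions; the patterns and the labels are deliberately simplistic assumptions and would have to be replaced by genre- and language-specific resources in a real project.

```python
import re

# Illustrative patterns only; real emoticon inventories are much larger.
FEATURES = [
    ("emoticon", re.compile(r"[:;]-?[)(DP]")),
    ("hashtag", re.compile(r"#\w+")),
    ("addressing", re.compile(r"@\w+")),
]

def tag_features(text):
    """Return sorted (start, end, label, match) tuples for all features."""
    spans = []
    for label, pattern in FEATURES:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label, match.group(0)))
    return sorted(spans)

spans = tag_features("@tina see #tei2013 :-)")
```

The character offsets returned here could serve as the basis for inserting inline elements or for stand-off annotation, depending on the chosen representation.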

With the help of examples from the corpus projects introduced in paper 1, the second section of the paper will offer problem sketches of the following open issues in modeling cmc and outline some first ideas for their treatment in TEI:

  • Handling citations: Especially in forums and Bulletin Boards, cmc postings often contain (simple and nested) citations which reproduce content that has originally been part of other authors’ prior postings. A schema for the representation of cmc should include a model for the annotation of citations and for referencing citations with the cited prior postings and their authors.
  • Cmc data as hyperlinked data: Many cmc resources contain hyperlinks and linking structures. A framework for the representation of cmc interactions must include models for the description of how postings are linked with each other and/or with other interaction-external resources on the internet. In some cmc applications (e.g., micro-blogging sites such as Twitter) the method of displaying one and the same user posting as part of a sequence may vary depending on the user’s choice (cf. e.g. on Twitter the timeline of one author’s tweets vs. the timeline of tweets by different authors which include the same hashtag). A general model for cmc resources must provide features for the description of these kinds of structures and of the target sources of the hyperlinks.
  • Dealing with data from multimodal cmc environments: In some cmc environments users are communicating not only in a text-based mode but using a combination of text-, audio-, video- and/or 3D-based modalities of interaction (e.g., e-learning platforms, Skype, gaming environments, virtual worlds etc.). One of the challenges related to the representation of cmc discourse recorded in environments of that kind is that contributions created and sent in one modality may contribute to, and indeed supplement, a contribution in another modality. In audio-graphic conferencing environments such as Skype, written postings sent via chat may contribute to an ongoing spoken conversation in the audio modality. In collaborative writing environments, written postings in the chat may contribute to the creation of a longer stretch of text in the word-processing modality. One challenge of treating cmc discourse of that kind is thus the necessity to integrate and align user contributions made in different modalities into a representation of the overall multimodal interaction. Since TEI provides modules not only for written but also for (transcriptions of) spoken discourse, the different modes could be represented separately (using different TEI modules) while the alignment of the utterances and postings in the different modalities would have to be solved in an additional representation which is connected with the different resources.
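
The alignment problem described in the last item can be sketched in Python as a merge of separately represented modalities into one timeline. The sketch assumes that every contribution carries a start time in seconds and that each stream is already sorted; the dictionary keys and the sample data are hypothetical.

```python
import heapq

def align(*streams):
    """Merge per-modality streams (each sorted by start time) into one timeline."""
    return list(heapq.merge(*streams, key=lambda contribution: contribution["start"]))

# Hypothetical sample data: one audio stream and one chat stream.
audio = [
    {"start": 0.0, "mode": "audio", "who": "A01", "content": "Can you all hear me?"},
    {"start": 5.0, "mode": "audio", "who": "A01", "content": "I will paste the link."},
]
chat = [
    {"start": 2.0, "mode": "chat", "who": "A02", "content": "yes, loud and clear"},
    {"start": 7.5, "mode": "chat", "who": "A01", "content": "http://example.org"},
]

timeline = align(audio, chat)
```

The merged timeline plays the role of the additional representation mentioned above: the individual streams stay untouched, and only the ordering across modalities is computed.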

Paper 3: Metadata for cmc documents

Extensive and correct metadata has been recognized to be a crucial property of every data object that is used as a primary data source in research contexts. Fine-grained metadata allow for identification, location and management of resources (e.g., NISO, 2004) but also provide researchers with crucial information regarding the suitability of a given resource for their particular research interest. The TEI header recognizes all of these metadata requirements to different degrees (Burnard 2005).

Our paper will have a strong focus on the encoding of intrinsic properties of different cmc data sets, thus addressing the issue of finding resources which are suitable for a given research question. Ideally, this part of the metadata description is based on the model representing the primary data. In this respect our paper strongly relies on paper 2, which will propose such a model for cmc data.

An example of cmc-specific data types are emoticons: small, iconic representations of an interlocutor’s emotion or his/her attitude towards an utterance (either self-produced or produced by other speakers) or towards a communication peer, to name just some of their communicative functions. It is therefore worth considering whether to encode normalization and classification schemes for those entities within the metadata description or to provide pointers to such schemes, in addition to a suitable markup of these entities within the primary data.

Cmc data often contains large portions of verbatim cited material from previous parts of the discourse. This creates a challenge to the measurement of the extent of a given resource. Depending on the assumed discourse status of cited (parts of) utterances it may be necessary to include or exclude cited material. This is a theory-dependent decision, and it should therefore be possible to give concurrent values for a single unit of measurement. Moreover, metadata information on (the handling of) citations may – to some extent – be derived from the primary data directly (see paper 2 for handling of citations in primary text).
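
The idea of concurrent extent values can be illustrated with a short Python sketch that counts whitespace-separated tokens once with and once without cited material. It assumes that each posting has already been split into segments marked as cited or original; the data layout and the key names are illustrative assumptions.

```python
def extents(postings):
    """Compute two concurrent word counts for a list of postings.

    Each posting is a list of (text, is_cited) segments.
    """
    total = sum(len(text.split()) for posting in postings for text, _ in posting)
    original = sum(
        len(text.split())
        for posting in postings
        for text, is_cited in posting
        if not is_cited
    )
    return {"words_including_citations": total, "words_excluding_citations": original}

postings = [
    [("I like the schema", False)],
    [("I like the schema", True), ("So do I", False)],
]
counts = extents(postings)
```

Both values could then be recorded side by side in the metadata, leaving the choice between them to the researcher.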

Distinct typologies for cmc tools (including tools that were used to access the primary data) and cmc genres are needed to account for the broad range of different data sources, e.g., online forums, chats, wikis, Twitter, weblogs, social network sites, learning environments and others. We will suggest mechanisms for referencing a particular typology of cmc genres from within the metadata, without, however, prescribing which kind of typology should be used and referenced in a given project.

Special care must be taken in the metadata description of information about discourse participants to ensure privacy and/or anonymity of the speakers involved in the discourse. Moreover, specific metadata for cmc should also have the function of restoring context information about features of the communication mode of production and reception of cmc texts that are not evident in the text itself. This involves features such as the temporal structuring of the discourse (synchronous vs. asynchronous mode), conversational hierarchies among discourse participants (e.g. blog author vs. commentator), discourse topic/domain or accessibility of the discourse (e.g. private vs. closed vs. public). The availability of social and other context information varies greatly, not only in quantity but also in its quality, according to the primary data source. Therefore a cmc metadata scheme will have to account for different levels of reliability for such information.

Considering the given fourfold structure of the TEI header (file description, encoding description, text profile and revision description), we will identify and discuss different possibilities for recording metadata properties that are specific for cmc data:

  • Cmc data comprise properties found in traditional written resources (such as books or newspapers) as well as properties found in resources of (transcribed) spoken language. Both types of resources have previously been provided with TEI-based metadata. Properties shared across different resource types can be expected to be reusable for cmc metadata, e.g., listPerson to denote discourse participants or profileDesc to describe general discourse settings.
  • Some metadata properties that cannot be readily encoded using specific elements can still be recorded using the generic feature structure representation (fs). Embedding of feature structures is currently allowed for a limited set of header elements in the TEI such as classCode, extent, language, scriptNote and typeNote. Exploiting the semantic linking mechanism provided by att.datcat (via the ISOcat data category registry; note that classCode provides a native semantic interface via @scheme as well) would allow tailor-made semantics for the properties encoded in such a way. But obviously this adds a level of indirection and does not capture these properties within the TEI directly.
  • A third possibility lies in the adaptation of the TEI element inventory or of suggested cmc-specific value sets for existing elements. For individual projects this can already be achieved by TEI customizations but it may hinder interoperability across resources using elements not found in the TEI guidelines – which is another argument for why models for the representation of cmc data in TEI should better be part of the official guidelines and not be something that each project needs to solve individually.
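
The second option, recording properties in a generic feature structure, can be sketched in Python as follows. The elements fs, f and symbol belong to the TEI feature structure module; the type value `cmcSetting` and the feature names are hypothetical, the TEI namespace is omitted for brevity, and where such a structure would be embedded in the header is left open.

```python
import xml.etree.ElementTree as ET

def feature_structure(fs_type, features):
    """Build an <fs> element with one <f>/<symbol> pair per feature."""
    fs = ET.Element("fs", {"type": fs_type})
    for name, value in features.items():
        f = ET.SubElement(fs, "f", {"name": name})
        ET.SubElement(f, "symbol", {"value": value})
    return fs

# Hypothetical feature names for two context properties of a cmc resource.
fs = feature_structure(
    "cmcSetting",
    {"temporalMode": "synchronous", "accessibility": "public"},
)
xml_text = ET.tostring(fs, encoding="unicode")
```

As noted above, the meaning of such feature names is not defined within the TEI itself, so a project would still have to document them or link them to an external registry.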

We will conclude the paper with a proposed metadata header for TEI documents encoding cmc data. We will also – at least for some prominent features of metadata for cmc documents – show how the TEI header metadata are related to, and can be converted to, metadata components within the emerging CLARIN Metadata Framework (Component Metadata Infrastructure, CMDI).

Bibliography

  • Beißwenger, Michael; Ermakova, Maria; Geyken, Alexander; Lemnitzer, Lothar; Storrer, Angelika (2012): A TEI Schema for the Representation of Computer-mediated Communication. Journal of the Text Encoding Initiative, Issue 3. http://jtei.revues.org/476 (DOI: 10.4000/jtei.476).
  • Beißwenger, Michael; Ermakova, Maria; Geyken, Alexander; Lemnitzer, Lothar; Storrer, Angelika (2013): DeRiK: A German Reference Corpus of Computer-Mediated Communication. In: Literary and Linguistic Computing (LLC).
  • Burnard, Lou (2005): Metadata for corpus work. In: Martin Wynne (ed.): Developing Linguistic Corpora: A Guide to Good Practice. Oxford, 30-46.
  • Chiari, Isabella; Canzonetti, Alessio (in press): Le forme della comunicazione mediata dal computer: generi, tipi e standard di annotazione. In: Enrico Garavelli & Elina Suomela-Härmä (eds.): Dal manoscritto al web: canali e modalità di trasmissione dell’italiano. Tecniche, materiali e usi nella storia della lingua. Atti del XII Convegno della Società Internazionale di Linguistica e Filologia Italiana (SILFI, Helsinki 18-19 June 2012), Franco Cesati Editore, Firenze.
  • [NISO 2004] National Information Standards Organization (2004): Understanding Metadata. http://www.niso.org/publications/press/UnderstandingMetadata.pdf
  • Oostdijk, Nelleke; Reynaert, Martin; Hoste, Véronique; Schuurman, Ineke (2013): The Construction of a 500-Million-Word Reference Corpus of Contemporary Written Dutch. In: Peter Spyns & Jan Odijk (eds): Essential Speech and Language Technology for Dutch. Springer. http://link.springer.com/chapter/10.1007/978-3-642-30910-6_13
  • Reffay, Christophe; Betbeder, Marie-Laure; Chanier, Thierry (2012): Multimodal Learning and Teaching Corpora Exchange: Lessons learned in 5 years by the Mulce project. Special Issue on dataTEL: Datasets and Data Supported Learning in Technology-Enhanced Learning. International Journal of Technology Enhanced Learning (IJTEL) 4 (1/2), 11-30. http://edutice.archives-ouvertes.fr/edutice-00718392 (DOI: 10.1504/IJTEL.2012.048310).
  • [TEI P5] TEI Consortium (eds) (2007). TEI P5: Guidelines for Electronic Text Encoding and Interchange. http://www.tei-c.org/Guidelines/P5/ (accessed 22 March 2013).
  • van Gompel, Maarten (2012). FoLiA: Format for Linguistic Annotation. Documentation. ILK Technical Report 12-03. Available from http://ilk.uvt.nl/downloads/pub/papers/ilk.1203.pdf

The role of the TEI in the establishment of a European shared methodology for the production of scholarly digital editions

While it cannot be denied that the TEI represents an important point of reference for the preparation of digital editions of culturally important texts of all kinds, its influence remains somewhat more marginal than should ideally be the case. From an encoding point of view, despite many improvements made in the last few years (see for instance the new mechanisms for documentary and genetic encoding), there are still a few ‘grey areas’, one of the more obvious being the critical apparatus module (Chapter 12 of the Guidelines), which has several clear gaps and flaws and has been widely criticised in recent years (see e.g. Burghart and Rosselli Del Turco 2012).

More worrying, and probably more consequential, is the lack of easy-to-use tools supporting the encoding process and the subsequent management of the encoded files. The question in this case is whether these tools are yet to come or whether they will ever come at all (see Pierazzo 2011).

Another major drawback in the general adoption of the TEI by the scholarly editorial community is perhaps represented by the final delivery of the edition once the encoding process is finished. There are a few tools readily available, such as the TEI stylesheets and the TEI Boilerplate, but they are limited, not very easy to customise without specific knowledge and not really suitable for high spec, complex digital editions.

And yet the TEI has undeniably played a vital role in shaping the intellectual agenda with respect to scholarly digital editions. Why does it still meet with resistance from scholars engaged in the production of editions? In 2011 a Europe-wide network called NeDiMAH (Network for Digital Methods in the Arts and Humanities) was launched with the purpose of “carrying out a series of activities and networking events that will allow the examination of the practice of, and evidence for, digital research in the arts and humanities across Europe” (see www.nedimah.eu/). The Network is supported by the European Science Foundation and involves representatives from Bulgaria, Croatia, Denmark, Finland, France, Germany, Ireland, the Netherlands, Norway, Portugal, Romania, Sweden, Switzerland and the United Kingdom. Within NeDiMAH a working group specifically devoted to Scholarly Digital Editions has been set up, seeking to promote international cooperation and to highlight best practices and areas of improvement both in terms of methodologies and IT infrastructure (see http://www.nedimah.eu/workgroups/scholarly-digital-editions). Following a very successful expert seminar in The Hague (see http://www.nedimah.eu/events/nedimah-expert-meeting-digital-scholarly-editions), where theoretical and practical issues connected with the production and consumption of scholarly digital editions were debated, the working group proposes a round table specifically focused on the role of the TEI within the theory and practice of scholarly digital editing. The main topics that will be covered are:

The apparatus criticus: How and why? The TEI offers three different formats for encoding variants, but it seems that only “parallel segmentation” has been used in practice by TEI users. This method has several drawbacks (for instance, with many witnesses the markup becomes excessively complex, with much overlapping of lemmas inevitable), but it seems to be the only one that allows for any sort of implementation. The other two methods, on the other hand, in spite of being far more flexible, require a considerable effort in the development of any processing tools, based as they are on stand-off markup.
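
One reason why parallel segmentation is comparatively easy to implement is that the text of every witness can be read off the markup in a single pass. The following Python sketch shows this for a hand-written line; app, lem, rdg and the wit attribute are TEI constructs, while the sample text, the witness sigla and the restriction to non-nested apparatus entries directly below the line element are simplifying assumptions (the TEI namespace is again omitted).

```python
import xml.etree.ElementTree as ET

# Hand-written illustration with three hypothetical witnesses A, B and C.
SAMPLE = (
    '<l>The <app><lem wit="#A #B">quick</lem><rdg wit="#C">swift</rdg></app>'
    ' fox <app><lem wit="#A">jumps</lem><rdg wit="#B #C">leaps</rdg></app></l>'
)

def witness_text(xml_text, siglum):
    """Reconstruct the text of one witness from parallel-segmentation markup."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "app":
            # Take the first reading attested by the requested witness.
            for reading in child:
                if siglum in (reading.get("wit") or "").split():
                    parts.append(reading.text or "")
                    break
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)

text_a = witness_text(SAMPLE, "#A")
text_c = witness_text(SAMPLE, "#C")
```

The same simplicity explains the drawback mentioned above: as soon as variants of different witnesses overlap, the inline segments can no longer be nested cleanly.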

What is really the function of the critical apparatus? The TEI Guidelines seem to imply that it works like a repository of variants. A proper apparatus criticus is far more than that, however: it is the key to understanding why the text presented is what it is. More precisely, the apparatus is a set of notes designed to foster in the reader an awareness of historical and editorial processes that resulted in the text s/he is reading and to give the reader what s/he needs to evaluate the editor’s decisions. Is this vision present or even possible within the Guidelines?

Tools: In this context, we will discuss the potential impact of outreach targeting tool developers from outside the strict TEI community. Could we offer developers more or less unfamiliar with the TEI a low-threshold introduction, less overwhelming than the Guidelines? This would of course require some recommendations for “best practice”. Burghart proposes a series of “cheatsheets” (Burghart 2011), offering digests of TEI encoding recommendations starting from the user experience. These could serve not only as a guide to the Guidelines for end-users, but could also be of great help to developers in understanding the concepts / phenomena their users want to encode.

More generally, we will discuss the TEI’s intellectual leadership and responsibilities in the field of digital scholarly editing.

Participants

  • M. J. Driscoll, Københavns Universitet, Chair of the NeDiMAH working group on digital scholarly editions
  • Elena Pierazzo, King’s College London, Co-chair of the NeDiMAH working group on digital scholarly editions
  • Marina Buzzoni, Università Ca’ Foscari Venezia
  • Marjorie Burghart, L’École des hautes études en sciences sociales, Lyon
  • Cynthia Damon, University of Pennsylvania
  • Patrick Sahle, Universität zu Köln

TAPAS and the TEI: An Update and Open Discussion

The TEI Archiving, Publishing, and Access Service (TAPAS) is now entering its second year of development, with the goal of supporting the publication and archiving of small-scale scholarly TEI projects. A prototype is now being tested which supports a set of core functions including the creation of projects and collections, upload of TEI data, creation of metadata and transfer of metadata from existing TEI files, configuration of the publication interface, and various ways of exploring TAPAS collections. An intensive user testing period is scheduled for the end of April 2013, and an additional period of user testing will be conducted during July and August 2013. An initial release of the service is planned for early 2014. TAPAS is also exploring a relationship with the TEI Consortium that would make TAPAS a benefit of TEI membership, and that would take advantage of TAPAS to offer discounted TEI workshops and supporting services to TEI members.

At the TEI annual conference in 2012 at Texas A&M University, Julia Flanders gave a presentation on TAPAS that sought to elicit ideas and comments from the TEI community concerning the role TAPAS might play in supporting the creation, publication, and long-term archiving of TEI data. The resulting discussion offered input on a number of important issues that have had significant impact on the shape of TAPAS: for instance, the suggestion that TAPAS might serve as a kind of community corpus or teaching corpus for TEI data, the issue of divergent encoding practices within TAPAS data, and the question of how to handle migration to future versions of the TEI Guidelines. Following a year of further development, it is important for TAPAS to receive further input from the TEI community and to provide updated information on the project’s development.

This session will begin with three short presentations from panelists that offer an updated view of progress on TAPAS, as follows:

  1. Julia Flanders will present an update on the technical and strategic development of TAPAS, including the architecture of the service, the business model, and the process of user testing.
  2. Syd Bauman will present a detailed examination of the TAPAS schemas and their design, and will report on information gathered through the profiling of TEI data contributed to TAPAS.
  3. Elena Pierazzo will present an update on the relationship between TAPAS and the TEI, focusing on the development of a memorandum of understanding and the planning of TAPAS services as TEI member benefits.

Following these presentations, the session will provide approximately 45 minutes for open discussion. The following questions will be suggested as starting points, but any topics raised by audience members will be welcome:

  • Can TAPAS be made sustainable as a benefit of TEI membership?
  • How can TAPAS better serve the international TEI community? Is its scope too limited?
  • What are the highest priority features for TAPAS to offer its contributors?
  • What are the highest priority features from the reader’s perspective? What will make TAPAS a useful resource about the TEI?

Dialogue and linking between TEI and other semantic models

The deep dialogue that TEI has started with other semantic models (i.e. CIDOC-CRM and FRBR/FRBRoo) has two aims: the interchange of data and documents, and the improvement of editors’ ability to formally declare hermeneutical positions. The TEI schema provides most of the elements, attributes and classes needed to describe instances of interpretation, while further schemas, value vocabularies and metadata element sets are expected to enhance the potential of the model itself. On the one hand, additional schemas could help refine the scope of some TEI elements; on the other, existing ontologies could make interpretation more effective. This panel therefore introduces three different approaches to document representation in which TEI may draw on other models.

We first present the contribution of EAC (Encoded Archival Context) to extending the description of people, starting from the archival approach to context, understood here as the key element for defining an individual’s roles and functions. We then consider the dialogue between TEI and existing ontologies, with particular attention to geographic data. Finally, using ‘semantic lenses’ as an exploratory tool for annotated documents, we explore the relationship between TEI and specific ontologies related to semantic publishing.

These approaches adopt a linked data perspective, enriching TEI elements with @ref attributes and URIs and adopting the RDF model for assertions. By exposing TEI annotations as data sets, we could improve the interchange of both schemas and documents with other existing data sets, enhancing the possibilities for information retrieval. Digital editions based on TEI could start a dialogue with WWW resources in a global vision of heritage, understood here as the connection of cultural data, in which digital editions, acting as a kind of interlink between literary texts, archival documents and books, play a crucial role in the preservation of cultural memory.

TEI <person> versus EAC: the identity between functions and context

Among the most significant changes in the TEI P5 version of the schema, the section on Biographical and Prosopographical Data [1] undoubtedly constitutes a challenging innovation. TEI decided to invest in ‘persons’, defining a taxonomy of elements for describing individuals. In 2006 a special workgroup called ‘Personography’ was chartered: its task was “to investigate how other existing XML schemes and TEI customization handle data about people” [2], and a “Report on XML mark-up of biographical and prosopographical data” was published [3].

A basic approach to describing people consists in uniquely identifying individuals and enriching their description by classifying their features. However, we must not forget that people are strongly connected with the textual context: roles and functions, understood as features of individuals, naturally change depending on the context, i.e. on the source attesting the individual. It is therefore possible to state that: 1) some features are not only static over time but also theoretically constant in relation to the context (e.g. birth, death, nationality, persName); 2) other features vary depending on date and place (e.g. age, affiliation, education, event, state); 3) roles and functions (e.g. author, actor, editor, speaker) identify people depending on the context.

Thus a person is a complex entity, because she/he is connected with different types of phenomena: some are unchangeable, while others depend on a time period, a place or a context. In any case, all these features are able to turn a string into a concept, that is, an assertion resulting from the relations between the elements needed to provide meaning.

The <person> element in TEI could be associated with different roles or functions. Consider the digital edition of a literary text. A person may be, respectively: one of those who created the digital edition (at different levels), the author of the analogue source, the editor of the printed version, or any of the individuals quoted in the text. The concept of person thus extends its domain: although individuals are strictly related to the source that constitutes their semantic background, they are also entities with a function, which allows a single person to be connected with different documents (or other resources in general) and several persons to be connected with others sharing the same role. Multiple relationships therefore arise: between individuals, between a person and a document in which she/he is mentioned, and between a person and other resources.

This reflection links TEI to one particular XML schema, EAC (Encoded Archival Context) [4], developed in order to formalize the ISAAR (CPF) standard (International Standard Archival Authority Record for Corporate Bodies, Persons and Families) [5] and today also represented as an ontology [6]. EAC contributes to the reasoning on individuals by pointing out the importance of both context and relationships. The approach described here aims to extend the domain of digital editions to that of archival studies. Archival science asserts the principle of separation between the description of records (documents) and the description of people (corporate bodies, persons and families) [7], focusing on context as a key element. The same approach could largely be implemented in TEI, if the final purpose is to expose data sets to be used by the Web community.

It then becomes essential to consider EAC as a schema able to suggest how to extend the concept of <relation> in TEI. EAC (CPF) is based on the principle of the entity, understood as a corporate body, person or family that maintains relationships (between entities, and between one entity and a resource linked to it at some level), each of which can be described, dated and categorized. Besides the elements connected to the “relation” principle (<cpfRelation> and <resourceRelation>), EAC describes the <function> element, which “provides information about a function, activity, role, or purpose performed or manifested by the entity being described” [4] on a specific date. The element <functionRelation> describes a “function related to the described entity. [...] Includes an attribute @functionRelationType” that could support a taxonomy of values [4].

A new model of authority record could thus be introduced, understood as a complex structure able to document the context in which the identity is attested: the authority is generated not only by the controlled form of the name and the related parallel forms, but also by the relationships arising from the context, which together determine a concept [8].

According to the RDF model, an identified entity (URI) maintains relationships (predicates) with different objects: another entity (URI), e.g. another person; a place (URI); a date (URI); an event (URI); a contextual resource (URI), i.e. the document; or an external resource (URI), that is, another object (a document, an image, a video, an audio recording, and so on).

We could try to apply this procedure to the responsibility of an individual identifiable as a contributor to a digital edition who, on a specific date, performed a specific activity. TEI metadata propose two options for describing responsibility (<fileDesc> and <revisionDesc>):

  • <fileDesc><titleStmt><respStmt> with <resp> and <name>
  • <revisionDesc><respStmt> with <resp> and <persName>

Each person is associated with a “responsibility” that identifies the function the entity performed in that document, linking people to resources. The same person could hold the same responsibility in other editions; in this way relationships can be extended to other documents. Other individuals could moreover be connected to that person because they share the same responsibility.
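As a minimal sketch of how such responsibility statements might be read from a header, the following Python fragment uses only the standard library to collect the <resp> and <name> pairs found under <titleStmt> and to group people by shared responsibility. The header fragment, the personal names and the function name responsibilities are invented for illustration and are not taken from any real edition.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical header fragment: titles, names and roles are invented.
HEADER = """
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt>
      <title>Sample edition</title>
      <respStmt>
        <resp>transcription</resp>
        <name>Ada Example</name>
      </respStmt>
      <respStmt>
        <resp>encoding</resp>
        <name>Ben Sample</name>
        <name>Ada Example</name>
      </respStmt>
    </titleStmt>
  </fileDesc>
</teiHeader>
"""


def responsibilities(header_xml):
    """Map each <resp> value to the list of names that carry it."""
    root = ET.fromstring(header_xml)
    by_resp = defaultdict(list)
    for stmt in root.findall(".//tei:titleStmt/tei:respStmt", TEI_NS):
        resp = stmt.findtext("tei:resp", default="", namespaces=TEI_NS).strip()
        for name in stmt.findall("tei:name", TEI_NS):
            by_resp[resp].append((name.text or "").strip())
    return dict(by_resp)


if __name__ == "__main__":
    for resp, people in responsibilities(HEADER).items():
        print(resp, "->", ", ".join(people))
```

Grouping by the <resp> value is only one possible design choice; it makes visible the people who are connected through a shared responsibility, which is the relationship discussed above.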

This process could be declared and exposed as a data set, with RDF and URIs for the syntax and TEI/EAC for classes and predicates, in order to build a collection of authority records for people who performed a role or a function in a certain time period and context. By declaring connections as relationships, following the EAC model, we could develop a knowledge base of people with context-originated functions.
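The kind of data set described here could be sketched, under stated assumptions, as a handful of statements serialized in N-Triples syntax. In the Python fragment below the example.org namespace, the identifiers and the predicate names (performs, functionType, inContext, onDate) are hypothetical placeholders, not terms from TEI or from the EAC-CPF ontology; only the xsd:date datatype URI is a real one.

```python
from typing import NamedTuple

BASE = "http://example.org/"  # hypothetical namespace, for illustration only
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"


class Literal(NamedTuple):
    """A typed literal value, as opposed to a resource identified by URI."""
    value: str
    datatype: str


# A person performs a function; the function has a type, a context and a date.
STATEMENTS = [
    ("person/ada", "vocab/performs", "function/f1"),
    ("function/f1", "vocab/functionType", "vocab/encoder"),
    ("function/f1", "vocab/inContext", "edition/sample"),
    ("function/f1", "vocab/onDate", Literal("2013-01-15", XSD_DATE)),
]


def term(t):
    """Render a local identifier as a URI reference, or a Literal as a typed literal."""
    if isinstance(t, Literal):
        return f'"{t.value}"^^<{t.datatype}>'
    return f"<{BASE}{t}>"


def to_ntriples(statements):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(
        f"{term(s)} {term(p)} {term(o)} ." for s, p, o in statements
    )


if __name__ == "__main__":
    print(to_ntriples(STATEMENTS))
```

Modelling the function as a node of its own, rather than as a direct link from person to edition, is what allows it to carry a date and a context, in the spirit of the dated and categorized relationships of EAC.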

We can definitely say that digital editions open the door to the cultural heritage domain, establishing connections between heterogeneous objects and “creating efficiencies in the re-use of metadata across repositories, and through open linked data resources” [9]. Linked Data describing persons performing specific roles would be considerably improved by employing specifications relative to these persons’ function while using the context as interpretative key: “the description of personal roles and of the statuses of documents needs to vary in time and according to changing contexts [...] such roles and statuses need to be handled formally by ontological models.” [10]

Bibliography

  • [1] TEI Consortium (eds.). “13.3 Biographical and Prosopographical Data”. In Guidelines for Electronic Text Encoding and Interchange. Last updated on 21 December 2011. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ND.html#NDPERS
  • [2] TEI: Personography Task Force. http://www.tei-c.org/Activities/Workgroups/PERS/index.xml
  • [3] Wedervang-Jensen, Eva, and Matthew Driscoll, Report on XML mark-up of biographical and prosopographical data. 16 Feb 2006. http://www.tei-c.org/Activities/Workgroups/PERS/persw02.xml
  • [4] EAC-CPF, Encoded Archival Context for Corporate Bodies, Persons, and Families. http://eac.staatsbibliothek-berlin.de/
  • [5] CBPS – Sub-Committee on Descriptive Standards. “ISAAR (CPF): International Standard Archival Authority Record for Corporate Bodies, Persons and Families”. 2nd Edition, 2003. http://www.ica.org/10203/standards/isaar-cpf-international-standard-archival-authority-record-for-corporate-bodies-persons-and-families-2nd-edition.html
  • [6] Mazzini, Silvia, and Francesca Ricci. 2011. “EAC-CPF Ontology and Linked Archival Data”. In Semantic Digital Archives (SDA) Proceedings of the 1st International Workshop on Semantic Digital Archives. http://ceur-ws.org/Vol-801/
  • [7] Pitti, Daniel. 2004. “Creator Description: Encoded Archival Context”. Authority control in organizing and accessing information: definition and international experience. Ed. Arlene G. Taylor, Barbara B. Tillett, Murtha Baca and Mauro Guerrini, 201-226. Binghamton N.Y.: Haworth Information Press
  • [8] Tomasi, Francesca. 2013. Le edizioni digitali come nuovo modello per dati d’autorità concettuali. JLIS.it 4.2. DOI: 10.4403/jlis.it-8808
  • [9] Larson, Ray R., and Krishna Janakiraman. 2011. “Connecting Archival Collections: The Social Networks and Archival Context Project”. In Research and Advanced Technology for Digital Libraries. Proceedings of the International Conference on Theory and Practice of Digital Libraries (TPDL 2011). Ed. Stefan Gradmann, Francesca Borri, Carlo Meghini and Heiko Schuldt, 3-14. Heidelberg, Germany: Springer. DOI: 10.1007/978-3-642-24469-8_3
  • [10] Peroni, Silvio, David Shotton, and Fabio Vitali. 2012. “Scholarly publishing and the Linked Data: describing roles, statuses, temporal and contextual extents”. In Proceedings of the 8th International Conference on Semantic Systems, 9-16. ACM, New York. DOI: 10.1145/2362499.2362502

Geolat: a digital geography for Latin literature

This paper presents the “Geolat” project, which aims to make Latin literature accessible through a geographic/cartographic query interface. In 2012 the project, under the name DAGOCLaT (Digital Atlas with Geographical Ontology for Classical Latin Texts), was submitted in response to a call from the “Compagnia di San Paolo Foundation” and, following a blind peer evaluation managed by the European Science Foundation, was funded for exploratory and initial activities. In January 2013, under the name ALTUSS (Advanced Latin Texts Uses for School and Society), the project was revised and enriched, among other things by an advisory board composed of Gregory Crane (Perseus, Pelagios), Tom Elliott (Pleiades) and Leif Isaksen (Google Ancient Places), and was submitted in response to the European ERC Synergy call.

The first objective of the project is to set up a digital library containing the works of Latin literature from its origins to the end of the Roman Empire (conventionally dated to AD 476). This stage involves the integration of various existing repositories of Latin texts of high philological quality, starting from their existing TEI/XML encoding. Building a (some might say “the”) global digital library of ancient Latin literature is a very important field in which the APA is working [1] and in which Gregory Crane recently issued a call to start working [2]; the “Geolat” project will also build its global library, because the library is a precondition for all subsequent activities. All the library texts will be encoded with a very light TEI subset of tags.

In a second phase the works so collected are analyzed at the morphological level by means of a parser (that of LASLA in Liège [3]), so as to associate each word with its morphological analysis or description, which includes the identification of proper names. After that, by means of manual intervention, geographic references will be progressively encoded in a formal manner using the TEI elements <placeName> and <geogName> (described in chapter 13 of the TEI Guidelines, “Names, Dates, People, and Places”). Each occurrence of a place name or geographical reference will be identified by a URI (using the @ref attribute) that points to a formal description of the place in a formal ontology of the geography of the ancient Latin world (the traditional printed reference was and still is the Barrington Atlas [4]).
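To illustrate the encoding just described, the following Python sketch reads a small TEI fragment and lists every <placeName> and <geogName> that carries a @ref attribute. The Latin sentence is invented for the example, and the example.org URIs are placeholders standing in for whatever identifiers the project's ontology would eventually provide.

```python
import xml.etree.ElementTree as ET

# Invented sample sentence; the @ref values are placeholder URIs.
SAMPLE = """
<p xmlns="http://www.tei-c.org/ns/1.0">Hannibal a
  <placeName ref="http://example.org/places/carthago">Carthagine</placeName>
  ad <geogName ref="http://example.org/places/alpes">Alpes</geogName> venit.</p>
"""


def place_references(xml_text):
    """Return (element, surface form, URI) for each geographic reference."""
    root = ET.fromstring(xml_text)
    refs = []
    for el in root.iter():
        local = el.tag.rsplit("}", 1)[-1]  # drop the namespace part of the tag
        if local in ("placeName", "geogName") and el.get("ref"):
            surface = "".join(el.itertext()).strip()
            refs.append((local, surface, el.get("ref")))
    return refs


if __name__ == "__main__":
    for kind, surface, uri in place_references(SAMPLE):
        print(f"{kind}: {surface} -> {uri}")
```

Keeping the inflected surface form (here an ablative and an accusative) alongside the URI is deliberate in this sketch: the text preserves the word as attested, while the identifier carries the reference to the place.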

This ontology will be built ad hoc, reusing the data offered by the Pleiades gazetteer [5] and, where possible, establishing relationships with other relevant geographic ontologies such as GeoNames. In general the ontology will be structured in a two-tier fashion (following the tradition in DL ontology modelling): a T-box modelling geospatial classes of locations, their properties and their relationships, and an A-box with geospatial information about individual places and locations. At this level the sites of antiquity will be associated with a variety of information:

  • URI (and possible links to URIs in other data sets)

  • GPS coordinates

  • different names, time frames of validity and etymology

  • belonging to an itinerary (pilgrimage, military expedition, etc.)

  • typology

  • historical, geographical, cultural annotations

  • links to other relevant Linked Data sets
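The two-tier structure and the kinds of information listed above can be sketched in a few lines of Python, with the T-box reduced to a class hierarchy and the A-box to records about individual places. The class names, the example.org URIs and the record layout are assumptions made for illustration, and the coordinates given for Rome are approximate.

```python
# T-box: a hypothetical class hierarchy, written as child -> parent.
TBOX = {"City": "Settlement", "Settlement": "Place", "River": "Place"}

# A-box: assertions about individual places (illustrative values only).
ABOX = {
    "http://example.org/places/roma": {
        "type": "City",
        "names": ["Roma"],
        "coordinates": (41.9, 12.5),  # approximate latitude and longitude
    },
    "http://example.org/places/tiberis": {
        "type": "River",
        "names": ["Tiberis"],
        "coordinates": None,  # a river is not well described by one point
    },
}


def is_a(cls, ancestor):
    """True if cls equals ancestor or is one of its subclasses in the T-box."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = TBOX.get(cls)
    return False


def individuals_of(ancestor):
    """URIs of all A-box individuals whose class falls under ancestor."""
    return sorted(
        uri for uri, record in ABOX.items() if is_a(record["type"], ancestor)
    )


if __name__ == "__main__":
    print(individuals_of("Settlement"))
    print(individuals_of("Place"))
```

A query for a general class such as Place returns individuals typed with any of its subclasses, which is the practical benefit of separating the class model from the individual descriptions.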

A third level of modeling will be tied to the logical relationship between textual references (and their annotations by an encoder) and their referents in the ontology. It is easy to see that the textual context in which each geographical word (or phrase) is placed determines different modes of reference. From this point of view it seems necessary to introduce into the system an ontology of (geographic) annotations that can account for this variety of reference. In our work we will also discuss the various operational opportunities to formalize this information at the level of inline markup or through links to RDF statements in stand-off markup.

All the resources produced in our project (the primary sources, the geographic thesaurus and the list of textual annotations that link places in the text to geographic locations identified by URI) will be made available on the Web according to the principles of Linked Data, and will help to enrich the “Web of Data” with new content.

Bibliography

  • [1] APA Digital Latin Library Project, http://www.apaclassics.org/index.php/research/digital_latin_library_project
  • [2] Gregory Crane call, http://sites.tufts.edu/perseusupdates/2013/02/14/possible-jobs-in-digital-humanities-at-leipzig/
  • [3] LASLA, http://www.cipl.ulg.ac.be/Lasla/
  • [4] Talbert, R. (ed.), The Barrington Atlas of the Greek and Roman World, Princeton University Press 2000
  • [5] Pleiades Project, http://pleiades.stoa.org/
  • [6] GAP – Google Ancient Places, http://googleancientplaces.wordpress.com/
  • [7] GAPvis, http://googleancientplaces.wordpress.com/gapvis/
  • [8] Tom Heath and Christian Bizer, Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, Morgan & Claypool 2011
  • [9] Tom Elliott, S. Gillies, “Digital Geography and Classics”, in Digital Humanities Quarterly 3.1 (Winter 2009), http://www.digitalhumanities.org/dhq/vol/3/1/000031.html
  • [10] Open Annotation Data Model, Open Annotation Community Group 2013, http://www.openannotation.org/spec/core/

Bringing semantic publishing in TEI: ideas and pointers

TEI has a full set of elements that can be used to describe facts about the publication details of a text, such as editionStmt, publicationStmt, and sourceDesc. An extensive list of sub-elements allows a zealous editor to provide a rich overview of the publication aspects of the paper editions of the text, of this specific XML document, and of the steps through which an original source has made this XML possible. Several collections of allowable values for these elements exist, as thesauri, authority lists or simple value lists, which simplify the task of describing frequent or common situations and homogenize similar occurrences in different documents of the same collection. In a way, we could characterize value thesauri as external aids to improve the internal quality of digital collections of texts.

In the last few years a new discipline has arisen, semantic publishing, which tries to improve scientific communication by using web and semantic web technologies to enhance a published document so as to enrich its meaning, to facilitate its automatic discovery, to enable its linking to semantically related articles, to provide access to data within the article in actionable form, and to allow integration of data between papers [1,2]. Its main interest lies in the organization and description of scientific literature, trying to tame the incredible complexity of the modern scientific publishing environment, in terms of both size and credibility of publishing venues, authors, research groups and sponsors. For instance, SPAR [3,4,5] is a suite of orthogonal and complementary ontology modules for creating comprehensive machine-readable RDF metadata for all aspects of semantic publishing and referencing, each of them precisely and coherently covering one aspect of the publishing domain using terms with which publishers are familiar. Together, they provide the ability to describe bibliographic entities such as books and journal articles, reference citations, the organization of bibliographic records and references into bibliographies, ordered reference lists and library catalogues, the component parts of documents, and publishing roles, publishing statuses and publishing workflows. SPAR ontologies have already been used in different projects, such as the JISC Open Citations Project [6] – a database of biomedical literature citations, harvested from the reference lists of all open access articles in PubMed Central, that reference ~20% of all PubMed Central papers (approx. 3.4 million papers), including all the highly cited papers in every biomedical field – and the Semantic Web Applications in Neuromedicine (SWAN) Project [7].

One of the main aims of semantic publishing therefore is to create a rich network of interconnected facts about publications from which interesting patterns can emerge to discover, for instance, clusters of similar publications, intrinsic values of publication venues, emerging trends in publication topics, etc. In a way, we could characterize annotations coming from actual documents as internal aids to improve the external qualities of digital collections of texts, especially regarding emerging characteristics of the collections themselves rather than belonging to individual documents.

We believe that combining these aspects could be mutually beneficial, increasing both the quality of individual documents and the quality and explorability of the emerging properties of document collections.

Being able to associate a full set of related facts with individual values in individual elements of the publication and edition details of the electronic version of a text provides the end user with a large and interesting network of considerations that go well beyond the individual text, and using standard tools from the Semantic Web may well allow readers to connect to and exploit, for instance, the vast and growing collections of facts that embody the Linked Data initiative.

The actual syntax for this mesh is not particularly relevant. What is relevant is that through some syntactical mechanisms, it ends up being possible for an individual TEI document to feed Linked Data new and interesting facts about the corresponding publications and the involved actors, and conversely for Linked Data collections to enrich the amount of information about the publication and the involved actors that is made available to the interested reader, directly or after explicit queries, automatically or through the filtering and selecting action of an electronic editor.

The actual link between TEI documents and Linked Data resources is already feasible by adopting particular techniques and tools. There are two main ways to enable annotations linking existing TEI documents to Linked Data resources: either the annotation is embedded in the document itself (embedding techniques), or the annotations are stored in a separate document with references to the parts of the document each annotation refers to (standoff techniques). Neither embedding nor standoff annotation is right or wrong in itself; each technique has its own pros and cons, which must be evaluated case by case before it is used.
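The difference between the two techniques can be illustrated with a short Python sketch, offered as an assumption-laden example and not as the mechanism of any specific tool. A standoff annotation is kept apart from the text and points into it by character offsets, while the embedded form writes the same link into the text as inline markup. The target URI is a placeholder, and the function names make_standoff and embed are hypothetical.

```python
# Opening line of the Aeneid, used as an unannotated source text.
TEXT = "Arma virumque cano, Troiae qui primus ab oris"


def make_standoff(text, phrase, target):
    """Build a standoff annotation that points at the first occurrence of phrase."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "target": target}


def embed(text, ann):
    """Produce the embedded equivalent: the same link written as inline markup."""
    s, e = ann["start"], ann["end"]
    return (
        f'{text[:s]}<placeName ref="{ann["target"]}">'
        f"{text[s:e]}</placeName>{text[e:]}"
    )


if __name__ == "__main__":
    annotation = make_standoff(TEXT, "Troiae", "http://example.org/places/troia")
    print(annotation)          # stored separately; TEXT itself is unchanged
    print(embed(TEXT, annotation))  # the annotation carried inside the text
```

In the standoff case the source text is never modified, so several annotations, including overlapping ones, can refer to the same span; in the embedded case the link travels with the text, at the cost of changing the document.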

Even though many techniques have been devised in the past, the more technical solutions usually address only the problem of how to store the annotations, without dealing with the meaning of the annotations themselves. In the case of embedded annotations, these solutions offer a generic way to augment existing markup with annotations (e.g. RDFa [9]). In the case of standoff annotations, the existing technical solutions provide a way to address content (e.g. EARMARK [10-11] and NIF [12]). Unlike the other approaches, EARMARK offers an extension [13] that actually expresses the meaning of the annotation and allows one to easily link spans of text in TEI documents to external resources. It also provides a Java API [14] to support users in creating (even overlapping) annotations upon the same text, keeping track of provenance information such as the author who made the annotation and the time at which the annotation was created.

The technical solutions are only one half of what is needed to annotate documents. The other half is the use of an annotation model and vocabulary. There are many such vocabularies available, ranging from very generic annotation frameworks (e.g. the Open Annotation Data Model (OADM) [15] or the Annotation Ontology [16]), to more specific frameworks (e.g. the Linguistic Annotation Framework (LAF) [17], used to annotate the various linguistic features of a speech through its transcript, or Domeo [18], that describes annotations used to connect scholarly documents).

Bibliography

  • [1] Shotton, D. (2009). Semantic publishing: the coming revolution in scientific journal publishing, Learned Publishing 22 (2): 85–94. DOI: 10.1087/2009202
  • [2] Shotton, D., Portwin, K., Klyne, G., Miles, A. (2009). Adventures in semantic publishing: exemplar semantic enhancements of a research article, PLoS Computational Biology 5 (4): e1000361. DOI: 10.1371/journal.pcbi.1000361
  • [3] Semantic Publishing and Referencing Ontologies: http://purl.org/spar
  • [4] Peroni, S., Shotton, D. (2012). FaBiO and CiTO: ontologies for describing bibliographic resources and citations. In Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 17 (December 2012): 33-43. DOI: 10.1016/j.websem.2012.08.001
  • [5] Peroni, S., Shotton, D., Vitali, F. (2012). Scholarly publishing and the Linked Data: describing roles, statuses, temporal and contextual extents. In Presutti, V., Pinto, H. S. (Eds.), Proceedings of the 8th International Conference on Semantic Systems (i-Semantics 2012): 9-16. DOI: 10.1145/2362499.2362502
  • [6] JISC Open Citations homepage: http://opencitations.net
  • [7] Ciccarese, P., Wu, E., Kinoshita, J., Wong, G., Ocana, M., Ruttenberg, A., Clark, T. (2008). The SWAN biomedical discourse ontology, Journal of Biomedical Informatics 41 (5): 739–751. DOI: 10.1016/j.jbi.2008.04.010
  • [8] Huitfeldt, C., Sperberg-McQueen, C. M. (2001). Texmecs: An experimental markup meta-language for complex documents. Working paper of the project MLCD, University of Bergen
  • [9] Adida, B., Birbeck, M., McCarron, S., Herman, I. (2012). RDFa Core 1.1. W3C Recommendation, 7 June 2012. World Wide Web Consortium. http://www.w3.org/TR/2012/REC-rdfa-core-20120607/
  • [10] Di Iorio, A., Peroni, S., Vitali, F. (2011). Using Semantic Web technologies for analysis and validation of structural markup. In International Journal of Web Engineering and Technologies, 6 (4): 375-398. Olney, Buckinghamshire, UK: Inderscience Publisher. DOI: 10.1504/IJWET.2011.043439
  • [11] Di Iorio, A., Peroni, S., Vitali, F. (2011). A Semantic Web Approach To Everyday Overlapping Markup. In Journal of the American Society for Information Science and Technology, 62 (9): 1696-1716. Hoboken, New Jersey, USA: John Wiley & Sons, Inc. DOI: 10.1002/asi.21591
  • [12] Hellmann, S., Lehmann, J., Auer, S. (2012). Linked-data aware uri schemes for referencing text fragments. In ten Teije, A., Völker, J., Handschuh, S., Stuckenschmidt, H., d’Aquin, M., Nikolov, A., Aussenac-Gilles, N., Hernandez, N. (Eds.), Proceedings of the 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2012), Lecture Notes in Computer Science 7603: 398-412. Berlin, Germany: Springer. DOI: 10.1007/978-3-642-33876-2_17
  • [13] Peroni, S., Gangemi, A., Vitali, F. (2011). Dealing with Markup Semantics. In Ghidini, C., Ngonga Ngomo, A., Lindstaedt, S. N., Pellegrini, T. (Eds.), Proceedings of the 7th International Conference on Semantic Systems (I-SEMANTICS 2011): 111-118. New York, New York, USA: ACM. DOI: 10.1145/2063518.2063533
  • [14] Barabucci, G., Di Iorio, A., Peroni, S., Poggi, F., Vitali, F. (2013). Annotations with EARMARK in practice: a fairy tale. Submitted for publication in the 1st Workshop on Collaborative Annotations in Shared Environments: metadata, vocabularies and techniques in the Digital Humanities (DH-CASE 2013)
  • [15] Sanderson, R., Ciccarese, P., de Sompel, H. V. (2013). Open annotation data model. W3C Community draft, 08 February 2013. http://www.openannotation.org/spec/core/20130208/
  • [16] Ciccarese, P., Ocana, M., Garcia Castro, L., Das, S., Clark, T. (2011). An open annotation ontology for science on web 3.0. Journal of Biomedical Semantics, 2 (2): 1–24. DOI: 10.1186/2041-1480-2-S2-S4
  • [17] ISO (2012). ISO 24612:2012 Language resource management — Linguistic annotation framework (LAF). ISO
  • [18] Ciccarese, P., Ocana, M., Clark, T. (2012). Open semantic annotation of scientific publications using DOMEO. Journal of Biomedical Semantics, 3 (1): 1–14. DOI: 10.1186/2041-1480-3-S1-S1

Notes
1
The ODD document can be found at http://www.empirikom.net/bin/view/Themen/CmcTEI
2
http://wiki.itmc.tu-dortmund.de/cmc/
3
http://www.dwds.de
4
http://repository.mulce.org
5
Schema for the instantiation component of a LETEC corpus. http://lrl-diffusion.univ-bpclermont.fr/mulce/metadata/mce-schemas/mce_sid.xsd
6
https://groupes.renater.fr/wiki/corpus-ecrits-nouvcom/public/proj-tei/index
7
http://www.glottoweb.org/web2corpus/