[9 Sep (Sun) Afternoon] Half-day Workshops

Spoken Language: Tools and Workflow for Creating and Editing Data and Metadata

Carole Etienne (ICAR Lab, France), Christophe Parisse (Modyco/INSERM, Nanterre, France), Loïc Liégeois (LLF/CILLAC-ARP Labs in Paris)
(Room B2,3)

Most people working on oral corpora face two problems: choosing a set of metadata clear enough to make their corpora reusable by a large community, and using several transcription tools, which requires developing specific software so that the tools work together without loss of information.

Generally, scripts have been developed within individual teams to meet only some of these needs. They run on Windows, Mac, or Linux but rarely on more than one system, deliver data in a dedicated format that is difficult to share, are not very user-friendly, and receive no real maintenance or further development. In this workshop, we would like to present two free and open-source solutions that we have developed within IRCOM/CORLI and ORTOLANG, two French infrastructures, and that are available to a large community:

  • teiMeta is a tool for editing metadata in any XML (and TEI) file. Its goal is to allow editing or adding a common set of metadata in any file without damaging the other data in the file. The software is based on an ODD file and a stylesheet, and the TEI is generated automatically. It is therefore possible to create as many variants of the metadata editor as real applications require. teiMeta runs inside a web browser, so it works on multiple operating systems.
  • teiConvert is a tool for converting transcript files between different software tools (Transcriber, CLAN, Praat or ELAN) using a TEI pivot format. It can generate data in TEI format, but also in CSV, TXT (UTF-8), DOCX, TXT for the TXM software, or TXT for the Trameur/Lexico software. teiConvert is written in Java, so it works on many systems. An alternative web interface exists for people who do not want to install Java or use command-line instructions.
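
To illustrate what editing metadata "without damaging the other data in the file" can mean in practice, here is a minimal sketch in Python. It is not teiMeta's code and does not reflect how teiMeta is implemented: the sample document, the function name set_title and the choice of the title element are assumptions made only for this illustration. It changes one field of the TEI header and leaves the transcription part of the file untouched.

```python
import xml.etree.ElementTree as ET

# Official TEI namespace; registering it as the default keeps the
# serialized output free of generated prefixes such as "ns0:".
TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)

# Hypothetical, deliberately tiny TEI file: a header and one utterance.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Untitled</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text><body><u who="#SPK1">bonjour</u></body></text>
</TEI>"""


def set_title(tei_xml: str, new_title: str) -> str:
    """Return a copy of the TEI document with only the header title changed.

    set_title is a hypothetical helper written for this sketch.
    """
    root = ET.fromstring(tei_xml)
    ns = {"tei": TEI_NS}
    title = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", ns)
    if title is None:
        raise ValueError("the document has no titleStmt/title element")
    title.text = new_title  # only the metadata field is modified
    return ET.tostring(root, encoding="unicode")


updated = set_title(SAMPLE, "Interview 01")
print(updated)
```

In this sketch the transcription under the text element is parsed and written back unchanged; a real editor would additionally have to cope with missing header elements, comments and processing instructions, which this example does not attempt.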

After a presentation of teiMeta and teiConvert, we plan to organize a practical session in which we can work on examples of oral corpora of different types: acquisition, sociolinguistic, phonetic, rare languages or interactions. The goal of the workshop is to show how to obtain high-quality data in TEI that can be used for research purposes, data sharing and preservation. To make this practical session more efficient, participants may send us examples of their metadata and transcripts in advance, so that they can work on their own data and we can find solutions together. We can provide examples of different stylesheets or ODD files to adapt the tools to participants' needs.

References

Liégeois, L., Etienne, C., Parisse, C., Benzitoun, C., Chanard, C. (in press). "Using the TEI as pivot format for oral and multimodal language corpora". Journal of the Text Encoding Initiative, 10, halshs-01357343.
Liégeois, L., Etienne, C., Benzitoun, C., Parisse, C. (2017). Vers un format pivot commun pour la mutualisation, l'échange et l'analyse des corpus oraux, Floral, Orléans.
Parisse, C., Benzitoun, C., Etienne, C., Liégeois, L. (2017). Agrégation automatisée de corpus de français parlé. Journées de Linguistique de Corpus, Grenoble, France.