Teaching TEI for Computer Science Students (paper)
Yael Netzer* Yael Netzer holds a PhD in Computer Science and an MA in Hebrew Literature from Ben Gurion University. She is a teaching fellow at BGU and a Digital Humanities expert at Dicta, the Israeli Center for Text Analysis.
Teaching Digital Humanities to Computer Science students raises questions about the extent to which CS students can engage with the content of literary or historical texts. For the last three years I have been teaching an undergraduate course in Digital Humanities in the department of CS at Ben Gurion University, Israel. It is an elective course for students in their third and final year, only a few of whom are double majors in CS and the humanities. For most of the students this is, practically, the first time that the input they feed their programs has meaning, and not only structure or type. Therefore, one of the first objectives of the course is to understand the difference between structure, form and content, and the difference between explicit data and the implicit metadata that must be added to the documents; since XML is not new to the students, TEI can be introduced quite naturally to express these concepts. We talk about the various materials (from manuscripts to scanned books, digital books, lists and tables, maps and art artifacts), review data structures and file types with respect to the notions of unstructured, semi-structured and structured documents, and then explore which kinds of automation can be used to produce these different types of documents, and how. We discuss questions of classification and terminology (criteria for classes, origins of existing classifications, etc.), the structure of catalogues, the basic ideas of natural language processing (NLP) and information retrieval, and then various existing standards and models of knowledge such as TEI and Linked Open Data; we show how these ideas are used to answer questions in literature (e.g., the Syuzhet project of Matthew Jockers1 and the criticism that followed it,2 which points to the limitations of a tool, such as lexicon-based sentiment analysis, that has not been adapted to its task).
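To make that limitation concrete in class, a toy scorer is enough. The following is a minimal sketch of lexicon-based sentiment scoring in Python; the tiny lexicon and the sentences are hypothetical, standing in for the large curated lexicons that real tools use.

    # A toy sentiment lexicon (hypothetical; real tools use curated lexicons).
    LEXICON = {"love": 1, "happy": 1, "hate": -1, "sad": -1}

    def sentence_score(sentence):
        """Sum the lexicon values of the words in a sentence."""
        return sum(LEXICON.get(word.strip(".,!?").lower(), 0)
                   for word in sentence.split())

    # The scorer has no notion of negation, so both sentences get the same
    # score -- the kind of limitation the Syuzhet criticism points to.
    print(sentence_score("I love this book"))         # 1
    print(sentence_score("I do not love this book"))  # 1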
In their assignments, the students practice collecting, cleaning, annotating and visualizing data from various resources that are related to or used in the Humanities, mostly Hebrew texts (e.g., the Ben Yehuda Project,3 the Lexicon of Hebrew Authors,4 lyrics sites, government protocols, etc.). Each assignment incrementally adds aspects not only of processing, but also of the work of the humanist. These skills, tools and practices later enable the students to automatically build collections of texts that are annotated with at least basic TEI, cleaned, enriched with linguistic information, linked to external resources, and visualized or processed in various manners.
At first, students manually annotate a few texts with TEI (using Oxygen). These texts are usually structured by nature: a letter, a poem, an entry of a lexicon. The emphasis in this task is on three main issues: identifying and collecting the metadata that should be placed in the header; understanding and identifying the structure of the body; and understanding what type of content can or should be annotated. Students sum up their experience in a report, and they are asked to identify what kinds of decisions they had to make (for instance, prepositions in Hebrew are agglutinated to the words, so annotating a location may split a word: b<placeName>Tel Aviv</placeName>). Next, they write code that automates the process. We discuss the types of information that can be easily identified (with regular expressions, for instance), what can be induced with NLP tools, and which annotations require more world-knowledge judgement. I usually ask them to annotate named entities (persons and locations, dates and roles). Their automatic work is evaluated against the documents they annotated manually. They sum up and reflect on the results.
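A minimal sketch of this automation step in Python: a hypothetical gazetteer of place names is wrapped in <placeName> elements with a regular expression, keeping an agglutinated prefix letter (transliterated here as "b") outside the element, as in the Tel Aviv example above.

    import re

    # Hypothetical gazetteer; place names are transliterated for illustration.
    GAZETTEER = ["Tel Aviv", "Jerusalem", "Beer Sheva"]
    PREFIXES = "b"  # stands in for Hebrew prepositional prefix letters

    def tag_places(text):
        """Wrap known place names in <placeName> tags, leaving any
        agglutinated prefix letter outside the element."""
        for place in GAZETTEER:
            pattern = r"\b([" + PREFIXES + r"]?)(" + re.escape(place) + r")\b"
            text = re.sub(pattern, r"\1<placeName>\2</placeName>", text)
        return text

    print(tag_places("She was born bTel Aviv and moved to Jerusalem."))
    # She was born b<placeName>Tel Aviv</placeName> and moved to
    # <placeName>Jerusalem</placeName>.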
Later in the course, students learn how to query linked-open-data resources, and can then process, enrich and visualize information from the texts. They "mix" different tools such as OpenRefine, Google Fusion Tables and Python code.
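As one sketch of such a query from Python: the Wikidata SPARQL endpoint and the property P625 (coordinate location) are real, but the choice of entity (assumed here to be Q33935, Tel Aviv) and the surrounding code are only an illustration of the kind of enrichment step done in class.

    import requests

    # Ask Wikidata's SPARQL endpoint for the coordinates of a place
    # (Q33935 is assumed to be the item for Tel Aviv; P625 is
    # "coordinate location").
    ENDPOINT = "https://query.wikidata.org/sparql"
    QUERY = "SELECT ?coord WHERE { wd:Q33935 wdt:P625 ?coord . }"

    response = requests.get(ENDPOINT,
                            params={"query": QUERY, "format": "json"},
                            headers={"User-Agent": "tei-course-example/0.1"})
    for row in response.json()["results"]["bindings"]:
        print(row["coord"]["value"])  # a WKT point, e.g. Point(34.78 32.08)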
In a final project, for example, students identified that in the lexicon the first section of every entry contains the author's biography and the second section contains the subjects of her books. Once this structure was identified and annotated, the students could write a program that compares the locations mentioned in each of these sections and explores the correlation between the two lists.
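A minimal sketch of that comparison, assuming each annotated entry holds two <div> sections (biography first, book subjects second) with <placeName> elements inside; the sample entry is invented.

    import xml.etree.ElementTree as ET

    TEI_NS = "{http://www.tei-c.org/ns/1.0}"

    def places_per_section(entry_xml):
        """Return one set of place names per <div> section of an entry."""
        root = ET.fromstring(entry_xml)
        return [{p.text for p in div.iter(TEI_NS + "placeName")}
                for div in root.iter(TEI_NS + "div")]

    entry = """<text xmlns="http://www.tei-c.org/ns/1.0">
      <div type="biography"><p>Born in <placeName>Jerusalem</placeName>,
        lived in <placeName>Tel Aviv</placeName>.</p></div>
      <div type="subjects"><p>Her novels are set in
        <placeName>Tel Aviv</placeName>.</p></div>
    </text>"""

    bio, subjects = places_per_section(entry)
    print(bio & subjects)  # places shared by the two sections
    print(bio - subjects)  # places mentioned only in the biography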
When evaluating the students' work I find that some students manage to understand and internalize DH ideas: they are excited by the wide variety of questions that can be asked, by the ability to look at content and realize that it is not just a string, and by the practical tools that exist for exploring an answer; they make an effort to evaluate and reflect on their findings in their assignments. On the other hand, some students are less sensitive to their results and content themselves with the notion that "if the program ran with no bug then I have succeeded."
In my talk I will show examples of projects and their outcomes.
Notes
- Matthew Jockers, "Revealing sentiment and plot arcs with the Syuzhet package," http://www.matthewjockers.net/2015/02/02/syuzhet/
- Annie Swafford, "Problems with the Syuzhet Package," https://annieswafford.wordpress.com/2015/03/02/syuzhet/
- http://benyehuda.org/ (the Hebrew parallel of Project Gutenberg)
- Heksherim Lexicon of Israeli Authors, Kinneret, Zmora-Bitan, Dvir Publishing House Ltd. and Heksherim: The Research Institute for Jewish and Israeli Literature and Israeli Culture, BGU, 2014.