Teaching TEI for Computer Science Students (paper)
Yael Netzer* Yael Netzer holds a PhD in Computer Science and an MA in Hebrew Literature from Ben Gurion University. She is a teaching fellow at BGU and a Digital Humanities expert at Dicta, the Israeli Center for Text Analysis.
Teaching Digital Humanities to Computer Science students raises questions about the extent to which CS students can engage with the content of literary or historical texts. For the last three years I have been teaching an undergraduate course in Digital Humanities in the department of CS at Ben Gurion University, Israel. It is an elective course for students in their third and final year, only a few of whom are double majors in CS and the humanities. For most of the students this is, practically, the first time that the input they feed their programs has meaning, and not only structure or type. Therefore, one of the first objectives of the course is to understand the difference between structure, form and content, and the difference between explicit data and the implicit metadata that must be added to the documents; since XML is not new to the students, TEI can be introduced quite naturally to express these concepts. We talk about the various materials (from manuscripts to scanned books, digital books, lists and tables, maps and art artifacts), review data structures and file types with respect to the notions of unstructured, semi-structured and structured documents, and then explore which kinds of automation can be used to produce these different types of documents, and how. We discuss questions of classification and terminology (criteria for classes, origins of existing classifications, etc.), the structure of catalogues, the basic ideas of natural language processing (NLP) and information retrieval, and then various existing standards and models of knowledge such as TEI and Linked Open Data; we show how these ideas are used to answer questions in literature (e.g., the Syuzhet project of Matthew Jockers1 and the criticism that followed it,2 which points to the limitations of a tool, such as lexicon-based sentiment analysis, that has not been adapted to its task).
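To make that limitation concrete in class, a toy scorer is enough. The following is a minimal sketch of lexicon-based sentiment scoring in Python; the tiny lexicon and the sentences are hypothetical, standing in for the large curated lexicons that real tools use.

    # A toy sentiment lexicon (hypothetical; real tools use curated lexicons).
    LEXICON = {"love": 1, "happy": 1, "hate": -1, "sad": -1}

    def sentence_score(sentence):
        """Sum the lexicon values of the words in a sentence."""
        return sum(LEXICON.get(word.strip(".,!?").lower(), 0)
                   for word in sentence.split())

    # The scorer has no notion of negation, so both sentences get the same
    # score -- the kind of limitation the Syuzhet criticism points to.
    print(sentence_score("I love this book"))         # 1
    print(sentence_score("I do not love this book"))  # 1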
In their assignments, the students practice collecting, cleaning, annotating and visualizing data from various resources that are related to or used in the Humanities, mostly Hebrew texts (e.g., the Ben Yehuda Project,3 the Lexicon of Hebrew Authors,4 lyrics sites, government protocols, etc.). Each assignment incrementally adds aspects not only of processing, but also of the work of the humanist. These skills, tools and practices later enable the students to automatically build collections of texts that are annotated with at least basic TEI, cleaned, enriched with linguistic information, linked to external resources, and visualized or processed in various manners.
At first, students manually annotate a few texts with TEI (using Oxygen). These texts are usually structured by nature: a letter, a poem, an entry of a lexicon. The emphasis in this task is on three main issues: identifying and collecting the metadata that should be placed in the header; understanding and identifying the structure of the body; and understanding what type of content can or should be annotated. Students sum up their experience in a report, and they are asked to identify what kinds of decisions they had to make (for instance, prepositions in Hebrew are agglutinated to the words, so annotating a location may split a word: b<placeName>Tel Aviv</placeName>). Next, they write code that automates the process. We discuss the types of information that can be easily identified (with regular expressions, for instance), what can be induced with NLP tools, and which annotations require more world-knowledge judgement. I usually ask them to annotate named entities (persons and locations, dates and roles). Their automatic work is evaluated against the documents they annotated manually. They sum up and reflect on the results.
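A minimal sketch of this automation step in Python: a hypothetical gazetteer of place names is wrapped in <placeName> elements with a regular expression, keeping an agglutinated prefix letter (transliterated here as "b") outside the element, as in the Tel Aviv example above.

    import re

    # Hypothetical gazetteer; place names are transliterated for illustration.
    GAZETTEER = ["Tel Aviv", "Jerusalem", "Beer Sheva"]
    PREFIXES = "b"  # stands in for Hebrew prepositional prefix letters

    def tag_places(text):
        """Wrap known place names in <placeName> tags, leaving any
        agglutinated prefix letter outside the element."""
        for place in GAZETTEER:
            pattern = r"\b([" + PREFIXES + r"]?)(" + re.escape(place) + r")\b"
            text = re.sub(pattern, r"\1<placeName>\2</placeName>", text)
        return text

    print(tag_places("She was born bTel Aviv and moved to Jerusalem."))
    # She was born b<placeName>Tel Aviv</placeName> and moved to
    # <placeName>Jerusalem</placeName>.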
Later in the course, students learn how to query linked-open-data resources, and can then process, enrich and visualize information from the texts. They "mix" different tools such as OpenRefine, Google Fusion Tables and Python code.
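As one sketch of such a query from Python: the Wikidata SPARQL endpoint and the property P625 (coordinate location) are real, but the choice of entity (assumed here to be Q33935, Tel Aviv) and the surrounding code are only an illustration of the kind of enrichment step done in class.

    import requests

    # Ask Wikidata's SPARQL endpoint for the coordinates of a place
    # (Q33935 is assumed to be the item for Tel Aviv; P625 is
    # "coordinate location").
    ENDPOINT = "https://query.wikidata.org/sparql"
    QUERY = "SELECT ?coord WHERE { wd:Q33935 wdt:P625 ?coord . }"

    response = requests.get(ENDPOINT,
                            params={"query": QUERY, "format": "json"},
                            headers={"User-Agent": "tei-course-example/0.1"})
    for row in response.json()["results"]["bindings"]:
        print(row["coord"]["value"])  # a WKT point, e.g. Point(34.78 32.08)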
In a final project, for example, students identified that in the lexicon the first section of every entry contains the author's biography and the second section contains the subjects of her books. Once this structure was identified and annotated, the students could write a program that compares the locations mentioned in each of these sections and explores the correlation between the two lists.
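A minimal sketch of that comparison, assuming each annotated entry holds two <div> sections (biography first, book subjects second) with <placeName> elements inside; the sample entry is invented.

    import xml.etree.ElementTree as ET

    TEI_NS = "{http://www.tei-c.org/ns/1.0}"

    def places_per_section(entry_xml):
        """Return one set of place names per <div> section of an entry."""
        root = ET.fromstring(entry_xml)
        return [{p.text for p in div.iter(TEI_NS + "placeName")}
                for div in root.iter(TEI_NS + "div")]

    entry = """<text xmlns="http://www.tei-c.org/ns/1.0">
      <div type="biography"><p>Born in <placeName>Jerusalem</placeName>,
        lived in <placeName>Tel Aviv</placeName>.</p></div>
      <div type="subjects"><p>Her novels are set in
        <placeName>Tel Aviv</placeName>.</p></div>
    </text>"""

    bio, subjects = places_per_section(entry)
    print(bio & subjects)  # places shared by the two sections
    print(bio - subjects)  # places mentioned only in the biography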
When evaluating the students' work I find that some students manage to understand and internalize DH ideas: they are excited by the wide variety of questions that can be asked, by the ability to look at content and realize that it is not just a string, and by the practical tools that exist for exploring an answer; they make an effort to evaluate and reflect on their findings in their assignments. On the other hand, some students are less sensitive to their results and content themselves with the notion that "if the program ran with no bug then I have succeeded."
In my talk I will show examples of projects and their outcomes.
Notes
- Matthew Jockers, "Revealing sentiment and plot arcs with the Syuzhet package," http://www.matthewjockers.net/2015/02/02/syuzhet/
- Annie Swafford, "Problems with the Syuzhet Package," https://annieswafford.wordpress.com/2015/03/02/syuzhet/
- http://benyehuda.org/ (the Hebrew parallel of Project Gutenberg)
- Heksherim Lexicon of Israeli Authors, Kinneret, Zmora-Bitan, Dvir Publishing House Ltd. and Heksherim: The Research Institute for Jewish and Israeli Literature and Israeli Culture, BGU, 2014.