Programs and Sessions

1. Keynote presentations

1.1. TEI at 20: Congratulations! The Next 20 Will Tell The Tale

Speaker: B. Tommie Usdin (Mulberry Technologies)

Chair: Syd Bauman (Brown University)

The TEI's accomplishments are something to be proud of; just surviving 20 years is a major accomplishment! In addition, the TEI has provided useful tools for a variety of projects and has influenced the larger computing world in subtle but important ways. TEI's future is far more interesting than its past. It is time for the TEI, at 20, to think hard about the future, and to take steps to shape its future. The time for youthful indiscretions is past, as is the time for pretending that just being good (or better, or even best) is good enough. If the next 20 years of the TEI are to be celebrated, the community must set a clear path for itself and focus on a defined but limited set of goals. While I cannot say what those goals should be, I will pose some questions that I think the TEI has not answered, but should answer, in shaping its goals for the next 20 years.

B. Tommie Usdin is president of Mulberry Technologies, Inc., a consultancy specializing in XML and SGML. Ms. Usdin has been working with SGML since 1985 and has been a supporter of XML since 1996. She chaired the Extreme Markup Languages conferences, and will chair Balisage: The Markup Conference. She was co-editor of Markup Languages: Theory & Practice, published by the MIT Press. Ms. Usdin has developed schemas (in DTD, XSD, and RNG syntax) and XML/SGML application frameworks for applications in government and industry. Projects include reference materials in medicine, science, engineering, and law; semiconductor documentation; and historical and archival materials. Distribution formats have included print books, magazines, and journals, and both web- and media-based electronic publications.

1.2. Teaching TEI: The need for TEI by Example

Speaker: Melissa Terras (University College London)

Chair: Matthew Driscoll

In order to expand the user base of TEI, it is important that it is taught at University level to students undertaking vocational courses, especially in the area of digital libraries and electronic communication. However, there are limited teaching materials available which support University level teaching of TEI. Materials which are available are not in formats which would enable tutorials to be provided in classroom settings, or allow individuals to work through graded examples in their own time: the common way of learning new computational techniques through self-directed learning.

This paper will present the work currently being carried out by the TEI by Example project to develop online tutorials in TEI which present examples of code from which users can learn. The project is supported by The Centre for Scholarly Editing and Document Studies (CTB) of the Royal Academy of Dutch Language and Literature, the Centre for Computing in the Humanities (CCH) of King's College London, and the School for Library, Archive, and Information Studies (SLAIS) of University College London. This paper will report on progress, problems, and potential solutions in developing teaching materials for the TEI, and highlight areas in which the TEI community can aid in the usefulness of materials for students and teachers.

Melissa Terras is a Lecturer in Electronic Communication in the School of Library, Archive and Information Studies at University College London (UCL), teaching Internet Technologies, Digital Resources in the Humanities, and Web Publishing. With a background in History of Art and English Literature, Computing Science, and a DPhil in Engineering Science, her research focuses on uses of computing in the arts and humanities that enable research which would otherwise be impossible. Areas of interest include Humanities Computing, Digitization and Digital Imaging, Artificial Intelligence, Palaeography, Knowledge Elicitation, and Internet Technologies. Recent research includes the ReACH (Researching e-Science Analysis of Census Holdings) project, LAIRAH (Log Analysis of Internet Resources in the Arts and Humanities), and the recently JISC-funded VERA project (Virtual Environments for Research in Archaeology). She is on the board and executive of various subject organizations, is a general editor of Digital Humanities Quarterly, and is co-manager of TEI by Example.

1.3. TEI in a crystal ball

Speaker: Fotis Jannidis (Technical University of Darmstadt)

Chair: Lou Burnard (Oxford University Computing Services)

The talk analyzes the history of the TEI, tries to isolate some trends in its development and tries to extrapolate these trends into the future. Aspects to be included:

Fotis Jannidis, born 1961 in Frankfurt a.M. / Germany, studied German and English in Trier and Munich. He wrote his Ph.D. thesis on Goethe and his second book on characters in narratives. He is now Professor of German literary studies at the Technical University of Darmstadt. His interest in humanities computing dates back to the late 1980s, when he earned his money by transcribing texts for one of the first German digital editions. Since 1996 he has been promoting TEI in the German community of literary editors. In the same year he started – together with Karl Eibl and Volker Deubel – the online journal Computerphilologie. Since 2006 he has been involved in the e-humanities project Textgrid, a workplace for literary scholars and linguists using grid technology.

2. Roundtable Discussion: "New Directions in Digital Funding"

A featured roundtable, "New Directions in Digital Funding", will include representatives from major funding agencies including:

3. Panel sessions

3.1. There is more than one way to get there from here

Convenor: Kevin Hawkins

This panel session provides attendees with opportunities to learn the many ways TEI documents are created, maintained, and presented within our community. Specifically, small, medium, and large-scale as well as academic, commercial, and production-level implementations will be compared and contrasted.

3.1.1. Encoding the Metadata of Medieval and Renaissance Manuscripts for the Digital Scriptorium Project

Ben Panciera and Rob Fox (University of Notre Dame)

The University of Notre Dame is participating in the Digital Scriptorium project being administered and hosted by Columbia University. Our goal is to thoroughly describe our collection of Medieval and Renaissance manuscripts using P5 and to give these descriptions to Columbia, which will then incorporate them into a larger catalog of manuscripts. Medieval studies graduate students first examine the manuscripts and record as much metadata about them as possible. A librarian then supplements the metadata and marks it up using JEdit. To make the content searchable and browsable, a programmer indexes the resulting TEI files using Lucene, provides access to the index through a Search/Retrieve via URL (SRU) interface, and transforms search results into HTML using Perl and XSLT. This standards-based and open source combination of tools has allowed us to create a modular process whose parts are easily substituted for other components as necessary.
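As a rough illustration of the SRU step described above, the following sketch builds a searchRetrieve request URL. It is not the project's code (the abstract mentions Perl); the endpoint address, the CQL index name and the function name are assumptions made up for the example. Only the parameter names (operation, version, query, maximumRecords) come from the SRU protocol itself.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_sru_url(base_url, cql_query, max_records=10):
    """Return an SRU searchRetrieve URL for the given CQL query."""
    params = {
        "operation": "searchRetrieve",   # standard SRU operation name
        "version": "1.1",                # assumed protocol version
        "query": cql_query,              # CQL query string
        "maximumRecords": str(max_records),
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and index name, for illustration only.
url = build_sru_url("http://example.org/sru", 'title = "psalter"', 5)

# Decode the query string again to show what the server would receive.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

Because the interface is just a URL convention, the index behind it (Lucene in the abstract) can be swapped without changing clients, which is the modularity the authors emphasise.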

3.1.2. Digital Scholarship from Heterogeneous TEI Sources: Workflows and Tools at the University of Virginia Press's ROTUNDA Imprint

David Sewell (University of Virginia Press)

For four years, the University of Virginia Press has been preparing and publishing digital scholarly publications based in all cases on documentary data encoded in TEI-XML. But our seven completed publications differ widely in the origin and nature of their underlying data (born-digital versus digitized print, for example), their versions and/or dialects of TEI, and their user interfaces. We will discuss the workflows we have developed for working with both project authors and data conversion vendors; the tools we use for editing, transforming, and checking XML data; and our publication delivery system.

3.1.3. Large-Scale TEI Production as a Part of the Text Creation Partnership

Paul Schaffner (University of Michigan)

The University of Michigan has been producing TEI-encoded electronic texts as a part of the Text Creation Partnership since 2000. To date we have produced just about 20,000 items. Most of the content is keyed and marked up from images scanned from microfilm, minimally but functionally encoded using the P3/P4 SGML tag sets, enhanced with the data from MARC records, and converted into various XML formats for local indexing or distribution to partner institutions, where they are managed under different retrieval systems. Because of the volume of content and scale of the operation, the entire process tends towards simplicity, modularity, and transparency. Expedients include judicious pruning of the tagset and character inventory, generic and adaptable coding guidelines, a transparent and modular process, use of simple and well-documented tools (text editors, Perl, XSLT, batch files, etc.), and a constant readiness to modify any or all of these at any moment. This session describes our process in greater detail.

3.2. Manuscript encoding

Convenor: Elena Pierazzo (Centre for Computing in the Humanities, King's College London, United Kingdom)

The panel offers an overview of the Manuscripts SIG's areas of discussion. The three papers cover different sectors of the complex world of manuscript transcription and edition, ranging from a workable solution for a digital manuscript catalogue (Marek), to the encoding of handwriting in modern correspondence (Ohara), to a proposal for a standard solution for manuscript transcription (Lavrentiev). Two of the papers concern traditional manuscripts (ancient and medieval) and one concerns modern correspondence; the three come from different scholarly and library traditions.

The presentation format chosen (pre-circulated papers) will allow a wide discussion of current practice in manuscript encoding and will help revitalise the Manuscript SIG. The presence of papers that come from different traditions (ancient and modern, aimed at transcription and at cataloguing) will help to hold together the different strands of the Manuscript SIG, which bring different approaches and perspectives to the common problem of transcribing and cataloguing manuscripts.

All papers assume the use of P5 and raise methodological issues, moving from concrete encoding practice to a more general theoretical framework and suggesting possible new developments for the TEI.

3.2.1. Imitative transcriptions of European medieval manuscripts: what for and how?

Alexei Lavrentiev (ENS Lettres et Sciences Humaines, Lyon, France)

Traditional editions of medieval texts can be roughly divided into two categories: diplomatic editions that try to follow the source manuscript closely, and critical editions that aim to correct some scribal errors and to make the text more readable to a modern public. As far as manuscripts written in the Latin alphabet are concerned, features of critical editions include introducing the u/v and i/j distinctions, expanding abbreviations, neutralizing letter variants (like 'long s' or 'round r'), introducing some diacritics (like ç or é in French texts), and normalizing word separation and punctuation. Diplomatic editions (or transcriptions) may preserve original i, j, u, and v characters and do without diacritics. Abbreviations are usually expanded but the supplied letters are put in brackets or italics. Letter variants, word segmentation and punctuation are normalized in most cases (as they are in critical editions).

However, researchers – linguists in particular – may be interested in studying abbreviation marks, 'abnormal' word segmentation or punctuation, and paleographers would go even further in analyzing character glyphs. Reproducing all these features in a paper edition would make it virtually unreadable for a non-specialist. 'Imitative' paper editions of this kind do exist but they are very rare.

The situation has changed drastically with electronic editions. Computer technologies make it possible to create multi-layer transcriptions in which the reader may choose the presentation form he finds the most appropriate for a specific study. A number of projects including digital imitative manuscript transcriptions have been carried out in different countries. However, methodological principles and technological solutions adopted in these projects are far from being stable and interoperable. Even the projects based on XML technology and following TEI Guidelines differ considerably in the way the imitative transcriptions are encoded and linked with the other edition layers.

In this paper I will present the manuscript transcription schema used in the Base de Français Médiéval (BFM) project and compare it to a 'standard' solution based on the TEI P5 recommendations. TEI P5 offers a powerful tool for 'abnormal' character encoding, the <g> element. It seems however that some further developments are necessary to ensure an adequate and transparent transcription encoding.

BFM manuscript encoding practice is based on the experience of the Charrette Project and on the recommendations formulated by the Menota Project. I will argue that one of the crucial choices to be made in a transcription project is the structural level at which the layers of transcription should split (character or word). I will consider the pros and contras of both solutions and will argue that the word level appears to be a reasonable compromise between adequacy of data representation and ease of processing.
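To make the idea of splitting transcription layers at the word level concrete, here is a minimal sketch. It is not the BFM schema, which the abstract does not reproduce. It assumes a simplified encoding in which each word is a TEI w element that may hold a choice between an orig (imitative) and a reg (normalized) reading. The sample line and the function name read_layer are invented for the example.

```python
import xml.etree.ElementTree as ET

T = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace in ElementTree notation

# Invented sample: two words carry both layers, one word is identical in both.
SAMPLE = """<l xmlns="http://www.tei-c.org/ns/1.0">
  <w><choice><orig>ſire</orig><reg>sire</reg></choice></w>
  <w><choice><orig>cheualier</orig><reg>chevalier</reg></choice></w>
  <w>vint</w>
</l>"""


def read_layer(line, layer):
    """Return the text of one layer ('orig' or 'reg'), word by word."""
    words = []
    for w in line.iter(T + "w"):
        choice = w.find(T + "choice")
        if choice is None:
            # No alternative readings: the word is shared by all layers.
            words.append("".join(w.itertext()).strip())
        else:
            words.append("".join(choice.find(T + layer).itertext()).strip())
    return " ".join(words)


line = ET.fromstring(SAMPLE)
print(read_layer(line, "orig"))
print(read_layer(line, "reg"))
```

With the split made per word, each layer can be rebuilt by one pass over the w elements. A character-level split would need the same choice around individual letters inside a word, which is more faithful but harder to process.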

3.2.2. Is it OK for a machine-readable text to become more complex than the original edition? The case of graphological study of the letters of Margaret Paston.

Osamu Ohara (Jikei University School of Medicine)

According to Norman Davis, there are no letters of Margaret Paston actually written by herself. All of her letters were written by her sons or her closest employees. Davis says that there are 29 different styles of handwriting found in the letters. Some letters were written by one amanuensis from beginning to end, but others were written by two or more of them.

Through graphological comparisons of several graphemes in those letters, however, I have shown (Ohara 2006) that one of them was not necessarily written by the indicated amanuensis and that a re-examination of the handwriting style of each amanuensis is necessary. In the TEI P5 markup system, a change of hand is shown by the element <handShift>, which acts as a landmark at the point where the old hand ends and a new hand starts.

This paper will examine two issues. The first is whether an electronic text of the letters of Margaret Paston compiled on the basis of TEI P5 can have a useful structure for the study of graphological differences among the amanuenses, and the second is whether it is all right for this kind of text to be made more machine-readable at the cost of complexity and nonlinearity of the contents themselves.
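The landmark behaviour of <handShift> can be illustrated with a short sketch that cuts a paragraph into runs of text, one per hand. The sample text is a placeholder, not a quotation from the Paston letters; the hand identifiers and the function name split_by_hand are likewise assumptions for the example.

```python
import xml.etree.ElementTree as ET

T = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace in ElementTree notation

# Placeholder content; handShift is an empty element marking a change of hand.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    '<handShift new="#handA"/>Text in the first hand. '
    '<handShift new="#handB"/>Text in the second hand.'
    "</p>"
)


def split_by_hand(elem):
    """Return (hand, text) pairs in document order."""
    segments = []
    hand = None  # unknown until the first handShift is met
    if elem.text and elem.text.strip():
        segments.append((hand, elem.text.strip()))
    for child in elem:
        if child.tag == T + "handShift":
            hand = child.get("new")
            text = child.tail
        else:
            # Other inline elements belong to whichever hand is current.
            text = "".join(child.itertext()) + (child.tail or "")
        if text and text.strip():
            segments.append((hand, text.strip()))
    return segments


for hand, text in split_by_hand(ET.fromstring(SAMPLE)):
    print(hand, "|", text)
```

Because the element is a point marker rather than a container, the text attributed to a hand has to be reconstructed by walking the document in order. This is one small instance of the nonlinearity the paper asks about.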

3.2.3. Describing medieval manuscripts for the electronic environment: the problem of retrieval elements

Jindřich Marek (National Library of the Czech Republic, Prague)

When we talk about manuscript description, we mean above all the encoding of descriptions of medieval manuscripts. The earlier approach to manuscript description was shaped by the aim of institutions to provide a comprehensive overview of their collections, sorted by provenance group and/or by the language used in the manuscripts. The form of records in such a catalogue raisonné is influenced by the printed environment. These catalogues are accessed through printed indexes or through general indexes.

The TEI Guidelines (P5, Chapter 10, Manuscript Description) declare two possible goals for the use of the Manuscript Description module. The first is to represent an existing (printed) catalogue; the second is to present the end result of a brand new description. We will consider especially the second goal. It is clear that search (which replaces indexes) is of the utmost importance in the electronic environment. This is the reason why the correct use of retrieval elements is required. Unfortunately, most cataloguers are used to working in the printed environment and have trouble understanding how to handle the electronic environment.

This presentation aims to investigate the differences between the electronic and printed environments in the field of manuscript description, especially in the description of manuscript contents. It is based on concrete problems and concrete examples of the use of elements. These examples were developed while cataloguing the collection of the National Library of the Czech Republic for the Manuscriptorium system. Further, the presentation will investigate what is common to both environments; first of all, this is the aim of providing information about the manuscript. The number of manuscripts described in printed catalogues also raises the question of a possible retrospective conversion of existing catalogue records. Some examples of automatic conversion carried out at the National Library will follow, showing the advantages and disadvantages of such a process, especially with regard to retrieval elements. In addition, the presentation will discuss an advisable form of record to which the transcription of a TEI-encoded edition of the manuscript text can be appended semi-automatically.

The problem of the transition from P4 (MASTER) to P5 (Manuscript Description module) bears on these questions. The presentation will show what has changed in P5 with respect to retrieval elements. Lastly, the presentation will focus on problems of organization and institutional workflow, which are closely connected with the requirements for catalogue records mentioned above. The final question of my presentation will be: do we need concrete rules for the use of the Manuscript Description module, as a further specification of the Guidelines, or not?

3.3. Outside the box

Chair: Sebastian Rahtz (Oxford University Computing Services)

3.3.1. Mapping from TEI to CIDOC-CRM: Will the New TEI Elements Make any Difference?

Øyvind Eide and Christian-Emil Ore (Unit for Digital Documentation, University of Oslo)

Mapping of TEI

In May and June 2004, there was a discussion on the TEI mailing list about prosopographical tags. This led to a suggestion that detailed information about persons (physical and legal), dates, events, places, objects etc. and their interpretation could be marked up outside the text. Such external mark-up could then be connected to on-going ontology work being done e.g. in the Museum community (CIDOC-CRM). The result of this was the establishment of an Ontologies SIG at the 4th annual members meeting of TEI in October 2004 (TEI Ontologies SIG).

During the three years that have passed since then, work has mainly been done in relation to the CIDOC-CRM ontology (CIDOC 2003). The work has addressed a general mapping of the TEI standard (Ore 2006), (Eide 2007), as well as semantic tagging of knowledge areas not commonly expressed in TEI (Eide 2006). Work has also been done to create web systems using Topic Maps partly based on information from TEI documents (Tuohy 2006).

Mapping of TEI documents

In this paper, we will discuss mapping from TEI to CIDOC-CRM based on practical experience with documents. We will use TEI documents with place names, person names and dates marked up as input. Based on such documents, we will demonstrate the creation of a CIDOC-CRM model from the tagged place names, person names and dates. Further, we will discuss the usability of such models in information integration, as well as what is not included in the model because it is not marked up in the TEI sources.

We will then go through recent and proposed additions to TEI P5 and see how documents in which such elements are used may lead to better and more usable CIDOC-CRM models. We will study the use of the person element, the place element and the proposed event element. We will also discuss the added work needed for such tagging, as well as the added level of interpretation, possibly leading to documents with less general acceptance.
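The kind of mapping discussed in this abstract might be sketched, in a deliberately crude form, as follows. This is not the authors' mapping: the correspondence table is an assumption that pairs each TEI tag directly with one CIDOC-CRM class, whereas a fuller model would distinguish a name from the entity it names. The sample sentence and its names are invented.

```python
import xml.etree.ElementTree as ET

T = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace in ElementTree notation

# Assumed, simplified correspondence between TEI tags and CIDOC-CRM classes.
TAG_TO_CRM = {
    "persName": "E21 Person",
    "placeName": "E53 Place",
    "date": "E52 Time-Span",
}

# Invented sample text.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    "<persName>Anna Example</persName> travelled to "
    "<placeName>Exampletown</placeName> in "
    '<date when="1850">1850</date>.'
    "</p>"
)


def extract_entities(root):
    """Return (CRM class, label) pairs for every mapped TEI element."""
    entities = []
    for elem in root.iter():
        local_name = elem.tag.replace(T, "")
        if local_name in TAG_TO_CRM:
            label = "".join(elem.itertext()).strip()
            entities.append((TAG_TO_CRM[local_name], label))
    return entities


entities = extract_entities(ET.fromstring(SAMPLE))
for crm_class, label in entities:
    print(crm_class, "|", label)
```

The sketch also shows the limitation the authors point to: the journey linking the person, the place and the date appears only in untagged prose, so nothing in the extracted model records it.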

3.3.2. Reaching out to new communities: TEI, History and GIS

Dr. Miranda Remnek (Head, Slavic & East European Library, UIUC)

Traditional audiences for TEI applications have included linguists and literary scholars. Other scholarly groups have been less involved, including historians – even though, at the very least, the early texts routinely encoded by literary medievalists are of clear interest to many lines of historical inquiry. Thus, if one of the goals of the TEI consortium is to devise strategies to expand the breadth of TEI applications across the disciplines, TEI educators should surely consider ways to mesh effectively and more broadly with new directions in history. In doing so, they should also consider the potential application of GIS (Geographic Information Systems), which are now increasingly widely used outside the science and social science communities in which they arose.

TEI experts are, of course, already aware of the expansion of GIS in scholarly analysis, and the Council is currently considering changes in tagging practices related to geographic names. But promoting linkages between the TEI and GIS as a means of satisfying additional needs for scholarly analysis is only half the story where history is concerned. Although GIS applications have long been valued by social scientists, historians remain divided on the virtues of such applications, despite the overtly spatial nature of much historical inquiry. Work is therefore needed not only to assist cutting-edge efforts to integrate GIS with textual data, but also to account for the specific chronology problems experienced by historians in their application of GIS, and other complications arising from incomplete historical evidence.

Such obstacles notwithstanding, this paper will attempt to propose some strategies for reaching out to history scholars by emphasizing the potential benefits of linking GIS with TEI. It will begin by (1) investigating and summarizing the state of the art in terms of enriching TEI textual data with GIS applications; and (2) discussing the specific problems related to GIS applications in history articulated by scholars like Ian Gregory in European studies and Martyn Jessop in Slavic studies. It will continue by (3) describing various current and future digital history projects (like the Early 19th C Russian Readership and Culture Project and the Islam-Eurasia project) which already have or are considering a TEI component and are planning to develop GIS components (particularly in relation to travel texts); and (4) suggesting ways in which such projects can help attract historians to the use of TEI by incorporating cutting-edge techniques like GIS, and by demonstrating how such technology can successfully interact with encoded texts to produce new analysis paradigms for historical data. The plan will be to provide a succinct overview of the issues and pitfalls to consider when using GIS applications to expand the use of TEI encoding throughout the history community.

3.3.3. Music Encoding: A New Direction for the TEI

Perry Roland (University of Virginia)

There is a growing recognition by the music community that existing schemes for representing music notation are inadequate:

Most are hardware or software dependent. The lack of acceptance of specialized hardware input devices for music, such as tablets and touch screens, and of outdated storage mechanisms, like punched cards and paper tape, should underline the need for hardware neutrality in new representations. However, software dependency, that is, relying on the operating system 'du jour', is equally undesirable.

Nearly all current music representations are also severely limited in scope. Existing representations frequently define their approach to music encoding too narrowly, concentrating on a particular use of the data such as printing or automated performance. Usually, they place a great deal of emphasis on the visual, rather than the semantic, characteristics of the musical text. Other representations, such as the Standardized Music Description Language (SMDL), have attempted to represent music too broadly. SMDL has been unable to attract a large user group in part because it is difficult for potential users and tool developers to see how SMDL might apply to their particular situation.

Many existing codes are also proprietary. Therefore, their use for information exchange is severely limited.

The Music Encoding Initiative (MEI) project was started to address these and other music encoding issues. MEI has been gaining favorable reception among those engaged in digital music scholarship. For example, the MeTAMuSe project has said, 'In contrast to MusicXML, which is the de facto industry standard, but which is rather limited in the representation of musicological concepts such as multiple divergent sources, MEI has definite advantages in the musicological context'. MEI has also been described as one of 'two really serious contenders' in this problem space by Michael Kay.

MEI's abilities to encode variant/parallel readings, encode music notation other than common music notation (CMN), support non-transcriptional text commentary / annotation, and allow linking to external media, such as page images or performances, will hopefully foster new efforts to create scholarly editions of music using XML. Such content-based encoding modeled on text encoding formats is not only best-suited to the development of these digital editions, but can potentially best document the intellectual process of the development of the corpus, making the critical work better suited for verification and scholarly argument.

MEI was deliberately modeled after TEI, so there are many similarities between them. MEI shares its design philosophy and design characteristics ('comprehensive, declarative, explicit, hierarchical, formal, flexible, and extensible') with TEI. It also provides extension / restriction and internationalization mechanisms like those found in TEI-P4. The MEI meta-data header is similar to the TEI header.

Due to these similarities, there may be opportunities for cooperation between MEI and TEI. Until now, the design of MEI has been a one-man operation. MEI could certainly benefit from the extensive, albeit probably non-musical, markup expertise of the TEI membership. Likewise, TEI members could benefit from the ability to include music markup within their TEI-encoded texts. Perhaps the formation of a music-encoding special interest group within the TEI Consortium would be an appropriate action.

3.4. Markup as theory

Convenor: Daniel O'Donnell

3.4.1. Markup schema as theory

Wendell Piez

Considering a markup schema as a "theory" of the text requires (1) defining what we mean by "schema", and (2) considering what we mean by "theory of the text". As to (1), I think it reasonable to define "schema" as a formal or informal set of constraints to be enforced on a given set of markup instances (a "document type" in SGML parlance). For current discussions, it is probably sufficient to limit considerations to XML schemas, at least initially. As to (2), I think that keeping an eye on this issue as we go forward is perhaps the best tactic for proceeding.

According to our definition of schema, any of these would qualify:

  1. A formal set of declarations, in XML DTD syntax, XSD or RelaxNG, providing content models and attribute declarations (DTD), element and attribute content types (simple or complex) (XSD) or patterns (RNG), to which all the elements in any given XML instance are expected to conform.
  2. A set of Schematron assertions, which may operate like content models or may include more outlandish kinds of constraints, such as "every @id begins with the string 'ID'" (<sch:assert test="not(//@id[not(starts-with(.,'ID'))])"/>).
  3. A set of constraints expressed in natural language, such as "quote elements may contain paragraphs, or text with mixed content consisting of inline elements, but not both", or "every @id begins with the string 'ID'".
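The example constraint shared by items 2 and 3, "every @id begins with the string 'ID'", can also be checked procedurally. The sketch below is one possible reading of that rule, not a Schematron implementation. The function name and the two sample documents are invented, and it looks only at un-namespaced id attributes.

```python
import xml.etree.ElementTree as ET


def ids_violating_rule(xml_text, prefix="ID"):
    """Return every id attribute value that does not start with the prefix."""
    root = ET.fromstring(xml_text)
    return [
        elem.get("id")
        for elem in root.iter()
        if elem.get("id") is not None and not elem.get("id").startswith(prefix)
    ]


# Invented sample instances: one conforming, one not.
GOOD = '<doc><div id="ID1"><p id="ID2"/></div></doc>'
BAD = '<doc><div id="ID1"><p id="x2"/></div></doc>'

print(ids_violating_rule(GOOD))
print(ids_violating_rule(BAD))
```

Returning the offending values rather than a bare pass or fail suits the "soft" use of such constraints: the result is a list of documents or nodes worth a closer look, whatever one later decides about who is "wrong".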

The fundamental problem in considering schema-as-theory is in determining the "modality" of the theory. Is it a "must", a "should" or a "does"? (See Renear 2000, 2003.)

All of these may be tested by the question "what is the status of a document claimed to be a member of the type, but not conformant with the schema"? Ordinarily, we would say such a document is "wrong" and needs to be fixed. Yet this assumes that constraints declared in the schema must always trump expressions of tagging in a document; it does not allow for the possibility that the schema may be wrong. Not only is this perhaps too Draconian in practice (schemas do admit of improvements); it actually sets aside one of the more useful applications of a schema (especially "soft" schemas such as Schematron assertion sets): the identification of documents that fail to conform not because they need to be fixed, but because they are interesting. (See Birnbaum 1997.)

Of course, considered under the mantle of "schema as theory", such an application might be admitted by saying that we do not expect our theories to be comprehensive or correct; indeed there is heuristic value to them even or especially when they are found not to be.

The distinction between prospective and retrospective markup is also relevant (Piez 2001). When the purpose of tagging is considered to be prospective, the constraints expressed by a schema can generally be taken as a subset and stand-in for the constraints required by a processing architecture for "correct" operation. When tagging is retrospective and the correct or adequate representation in markup of a set of original documents is the goal, the phenomena presented by these documents may – given an adequate expression of the goals of the documents' representation in markup – be considered to trump the schema.

In the former case (markup considered as prospective), it seems grandiose to consider a schema to be a "theory of the text". It is nothing but a warranty that certain downstream operations may be performed confidently. It is no more a "theory of the text" than the statement "I can drive my car to the grocery store and return with the groceries packed in the trunk" is a theory of what it is to be an automobile.

In the latter case (retrospective markup, like most TEI applications), terming it a "theory of the text" may be more apropos – except that interestingly enough, this theory may be most interesting and illuminating when it fails. That is to say, if an adequate statement of the goals of representation in markup is necessary for us to determine whether, in a given case of an invalid document, the document or the schema is correct, then it is that statement which presents the theory to which we must appeal – determining the schema to be merely an adequate or inadequate expression of that theory.

3.4.2. Literary Text as Equation: The Critical Implications of Digital Humanism

John Carlson

While textual scholarship in the later print era was defined by disagreement over the relative value of reconstructed authorial works and readings found in physical texts, an equally important argument over the role of empiricism informed much of that period's earlier criticism. On the one hand, Karl Lachmann's stemmatic analysis and the variant calculus of W.W. Greg typified the arguments of those favoring a methodical approach to literary studies closely modeled on the disciplines of math and science. Opposing what they took to be a crutch for unimaginative thinkers, on the other hand, scholars like A.E. Housman countered that interpreting a text's condition requires an artistic intuition capable of transcending pseudo-scientific method. The latter position triumphed in print criticism, due in part to broader developments in literary studies, and inadvertently contributed to the decline of all forms of quantitative literary analysis from the linguistic to the metrical. Such work did not disappear completely, of course, but the shift in critical emphasis is obvious if one compares recent editions of medieval poems to those produced in late nineteenth-century Germany.

Despite the extensive reconsideration of editorial method prompted by the advent of digital humanism, however, few equivalent studies of how the electronic medium might redefine this arguably more fundamental aspect of print culture have yet taken place. Such an oversight is unfortunate because the structural analysis and quantitative method encouraged by hypertext encoding, especially as defined in the TEI recommendations, have serious implications for the future of electronic scholarship. It is possible, for instance, that the hesitation among literary academics to utilize digital resources stems to some degree from either reluctance or conceptual inability to pursue the types of investigations for which the new medium is ideally suited. Even more importantly, though, failure to acknowledge and exploit the affinity between electronic markup and the more scientific methods advocated by earlier textual scholars obscures those advantages along with the continuities between print and digital humanism. As demonstrated by the detailed linguistic descriptions offered in the Piers Plowman Electronic Archive's facsimile editions and my experiments with metrical analysis in the Morte Arthure, it is in the area of quantitative analysis that hypertext will improve most markedly on the printed codex. Practitioners of digital scholarship should therefore be prepared to defend their works as improved applications of an empirical view of textuality that is just as legitimate and pedigreed as interpretations based on artistic intuition.

3.4.3. Markup as Theory of Text

Klemens Bobenhausen

Hans Walter Gabler

To conceive of markup as a theorizable approach to text must imply a differentiated conceptualizing of 'text'. Does the term refer to the 'work', or to the 'expression'? (this distinction in turn derives from 'Functional Requirements for Bibliographic Records – Final Report.') Considering that libraries have always been the administrators of the material media (books, manuscripts), they have, by way of the considerable influence that their segmentations have exerted, also always indirectly contributed to structuring text theory. Other disciplines in the humanities and sciences may vary widely in the ways they articulate their concepts of 'text'. Yet all their texts are commonly library-administered. Consequently, the theorizable classifications of library science may be expected to influence the text theories of the disciplines.

In terms of library science and its specific problems, latest-generation markup must on all accounts be technically capable of projecting onto a 'work' all 'expressions' of that 'work'. Future library catalogues should therefore no longer be confined to identifying media ('manifestations'), but be enabled to relate the contents of the 'manifestations', i.e., the 'expressions', to the given 'work'. Yet technical capability on the infrastructural level of data processing will only be reachable in consequence of an intellectual solution of the markup problematics that are ultimately theoretical in nature.

Markup describes and differentiates the attributes of 'works', 'expressions', and 'manifestations'. The complexities into which this assumption may lead will be exemplified in terms of poems as insets in novels, as collections, as cycles, or as anthologized, respectively. Anyone attempting, on the basis of such variegated material appearance, to categorize and thus to mark up each entity correctly and logically coherently, enters as deeply into the materials as he or she does into theoretical assumptions about texts.

This is going to be a main line of argument of our presentation, which in its turn however will also emphasize a necessary correlation of such a library science approach to critical, text-critical, and editorial practices and their theoretical implications. In German scholarly editing, in particular, markup avant-la-lettre has been practised for over half a century, especially in relation to draft manuscripts. The need to mark up drafts so as to differentiate their genetic layering might, in terms of theory, be seen as complementary to the problem of library science to establish the proper relation of 'works', 'expressions', and 'manifestations.'

3.5. TEI Education

Convenor: Werner Wegstein

3.5.1. TEI for teachers of TEI

Lou Burnard and Sebastian Rahtz (Oxford University Computing Services)

3.5.2. TEI for the classroom

Werner Wegstein (Universität Würzburg)

In October 1990, after four years of work, Nancy Ide stated in her preface to the TEI P1 Guidelines (Draft Version 1.1, p. VII) under the headline 'The Next Stage': 'Standards cannot be imposed: they must be accepted by the community.' Now, twenty years later, aiming at version P5 of the TEI Guidelines, this statement is still valid, and not only for the humanities research community. It is equally valid with respect to training and education in text encoding along the TEI guidelines. Apart from the invitation on the TEI website, 'Teach Yourself TEI', and the support listed there, teaching TEI at universities has been going on for quite some time with a low degree of visibility, as reflected in the subject lines of the TEI mailing list. Searching the TEI-list archive, with more than 10000 entries between January 1990 and June 2007, returned 21 matches for 'teaching', 17 for 'training' and 4 for 'education'. Teaching TEI at universities seems a promising way to broaden the user base, but you will first have to convince the students that the learning outcome of being able to handle electronic texts with philological diligence is worth the effort.

This paper will report on experience with teaching TEI (P3 and P4) to postgraduate humanities students with very different philological backgrounds and degrees of computer literacy, in different MA programmes and teaching scenarios at the universities of Würzburg (listed under 'Aufbaustudiengänge') and Exeter (with the special extension of a Joint Degree), using very low-tech classrooms (blackboard and chalk), high-tech computer labs, video-conferencing systems, or these in combination. It will also aim to draw some conclusions for the future teaching of TEI P5 at university level.

3.5.3. Distance Education for the TEI: the example of the Electronic Ælfric project

Dot Porter (University of Kentucky)

The Electronic Ælfric Project, directed by Aaron Kleist at Biola University and housed at the Collaboratory for Research in Computing for Humanities at the University of Kentucky, is using the TEI to build an image-based edition of a selection of Ælfric of Eynsham's First Series of Old English homilies, texts that address such issues as the distinction between spirit, mind, and will; sexuality and martyrdom for laypersons and monks; and the relationship between human merit and divine election. The Electronic Ælfric will examine a crucial set of eight homilies for the period from Easter to Pentecost, tracing their development through six phases of authorial revision and then through nearly 200 years of transmission following Ælfric's death: twenty-four sets of readings or strands of textual tradition found in twenty-eight manuscripts produced in at least five scriptoria between 990 and 1200. These eight homilies are of particular importance, for they alone appear to have been reproduced through all six phases of production.

For the first phase of this project, the participants are using the TEI core, plus the 'transcription' module, to build an 'initial encoding' of each version of each homily. The next phase will involve expanding this initial encoding to include pointers from text elements to the relevant portions of the image files, and the final phase will involve building all the variant versions into a single, cross-readable and cross-searchable edition. The first phase has involved the hands-on work of the project director, four encoding editors (specialists in Old English language), the project manager, and two undergraduate students. The director and one of the encoding editors have some experience with XML encoding, but the other three encoding editors are 'newbies' to TEI and XML in general. This means that a major part of phase one has focused on the training of all project members, especially those for whom TEI is entirely new. This effort has been somewhat complicated by the circumstances: the team is distributed throughout the US and Europe, so a 'classroom approach' is not realistic. Instead, the project manager has focused on wiki tutorials, email communication, and weekly 'code checks' to ensure that team members are learning the basics and encoding the material correctly.

In this presentation, the project manager will describe the 'distributed TEI education' and how it is working for the Electronic Ælfric project. She will show the wiki tutorials and discuss the methods and importance of communication in the success of this project.

3.6. TEI applications

Convenor: Ray Siemens

3.6.1. Rethinking the TEI Table Model

David Birnbaum

The dominant XML table models, as reflected in TEI and other standards (such as HTML and OASIS CALS) are presentationally oriented and do not reflect the meaning of the table. Semantically, tables are complex data structures that (at a minimum) associate a two-dimensional matrix (row and column position) with a value (cell content). Nothing inherent in most table data requires that one dimension be forever expressed in rows and the other in columns, but the models in question preassign one dimension to rows and the other to columns, as if one orientation were inherently more natural or correct than another. For years, some have argued that an information-based table model would overcome these limitations, but the early discussions were before XSLT, nearly-ubiquitous presentational transformations, and spreadsheets that 'save as XML'. Is it time to reopen the table-as-data debate? The author presents a table model that bears a close relation to spreadsheet markup models, but that is sensitive to the needs of XML authoring.
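The table-as-data idea can be illustrated with a minimal sketch. It is not the model proposed in the paper: the cell records, the dimension names (`city`, `year`) and the function `render_grid` are all hypothetical. Each cell carries its position in both dimensions, and the choice of which dimension runs down the rows is made only at presentation time.

```python
def render_grid(cells, row_dim, col_dim):
    """Lay out dimension-tagged cells as a grid: row_dim down, col_dim across."""
    # dict.fromkeys keeps first-seen order while removing duplicates.
    rows = list(dict.fromkeys(cell[row_dim] for cell in cells))
    cols = list(dict.fromkeys(cell[col_dim] for cell in cells))
    lookup = {(cell[row_dim], cell[col_dim]): cell["value"] for cell in cells}
    grid = [[""] + cols]  # header row
    for row in rows:
        grid.append([row] + [lookup.get((row, col), "") for col in cols])
    return grid


# Hypothetical data: neither dimension is inherently "the rows".
CELLS = [
    {"city": "Oslo", "year": "2006", "value": "12"},
    {"city": "Oslo", "year": "2007", "value": "15"},
    {"city": "Rome", "year": "2006", "value": "30"},
    {"city": "Rome", "year": "2007", "value": "28"},
]

by_city = render_grid(CELLS, "city", "year")
by_year = render_grid(CELLS, "year", "city")
```

The same cell data yields either orientation, which is the property that a row-and-column markup model fixes in advance.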

3.6.2. More than tags: a report from the Markup Analysis Project

Paul Scifleet (Information Policy & Practice Research Group, University of Sydney)

This paper will report back to members of the TEI community on the findings of the Markup Analysis Project, a study commenced at Sydney University in December 2006 to investigate the use of the Text Encoding Initiative (TEI) guidelines in scholarly communication and electronic publishing.

Between January and May 2007 over thirty institutional and individual TEI practitioners from thirteen different countries participated in the Markup Analysis Project. Over 220 MB of digital texts (TEI encoded files) were submitted for analysis. The texts contributed are remarkable. The diversity of organizations, geography, history, languages and culture represented offers a valuable insight into the significance of the Text Encoding Initiative in making these works accessible.

Markup Analysis Project background

Although there is a large and growing number of TEI practitioners involved in e-publishing and scholarly communication projects, to date very few studies have focussed on exploring and understanding electronic text encoding as an experience shared across many projects. While there have been many case studies and individual observations published by practitioners that give voice to the success and challenges of digital projects, it has been difficult to bring the lessons learnt forward to the development of theory and shared methodologies for understanding, adopting and implementing markup language vocabularies.

To address this the Markup Analysis Project posited a document centred view of electronic text encoding that would investigate the digital content itself, the units of content that populate information systems, along with an investigation of aspects of documentary practice that might be influencing design and development.

In essence the study proposed a markup usage study. While the availability of software capable of parsing and interpreting encoded documents provides the opportunity to process a large amount of quantitative data quickly, so far, studies (of HTML encoding) that have taken this approach have had only limited success in explaining the phenomena they witness. To understand markup usage, studies need to move beyond 'counting tags' and towards an understanding of how the (human) encoders of digital documents relate to this process of classifying and representing texts.

To achieve this objective the Markup Analysis Project combined a survey of practitioners (self administered questionnaire) that explored the community and organizational context of electronic text production with an automated analysis of the markup in the texts they submitted. Information visualisation software was developed in-house to allow us to report on the markup used. The survey was designed to bring sense to the scale and type of interactions that are taking place across projects and, with that objective in mind, participant interviews were included to allow us to verify and expand on findings from the first two components of the study.
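As a rough illustration of the automated part of such an analysis, the following sketch counts element usage in a single TEI document. It is not the project's in-house software; the sample document and the function names are made up for the example, and only the Python standard library is assumed.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def element_usage(xml_text):
    """Return a Counter mapping element names to their number of occurrences."""
    root = ET.fromstring(xml_text)
    return Counter(local_name(el.tag) for el in root.iter())


# Hypothetical miniature document, standing in for a submitted TEI file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head>One</head><p>First <name>Ada</name>.</p><p>Second.</p></div>
</body></text></TEI>"""

usage = element_usage(SAMPLE)
```

Counts of this kind are the 'counting tags' baseline; the survey and interviews described above are what relate such figures to encoders' practice.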


By addressing the question 'How are markup languages being used in practice?', this study offers the opportunity to contribute to understanding the representation of text in digital form. Through the development of analytical software, the research is making inroads towards the development of new applications for content analysis in the field of scholarly communication and electronic publishing. It is envisioned that over time, the development of analytical methods for markup analysis will contribute to:

  • managing changing standards (e.g. identifying redundant and changing elements within documents);
  • comparing, synchronising and merging different information sets;
  • educating and training users involved in the design and development of markup based systems; and
  • supporting research activities specific to organizational content or academic enquiry.

Preliminary findings from the study will be presented.

3.6.3. TEI annotations for parallel text comparison

Peter Boot (Huygens Institute, The Hague, The Netherlands)

There is a frequent need for comparison between two or more parallel texts. Texts are translated, adapted for another age or another audience, or reworked for some other reason. In situations like this, a variant apparatus and electronic collation tools will be of little help, as at the word level the texts may have very little in common. In the Emblem Project Utrecht we encounter this situation within books (when multiple language versions of a text are given with a single illustration), between books in the same language (when new texts are written for older illustrations) and between books in different languages. In the last case, translation is often more than creating an equivalent in another language: when, for example, a book is translated from Latin into the vernacular, this typically brings with it accommodations to a lower standard of learning.

The paper will propose a TEI customisation, created with Roma, for doing annotation on parallel text structures. It continues work on the EDITOR annotation tool and SANE annotation exchange mechanism. It will assume a situation where one or more TEI-documents contain the parallel texts that are to be annotated; the annotations will be stored in a new document. To handle the situation where one or more of the text versions are not available as a TEI document (perhaps existing only as a collection of digital images or just on paper), I will introduce the notion of a 'TEI proxy document': a document that contains enough structural aspects of the text to provide a basis for annotations to be attached to.

The proposed customisation will describe the structures needed to (1) point to the text fragments that will be compared and the texts that contain them (e.g. the poems that will be compared to their versions in other languages), (2) define the pairing between text fragments that will be the basis for further study, (3) define the annotation types that will be used to annotate the linked text fragments, and (4) record the actual annotations. The annotation types will be defined by the individual researcher, as we assume they will be specific to the subject of their research. They will consist of multiple fields with their own data types. They will be implemented using the feature structure mechanism. The annotation types that the researcher creates will be stored as feature declarations. When annotating a text fragment or fragment link, the system will present the researcher with the appropriate features and possibly the values that they allow. The values that the user enters will then be stored as feature structures.
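A minimal sketch of how a researcher-defined annotation type might be checked and stored as a TEI feature structure follows. The annotation type `translationShift`, its fields, and the function `make_fs` are hypothetical and not taken from the proposed customisation; only the TEI element names `fs`, `f`, `symbol` and `string` are standard. The Python dictionary stands in for what the abstract calls feature declarations.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation type: feature name -> allowed values (None = free text).
ANNOTATION_TYPES = {
    "translationShift": {
        "kind": {"omission", "addition", "simplification"},
        "comment": None,
    }
}


def make_fs(type_name, values):
    """Validate values against the declared type and build an <fs> element."""
    declared = ANNOTATION_TYPES[type_name]
    for name, value in values.items():
        if name not in declared:
            raise ValueError(f"undeclared feature: {name!r}")
        allowed = declared[name]
        if allowed is not None and value not in allowed:
            raise ValueError(f"value {value!r} not allowed for {name!r}")
    fs = ET.Element("fs", {"type": type_name})
    for name, value in values.items():
        f = ET.SubElement(fs, "f", {"name": name})
        if declared[name] is None:
            ET.SubElement(f, "string").text = value  # free-text field
        else:
            ET.SubElement(f, "symbol", {"value": value})  # closed value list
    return fs


annotation = make_fs(
    "translationShift",
    {"kind": "simplification", "comment": "Classical allusion dropped."},
)
serialized = ET.tostring(annotation, encoding="unicode")
```

Keeping the declaration separate from the stored values is what would let a tool offer the researcher only the features and values that their own annotation type allows.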

The customisation will be shown in a working prototype of what a parallel text annotation tool might look like. I will show this tool at work in a comparison of two books from the Emblem Project Utrecht.

4. Posters

4.1. MONK and the TEI: Visualizing Thousands of Patterns in TEI-Encoded Texts

Tanya Clement, Catherine Plaisant (U of Maryland)

The MONK (Metadata Offer New Knowledge) Project builds on work done in two separate projects funded by the Andrew W. Mellon Foundation: WordHoard, at Northwestern University, and Nora, with participants at the University of Illinois, the National Center for Supercomputing Applications, the University of Maryland, the University of Georgia, the University of Nebraska, the University of Virginia, and the University of Alberta. The two projects share the basic assumption that the scholarly use of digital texts must progress beyond treating them as book surrogates and move towards the exploration of the potential that emerges when you put many texts in a single environment that allows a variety of analytical routines to be executed across some or all of them. A main focus for the project has been to leverage visualization tools for their ability to represent a large amount of data on a single screen. For example, analyzing a text like The Making of Americans by Gertrude Stein has already provided rich opportunities for thinking about both tool development and processes of literary analysis. Initial analyses of the text using the Data to Knowledge (D2K) application environment for data mining have yielded clusters based on the existence of frequent patterns. In terms of TEI-encoded texts, any element, such as <head>, <name>, or <condition>, could be considered a feature. With Making, the features used were phrases (i.e. ngrams). Executing a frequent pattern analysis algorithm produced a list of patterns of the phrases co-occurring frequently in paragraphs. This frequent pattern analysis generated thousands of patterns because each slight variation generated a new pattern; another algorithm was applied that clustered these frequent patterns. Although the analytics that were used are sophisticated, the results were not presented in a manner that makes them easy for users to understand.
For example, a cluster with 85 frequent patterns was presented as a list with each item divorced from its context in the text. Yet this presentation is unsatisfactory, since features of a text, such as TEI metadata or phrasal repetition, are always contingent on context: on what comes before and after any particular string of letters, and on what those strings (or words) mean in the context of a sentence, a paragraph, and the larger narrative or project. While text mining allows the literary scholar to "see" the text "differently" by facilitating analytical approaches that chart features like TEI metadata or phrasal repetition across thousands of paragraphs, visualizations that empower her to tweak the results, focus her search, and ultimately (re)discover "something that is interesting" are essential to using any text mining process. This poster will present MONK's current practices in visualizing patterns mined from thousands of TEI-encoded XML documents.
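The idea of phrases co-occurring frequently in paragraphs can be sketched as follows. This is a deliberately naive illustration, not the algorithm used in D2K: it only counts pairs of n-grams that appear together in at least `min_support` paragraphs, and the sample paragraphs are invented for the example rather than quoted from Stein.

```python
from collections import Counter
from itertools import combinations


def ngrams(text, n):
    """Return the set of word n-grams in a paragraph (lower-cased)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def frequent_pairs(paragraphs, n=2, min_support=2):
    """Find pairs of n-grams that co-occur in at least min_support paragraphs."""
    support = Counter()
    for paragraph in paragraphs:
        grams = sorted(ngrams(paragraph, n))  # sorted so each pair has one form
        support.update(combinations(grams, 2))
    return {pair: count for pair, count in support.items() if count >= min_support}


# Invented toy paragraphs with heavy phrasal repetition.
PARAGRAPHS = [
    "any one is one being living",
    "some one is one being loving",
    "any one was then living",
]

patterns = frequent_pairs(PARAGRAPHS)
```

Even on three short paragraphs the number of candidate pairs grows quickly, which hints at why real runs produce thousands of patterns that then need clustering and visualization.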

4.2. The Conversion of St Paul: A TEI P5 Edition

Dr. James Cummings (Oxford Text Archive, University of Oxford)

I have been working on a scholarly edition of The Conversion of St Paul, a late-medieval play found only in Bodleian MS Digby 133. This includes a transcription of the manuscript, with introduction, scholarly apparatus and notes, delivered alongside images of the manuscript publicly available (under a restrictive license) from the Early Manuscripts at Oxford University site. In addition the site includes generated content such as word indices and concordances which interoperate with other freely available resources such as the Middle English Dictionary. As a dynamically displayed edition, it will allow users to switch on and off certain presentational and scholarly features as desired. Moreover, the edition explores the use of a number of text encoding features introduced in TEI P5, such as the <choice> element, and the possibilities of stand-off markup for repetitive <choice> structures and linguistic regularisation. All of the underlying files are written in TEI P5 XML, with a customised TEI P5 ODD file used to generate a schema to validate against as well as to provide additional documentation concerning the nature of the text encoding schema used. It is hoped that those reading the poster will be made aware of how useful and straightforward the generation of the related additional content is with TEI P5, and of the benefits for this sort of edition.
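How a display might switch between original and regularised readings encoded with <choice> can be sketched as below. This is not the edition's own code: the function `render_text` is hypothetical, the sample line is made up rather than transcribed from the play, and namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def render_text(elem, prefer="reg"):
    """Flatten an element to text, taking the preferred child of each <choice>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            chosen = child.find(prefer)
            if chosen is None and len(child) > 0:
                chosen = child[0]  # fall back to the first alternative
            if chosen is not None:
                parts.append(render_text(chosen, prefer))
        else:
            parts.append(render_text(child, prefer))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)


# Made-up line for illustration only.
LINE = ET.fromstring(
    "<l>Thys <choice><orig>knyght</orig><reg>knight</reg></choice> rydeth</l>"
)

regularised = render_text(LINE, prefer="reg")
original = render_text(LINE, prefer="orig")
```

Because both readings live in the same file, the switch is a rendering decision and needs no second transcription.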

4.3. Encoding Serials with the TEI: The Independent Header and Interoperability

Michelle Dalmau and Melanie Schlosser (Digital Projects & Usability Librarian, Indiana University Digital Library Program)

The Indiana University Digital Library Program received a Library Services and Technology Act (LSTA) grant to digitize and encode a 102-year run of a scholarly journal known as The Indiana Magazine of History (IMH). The journal features historical articles, critical essays, research notes, annotated primary documents, reviews, and notices. We decided to encode at the issue level in order to maintain the conceptual integrity of the print journal. In order to represent the richness of the content, however, we needed a way to capture 'article-level' metadata. The Text Encoding Initiative (TEI) guidelines and community of practice offered a number of potential methods for representing article-level bibliographic metadata, including TEI Corpus, Metadata Object Description Schema (MODS), and article-level TEI documents that link to the parent issue. After exploring these and other options, the Independent Header eventually emerged as the best way to encode a complex serial. The auxiliary schema for the Independent Header (IHS) was developed to allow the exchange of bibliographic metadata for text collections to support the creation of indices and other aggregations. The creators of the schema did not envision the use of the Independent Header in serials encoding, but this method has a number of advantages. Utilizing the IHS in this fashion allows the TEI to function as the authoritative metadata source for the document, and allows the encoder to faithfully represent the issue-based structure of the original without compromising the unique identity of each article. It also supports our larger goal of interoperability with other text collections. Since the IHS is part of the TEI standard (P4 and earlier), the encoder does not have to extend or modify the DTD. This not only simplifies documentation needs for management and preservation, but also allows for easier reuse of content and integration with other collections. 
The poster will provide an overview of the challenges we faced in determining the best encoding approach for The Indiana Magazine of History, with particular emphasis on our use of the Independent Header. We will explore the advantages and disadvantages of the IHS and consider alternatives to it in light of P5 and our current digital library infrastructure. We will present survey findings and testimonials of how others have used or are using the IHS, and how current usage informs our own practices of the TEI standard. Lastly, we will illustrate how the digital library mantra of interoperability is upheld by the TEI-encoded journal, which serves as an authoritative representational source from which we can automatically derive Metadata Encoding and Transmission Standard (METS) documents and other standard bibliographic and textual representations that underlie discovery and page-turning functionality for the online version of the Indiana Magazine of History.

4.4. The Manuscriptorium system

Jindřich Marek (Dept. of Manuscripts and Early Printed Books National Library of the Czech Republic)

Manuscriptorium is a system for collecting and making accessible on the internet information on historical book resources, linked to a virtual library of digitised documents. The Manuscriptorium service is financed by the National Library of the Czech Republic and managed by AiP Beroun s.r.o. Manuscriptorium now contains 2165 full digital copies of whole manuscripts (or prints) and 88725 evidence records. We have set up a site demonstrating the system's functionality.

The poster will focus on:

  1. The inner functionality of the system (a schema, information about the TEI standards used).
  2. The workflow for manuscripts (or prints) processing (a schema).
  3. The presentation of the system (in the notebook).

4.5. Outsourcing Complex Digitization: Lessons Learned

John Carlson, Mary Ann Lugo, and David Sewell (University of Virginia Press)

The University of Virginia Press's ROTUNDA imprint for digital scholarship has completed one large digitization of a printed documentary edition (The Papers of George Washington, 52 volumes), and is in the midst of a second (The Papers of Thomas Jefferson, 33 volumes). The editions are being converted to P5 TEI-XML with minor schema customization.

As our goal for The Papers of George Washington Digital Edition was to produce a richly hypertextual online edition, our markup specifications went beyond basic structural tagging to include many features of what we called 'second-level tagging', i.e., tagging for cross-references, expansions of abbreviations and short titles, and several types of metadata. We outsourced all document rekeying and tagging to a data conversion vendor, and our expectation was that document transcription and tagging would all reflect a similar high level of accuracy.

Our expectations turned out to be overoptimistic. Transcription accuracy was generally excellent, but with some troubling patterns of global errors. Certain types of metadata tagging (e.g., dates) were handled quite well; others (e.g., names) were a source of many errors. Correct tagging of document cross-references proved to be much trickier than we had thought. We learned that we had had inflated expectations of what an outsource bureau can accomplish without the sort of specialized training that documentary editors have, and that we had made mistakes in our division of tasks between in-house and outsourced effort. Luckily, the process of checking and correcting errors also taught us a lot about how to combine human and programmatic checking of data to clean up many of the first-pass errors.
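One kind of programmatic check that can complement human proofreading is sketched here: listing cross-references whose targets do not exist in the document. It is an illustration only, not the checks actually used for PGWDE; the sample markup and the function `dangling_refs` are hypothetical, and it assumes references of the form `<ref target="#id">` pointing at `xml:id` values in the same file.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def dangling_refs(xml_text):
    """Return the targets of local <ref> pointers that match no xml:id."""
    root = ET.fromstring(xml_text)
    ids = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}
    missing = []
    for ref in root.iter("ref"):
        target = ref.get("target", "")
        if target.startswith("#") and target[1:] not in ids:
            missing.append(target)
    return missing


# Hypothetical sample with one good and one broken cross-reference.
SAMPLE = """<text>
  <div xml:id="doc1"><p>See <ref target="#doc2">the next letter</ref>
  and <ref target="#doc9">a missing one</ref>.</p></div>
  <div xml:id="doc2"><p>Reply.</p></div>
</text>"""

broken = dangling_refs(SAMPLE)
```

A check like this catches pointers that cannot resolve; whether a resolvable pointer leads to the right document still needs an editor's eye.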

In this poster session we will draw upon our experience from working with PGWDE and other digitization projects to share our findings in several areas:

4.6. The Digital Edition of Statuta Comunis of Vicenza

Viviana Salardi (Verona University (Italy)) and Luigi Siciliano (Florence University (Italy))

Statuta Comunis are collections of civic rules very common in Northern Italy since the 12th century. We are working on a brand new digital edition of the Statuta of Vicenza (1264), a town near Venice. With the purpose of rectifying mistakes and misinterpretations of the first printed edition, published in the nineteenth century (Statuti del comune di Vicenza. MCCLXIV, edited by F. Lampertico, Venezia, Deputazione di storia patria, 1886), we also aim at rendering the inherent fluidity of the text.

The most relevant matters to deal with are: first of all, to emphasize the particular structure of this kind of text; secondly, to mark the additions and amendments of the manuscript; and, finally, to connect the different editions of the same text.

The choice of XML, and specifically of the TEI P5 guidelines, allowed us to develop a hierarchical data model for the structure, to promote comparison of the continuously revised versions of the first text and to include a detailed description of the unique codex. We tried to be as compliant as possible with the chosen standard, and we used Roma to select the tagset we needed, to slightly customize our RELAX NG schema and to produce related documentation according to the ODD model. Moreover, we were able to implement our work by planning gradual steps, thus limiting the problems of meagre funds, and by using on-line examples and ready-to-use materials to improve our skills faster.

Progress of the work can be checked on our Web site, where, besides the main points of our project and a list of the tags and attributes used, we have currently published a beta release of the first book. The HTML code was obtained by applying to the XML-encoded text a customization of Sebastian Rahtz's open-source XSLT stylesheets and a tableless CSS layout. Further books will be progressively published together with their fully available XML encoding and any other related materials (XSLT, CSS etc.).

Following this 'step by step' policy, we would like to provide different layouts according to scholars' needs and add semantic value to the text by providing a comprehensive markup of names and locations.

4.7. The Versioning Machine 3.1: Developing Tools for TEI

Susan Schreibman, Ann Hanlon, Sean Daugherty, Anthony Ross (University of Maryland)

The Versioning Machine is an open-source software tool for displaying and comparing multiple versions of texts. Designed by a team of programmers, designers, and literary scholars, the Versioning Machine has progressed through several versions of its own. The most recent iteration of the Versioning Machine was accompanied by a user / usability study, designed to collect data on its navigability and ease of use, as well as the utility of the Versioning Machine for the scholars and practitioners to whom it is directed. This poster will summarize the findings of that study, including its implications for tools for the digital humanities in general, in addition to revisions to the Versioning Machine itself.

4.8. Usage of TEI P5 for data interchange between projects: the Anglo-Saxon charters case study

Arianna Ciula and Elena Pierazzo (King's College London)

This poster will show some experimental work on the interchange of data between two Anglo-Saxon projects carried out at the Centre for Computing in the Humanities, King's College London.

The current project LangScape (The Language of Landscape: Reading the Anglo-Saxon Countryside) and the completed pilot project ASChart (Anglo-Saxon Charters) use TEI XML to represent the Anglo-Saxon Charters' diplomatic discourse and description of the English countryside, respectively. Although guided by different scholarly aims and perspectives, both projects focus on the same medieval sources and use TEI XML to mark up their structural and semantic composition.

The TEI P5 guidelines have promoted as best practice the use of ODD (One Document Does it all) in creating and documenting XML mark-up schemas and, in doing so, provide effective support for data interchange between different projects. This poster will show how ODD can be used to make explicit the constraints behind the encoding schemas of LangScape and ASChart (converted to TEI P5 for this purpose) and, specifically, how the correct usage of namespaces can allow for clean and efficient sharing of data. In the ODD we can document both the use of project-defined namespaces and the inclusion of elements belonging to international standards such as the W3C XInclude scheme.
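The role of namespaces in keeping shared data separable can be illustrated with a small sketch. The project namespace URI `http://example.org/ns/project`, the element `proj:boundary` and the function `elements_by_namespace` are hypothetical and not taken from LangScape or ASChart; the TEI and XInclude namespace URIs are the standard ones. Note that the parser used here does not resolve the inclusion, it only reports the element.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "http://www.tei-c.org/ns/1.0"
XI_NS = "http://www.w3.org/2001/XInclude"
PROJECT_NS = "http://example.org/ns/project"  # hypothetical project namespace


def elements_by_namespace(xml_text):
    """Group the element names used in a document by their namespace URI."""
    groups = defaultdict(set)
    for el in ET.fromstring(xml_text).iter():
        if el.tag.startswith("{"):
            namespace, local = el.tag[1:].split("}", 1)
        else:
            namespace, local = "", el.tag
        groups[namespace].add(local)
    return dict(groups)


# Hypothetical document mixing TEI, XInclude and project-defined elements.
SAMPLE = f"""<TEI xmlns="{TEI_NS}" xmlns:xi="{XI_NS}" xmlns:proj="{PROJECT_NS}">
  <xi:include href="header.xml"/>
  <text><body><p><proj:boundary>sample clause</proj:boundary></p></body></text>
</TEI>"""

groups = elements_by_namespace(SAMPLE)
```

A receiving project can then process the TEI elements it understands and treat anything from another namespace explicitly, rather than guessing at unfamiliar element names.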

Although the focus of the poster will mainly be on the content of the boundary clauses in Anglo-Saxon charters, the proposed system of data interchange could be applied to the archival information related to the charters in the future. Indeed, the TEI header could potentially be shared by other projects that make the Anglo-Saxon charters the focus of their investigations.

4.9. Points of intersection between Library of Congress activities, TEI, and support for digital humanities studies

Caroline Arms (Library of Congress, Office of Strategic Initiatives)

The Library of Congress (LC) digitizes volumes from its historical collections, explores options for collecting content created today to ensure availability to future scholars, and maintains standards and vocabularies that can support digital humanities scholarship. This poster introduces a selection of such activities. LC maintains several XML-based standards used by libraries and archives to facilitate access to their holdings. Like TEI, these standards are adapting to support richer interoperability. Descriptive standards include MODS, developed as a schema for describing resources without the syntactical constraints of traditional MARC catalog records; MADS, a schema for terms in controlled vocabularies and for authoritative and alternate forms of names; and EAD, for describing archival collections and their contents according to archival principles and practices.

Current digitization activities for text include newspapers, through the National Digital Newspaper Program (a partnership with the National Endowment for the Humanities), and out-of-copyright works on American history. For large-scale projects, tradeoffs have to be made between detailed markup and what can be achieved by fully automated means. For collecting content created today, tradeoffs are also needed; articles marked up with TEI-like logical structure allow the textual content to be preserved but do not guarantee that the original look and context can be reproduced. Exploring the tradeoffs is an explicit element of current activities.

A new contribution to interoperability is the deployment of Library of Congress Control Numbers as persistent URL-based identifiers providing access to the source catalog record in several formats. A more experimental activity is merging and enhancing records for named individuals maintained by LC and the German national library to create a Virtual International Authority File (VIAF).

4.10. The Gaiji Module in Action

Aming Tu, Marcus Bingenheimer (Dharma Drum Buddhist College)

The poster demonstrates how the new Gaiji Module in TEI helps to provide an encoding solution for variant or rare CJK characters, commonly referred to as "gaiji 外字" ("characters outside") in Japanese, or "quezi 缺字" ("missing characters") in Chinese. The Bieyi za ahan 別譯雜阿含 Project at Dharma Drum Buddhist College is one of the first projects to fully implement the TEI Gaiji Module. We work with ancient Buddhist texts that contain a large number of CJK characters that are either not yet in Unicode or belong to the CJK-Extension B area, for which client-side font support cannot be expected. We use the Gaiji Module to encode and describe these characters in a standardized fashion. For the user, the glyphs are rendered via JavaScript and image files of the characters.
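The resolution step described above can be illustrated with a minimal sketch in Python. It is not the project's implementation (which the abstract says uses JavaScript); it only assumes the standard TEI P5 elements `<charDecl>`, `<char>`, `<mapping>`, `<graphic>` and `<g>`. The sample document, the identifiers `c1` and `c2`, the image path, and the `[img:...]` placeholder format are all hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: c1 has a standard Unicode mapping, c2 only an image.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><encodingDesc><charDecl>
    <char xml:id="c1">
      <charName>EXAMPLE CHARACTER ONE</charName>
      <mapping type="standard">字</mapping>
    </char>
    <char xml:id="c2">
      <charName>EXAMPLE CHARACTER TWO</charName>
      <graphic url="img/c2.png"/>
    </char>
  </charDecl></encodingDesc></teiHeader>
  <text><body><p>A<g ref="#c1"/>B<g ref="#c2"/>C</p></body></text>
</TEI>"""


def tag(name):
    """Return the namespace-qualified tag name used by ElementTree."""
    return f"{{{TEI_NS}}}{name}"


def load_chars(root):
    """Collect each declared character's Unicode mapping and image, if any."""
    chars = {}
    for char in root.iter(tag("char")):
        mapping = next(
            (m for m in char.findall(tag("mapping")) if m.get("type") == "standard"),
            None,
        )
        graphic = char.find(tag("graphic"))
        chars[char.get(XML_ID)] = {
            "text": mapping.text if mapping is not None else None,
            "image": graphic.get("url") if graphic is not None else None,
        }
    return chars


def render_paragraph(p, chars):
    """Replace each <g> with its mapped character, or an image placeholder."""
    parts = [p.text or ""]
    for child in p:
        if child.tag == tag("g"):
            info = chars[child.get("ref").lstrip("#")]
            parts.append(info["text"] or f"[img:{info['image']}]")
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
paragraph = next(root.iter(tag("p")))
print(render_paragraph(paragraph, load_chars(root)))
```

A web front end would emit an `<img>` element where this sketch emits a text placeholder; the lookup from `<g ref>` to the character declaration is the same.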

The poster will also touch on the way to produce character files with the Adobe SING gaiji architecture – a proprietary standard that includes an XML metadata wrapper for the character file which we use in the print-publishing section. We will also point out what problems arose in the use of the Gaiji Module. Some of these are conceptual, such as the lack of a clear-cut difference between glyph and character, others practical, such as the need to provide descriptions for CJK-Ext B characters that in theory are not within the scope of the Gaiji Module.

4.11. A sanity checker for TEI schemas

Bernevig Ioan, Véronika Lux Pogodalla, Bertrand Gaiffe (Institut de l'Information Scientifique et Technique, CNRS; Analyse et Traitement Informatique de la Langue Française, CNRS)

This paper describes a new functionality added in a prototype version of Roma: the "sanity checker."

Roma provides an interface for users to write TEI customizations. Currently, however, users select the desired modules and then the desired TEI elements within those modules as if the elements were completely independent of each other, which is not at all the case: TEI elements are structured in several ways (modules, attribute classes, element classes) and there are many hierarchical relationships between elements. Roma does not make this explicit to users and indeed allows the specification of a customized TEI schema that cannot be "satisfied," i.e., one against which no document instance would validate. It can also lead to TEI schemas some parts of which will never be used.

We have analyzed the errors that can occur during TEI customization and have developed a checker for them. The checker was written in PHP so that it can be integrated easily into Roma as an add-on.
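The two kinds of error named above (schemas that cannot be satisfied, and parts that can never be used) can be illustrated with a small sketch. It is written in Python for brevity and is not the authors' PHP checker. The content model is a deliberately simplified, hypothetical one: each element lists groups of required children, at least one member of each group being needed, plus optional children. The element names only loosely echo TEI.

```python
# Hypothetical toy content model: "required" is a list of groups, and at
# least one element from each group must be available for the parent to
# be usable; "optional" children may be absent.
MODEL = {
    "text": {"required": [{"body"}], "optional": set()},
    "body": {"required": [{"p", "div"}], "optional": set()},
    "div": {"required": [{"p"}], "optional": {"head"}},
    "p": {"required": [], "optional": {"hi"}},
    "head": {"required": [], "optional": set()},
    "hi": {"required": [], "optional": set()},
}

# A customization that deletes <p> but keeps everything else.
KEPT = {"text", "body", "div", "head", "hi"}


def unsatisfiable(model, kept):
    """Return kept elements that no finite document instance could contain."""
    ok = set()
    changed = True
    while changed:
        changed = False
        for name in kept - ok:
            groups = model[name]["required"]
            # Usable once every required group has a usable member.
            if all(any(alt in ok for alt in group) for group in groups):
                ok.add(name)
                changed = True
    return kept - ok


def unused(model, kept, root):
    """Return kept elements that cannot be reached from the root element."""
    seen, todo = set(), [root]
    while todo:
        name = todo.pop()
        if name in seen or name not in kept:
            continue
        seen.add(name)
        entry = model[name]
        todo.extend(set(entry["optional"]).union(*entry["required"]))
    return kept - seen


print(sorted(unsatisfiable(MODEL, KEPT)))
print(sorted(unused(MODEL, KEPT, "text")))
```

In this toy case, deleting `p` makes `div`, `body` and `text` unsatisfiable, and leaves `hi` declared but unreachable. Real TEI content models are expressed through classes and RELAX NG patterns, so an actual checker has to resolve those first.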

4.12. From XML to XSL, jQuery, and the Display of TEI Documents

Stephanie A. Schlitz (Bloomsburg University)

Garrick S. Bodine (Penn State University)

Our current work with the TEI is focused on two main areas: 1) increasing the availability of Early Modern Icelandic resources for historical, literary, and linguistic research and 2) extending the range of web tools available for use with TEI P5-conformant projects. In this poster session, we address the latter of these two areas.

With reference to two of our in-progress projects, The Copenhagen Sagas and the Digital Hybrid, we demonstrate the tools we're currently developing to support the display of TEI documents. Specifically, we offer examples of our modifications to existing TEI XSL stylesheets, in particular modifications that support the representation of elements contained in the Manuscript Description Module of the Guidelines, as well as our use of the jQuery JavaScript library as a means of extending the functionality of XSL stylesheets.

4.13. Mapping Multi-Rooted Trees from a Sustainable Exchange Format to TEI Feature Structures

Andreas Witt, Georg Rehm, Timm Lehmberg, and Erhard Hinrichs (Tübingen University)

The Generalised Architecture for Sustainability of Linguistic Data (GENAU) was devised by the project "Sustainability of Linguistic Data," which is funded by the German Research Foundation (DFG). One of the main objectives of this preservation project is to process and to make sustainably available a wide range of heterogeneous linguistic corpora based on hierarchical and timeline-based XML data. A concrete goal of the project is to come up with one or more sustainable exchange formats that allow for the concurrent encoding of multiple annotation layers. To achieve this goal, an architecture for the representation, annotation, and exchange of linguistic corpora is needed.

The GENAU approach comprises the separation of individual annotation layers contained in the data (normally represented by XML-based data formats such as EXMARaLDA, PAULA and TUSNELDA) into multiple XML files, so that each file contains a single annotation layer only. A number of semi-automatic tools and XSLT stylesheets were developed to transform data that conforms to the above-mentioned markup languages into multiple XML files.

This poster is a follow-up to presentations by Thomas Schmidt (Hamburg University) and Andreas Witt (Tübingen University) at the TEI Annual Members' Meetings in Victoria and Sofia, which reported on work in progress with regard to the development of the data format and mappings to existing markup standards. Stimulated by the discussion at the workshop "Putting the TEI to the Test" (Berlin) in April 2007, we decided to support different TEI-based annotation schemes, especially the base tag set for the transcription of speech, the additional tag sets for linguistic analysis, and the additional tag set for feature structures. Our poster presents preliminary results of the work on an exchange format and, especially, of using the TEI tag set for the representation of feature structures.

4.14. A proposal for encoding stemmata codicum in XML

David J. Birnbaum (University of Pittsburgh)

This poster presents a model for encoding a stemma codicum (a graphic representation of textual transmission in a manuscript tradition) in XML. It is not TEI-specific, but it is easily mapped onto TEI graphing resources. It also incorporates a Schematron-based alternative to traditional id/idref validation, information about transforming a descriptive XML representation into a presentational SVG one (as well as presentations based on other standard graphic formats), and a discussion of how an XML-based stemma representation may be used not only for rendering, but also for analyzing patterns of variation.
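The kind of referential check that replaces id/idref validation can be sketched as follows. This is a Python stand-in for the rule, not the poster's Schematron and not its XML model: the element names `stemma`, `witness` and `derivation`, and the attributes `id`, `from` and `to`, are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stemma: two witnesses descend from a hypearchetype "alpha";
# the last edge points at a witness "C" that is never declared.
SAMPLE = """<stemma>
  <witness id="alpha"/>
  <witness id="A"/>
  <witness id="B"/>
  <derivation from="alpha" to="A"/>
  <derivation from="alpha" to="B"/>
  <derivation from="A" to="C"/>
</stemma>"""


def dangling_references(xml_text):
    """Return (attribute, value) pairs that point at undeclared witnesses."""
    root = ET.fromstring(xml_text)
    known = {w.get("id") for w in root.iter("witness")}
    problems = []
    for edge in root.iter("derivation"):
        for attr in ("from", "to"):
            if edge.get(attr) not in known:
                problems.append((attr, edge.get(attr)))
    return problems


print(dangling_references(SAMPLE))
```

Unlike DTD-level ID/IDREF checking, a rule written this way can be restricted to particular element types, so that an edge may only refer to a witness and not to any identified element in the document.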

4.15. TEI in Libraries: A Perspective from the Digital Library Federation

Barrie Howard (Digital Library Federation)

The Digital Library Federation (DLF) has been supporting the development of TEI in digital libraries since 1998, when it sponsored a workshop held at the Library of Congress. That event led to the publication of the "TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices" a year later. Now in version 2.1, the guidelines continue to be reviewed and revised through the efforts and expertise of the DLF TEI in Libraries Task Force.

This poster will plot a timeline of milestones in TEI history, highlighting DLF's contributions in context and providing some details. These will include improvements made in each version of the guidelines, and the composition of each task force responsible for authoring each version.

The DLF best encoding practices for digital libraries have been developed synergistically with other initiatives in the TEI community, such as the TEI in Libraries SIG and the EEBO-TCP DTD Working Group. DLF is enthused to be part of this community, and is committed to the sustainability of TEI in libraries in support of digital scholarship.

4.16. Digital edition of late medieval town statutes: Visualising the evolution of text and law

Malte Rehbein (University of Göttingen / National University of Ireland, Galway)

One important source for the study of urban life (economy, society, administration) all over Europe from the 12th century to (early) modern times is provided by municipal statutes.

This presentation is based on the analysis of "kundige bok 2," one of a series of administrative records of late medieval Göttingen. It contains statutes about the regulation of everyday life which were approved by the city council and read aloud to an assembly of the buren (citizens, fellow inhabitants). These so-called burspraken took place regularly to remind the people of the regulations and to provide them with up-to-date information.

But these statutes were anything but fixed and unchangeable. Whenever the council reacted to political, economic or social changes, the scribes had to modify the written statutes by either altering the existing text or adding completely new versions. What has come down to us is, thus, a multilayer text that is difficult to read and understand without deep investigation.

This contribution presents the digital edition of "kundige bok 2." It is based on TEI markup and provides:

4.17. A crosswalker to move documentary papyri into TEI XML

Gabriel Bodard (King's College London) and Hugh Cayless (University of North Carolina)

In August 2007 Duke University (in close collaboration with Universität Heidelberg, Columbia University, and King's College London) began work on a project funded by the Andrew W. Mellon Foundation to begin the migration of the Duke Databank of Documentary Papyri into EpiDoc XML (a TEI localization).

This work will involve the combination of three streams of data:

  1. The 60 000 Greek texts of the DDbDP, which were composed in Beta Code (a legacy system for recording Greek characters and structural elements using only symbols made up of ASCII characters), and are now wrapped in an ad hoc TEI-based SGML structure. These files need to be converted (entirely automatically) into (a) validating XML; (b) Unicode (normalization form D); and (c) modified EpiDoc XML. A sophisticated self-testing scheme involving XSLT, round-tripping of data, and Schematron validation will help to ensure that this process is as robust as possible.
  2. The metadata on these same papyri, currently held in a FileMaker Pro database at the Heidelberger Gesamtverzeichnis der griechischen Papyrusurkunden Ägyptens (HGV), will be matched to the texts and tagged in the XML files (although much of the data does not have a well-defined home in the teiHeader). Some investigation of the applicability of metadata schemes to this data will also be carried out. There is a minimal degree of many-to-one matching to be handled in this stream.
  3. Translations of some of the texts into both English and German, also from Heidelberg, that currently live in a home-grown XML schema, will also be matched to these files and converted to EpiDoc-style TEI XML.
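The first stream combines a Beta Code to Unicode (NFD) conversion with a round-trip self-test. The sketch below illustrates that idea in Python. It is not the project's converter: it assumes lower-case Beta Code, covers only the basic letters and the common diacritic symbols, and ignores capitals, punctuation conventions and the structural markup. The function names are hypothetical.

```python
import unicodedata

# Basic Beta Code letters mapped to Greek lower-case letters.
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritic symbols mapped to Unicode combining marks.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def beta_to_unicode(beta):
    """Convert simplified lower-case Beta Code to NFD Unicode Greek."""
    beta = beta.lower()
    out = []
    for i, ch in enumerate(beta):
        # A sigma not followed by another letter is written as final sigma.
        if ch == "s" and (i + 1 == len(beta) or beta[i + 1] not in LETTERS):
            out.append("ς")
        elif ch in LETTERS:
            out.append(LETTERS[ch])
        elif ch in DIACRITICS:
            out.append(DIACRITICS[ch])
        else:
            out.append(ch)  # spaces and anything unrecognized pass through
    return unicodedata.normalize("NFD", "".join(out))


def unicode_to_beta(text):
    """Convert Unicode Greek back to the simplified Beta Code used above."""
    rev_letters = {v: k for k, v in LETTERS.items()}
    rev_letters["ς"] = "s"
    rev_marks = {v: k for k, v in DIACRITICS.items()}
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(rev_letters.get(c) or rev_marks.get(c) or c for c in decomposed)


def round_trips(beta):
    """Self-test: converting to Unicode and back must reproduce the input."""
    return unicode_to_beta(beta_to_unicode(beta)) == beta


print(beta_to_unicode("lo/gos"))
print(round_trips("lo/gos"))
```

Because NFD applies canonical reordering to combining marks, an input that places the iota subscript before an accent would not round-trip exactly; a self-test of this kind is one way such cases surface for manual review.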

The combination of these three data streams, whether united into single TEI XML files or as a bundle of linked data, will be delivered to Columbia University's Papyrological Navigator, which is also being developed as part of the Mellon grant. We hope that we will be able to take the project forward with further funding in the future to build editing tools and expand the metadata handling.

4.18. Assignments from Japanese Guideline Project: problems on localization of TEI guideline

Ohya Kazushi (Tsurumi University)

During the course of Japanese translation for the TEI I18N project, we have faced problems which cannot be solved by ourselves and might require consideration in the wider perspective of the TEI. The poster focuses on reports and suggestions about internationalization and localization support in TEI activities. The purposes of I18N and L10N are defined to accommodate multiple linguistic and cultural features respectively.

Japanese researchers have struggled with tagging linguistic features following the TEI Guidelines. Ruby (Ruby, 2001), for example, has been regarded as annotation with an explicit target (Hara and Yasunaga, 1997; Yabe, 2001). As a way to indicate annotation, the TEI Guideline provides an element <note> which points directly to a place or defines a range with attributes, but there is no system for specifying content for the link. However, by using the <seg> element with anonymous content (mixed-content, or once called pelement in HyTime) more actively, it would be possible to indicate the relationship, allowing ruby to be encoded within the present TEI system. As regards kunten (an auxiliary character used to read Literary Chinese as Japanese), this can be regarded as annotation to indicate a place assisting in reading or transcription (characters serialized with kunten can be recognized by a two-stack automaton, or Turing machine. I have implemented a toy program to accept kunten and produce well-ordered characters as Japanese – Ohya, 2008). It seems possible to encode this in TEI. For characters which go beyond Unicode, the situation remains uncertain, as I indicated at Sofia (Ohya, 2005). Accordingly, I could say that a policy to endeavor to remain within TEI strategies in multiple languages will work well with the TEI Guideline adapted for multiple language features.

Concerning L10N, there are two areas of observation: a set of transliterated targets and a way to handle semantics in the TEI Guideline. For transliterated targets, there are at least five types: a) element names (or GI), b) the gloss, c) examples, d) short descriptions in definitions, and e) long descriptions in the body of sections. Here I would like to suggest a new L10N selection rule letting localizers choose the targets. For example, in the Japanese case, we will translate c) and d) but not a) and c), and neglect or newly add b) depending on the situation. As for a way to handle semantics in the TEI Guideline, I would like to suggest that we create two reference sections for remarks and examples. These remarks refer not to original comments written in English but to supplementary comments written in other languages to explain the difference or background behind the story of explanation in the Guidelines. This has the benefit of keeping the Guideline simple, of making it easy to compare differences between languages, and of providing an easy way to add an explanation, which can avoid unnecessary change in Guidelines written in local languages (this might require us to make new schemas, interfaces, link structures, etc. Of course, there is another choice that localized guidelines are at liberty to change the content of the Guideline. However, I think this is a bad policy for a new research field which I will describe). The reference section for examples has the benefit of leaving original examples alone, of allowing new examples without any relationship to the original, and of making differences between languages or cultures easy to observe. This approach will not only enable the TEI Guideline to be usable in multiple cultures but also open a new field of research on a contrastive study of marked-up description.

4.19. TextGridLab and TEI baseline encoding: towards a community grid in the humanities

Werner Wegstein (University of Würzburg)

TextGrid, started late in 2005, is the name adopted by the humanities partners of D-Grid, an initiative funded for three years by the German Ministry of Education and Research (BMBF) to establish a grid infrastructure for research. Coordinated by the Göttingen State and University Library, five institutional partners (Technical University Darmstadt; Institut für Deutsche Sprache, Mannheim; University of Trier; University of Applied Sciences, Worms; University of Würzburg) and two small commercial companies (DAASI International, Tübingen, and Saphor, Tübingen) aim to create a Virtual Research Library, which entails a grid-enabled workbench that will process, analyse, annotate, edit, link and publish text data for academic research in an Open Source and Open Access environment using TEI markup.

TextGrid resources and functionality are encapsulated in an open and grid-enabled infrastructure, which is linked to D-Grid. TextGrid relies on Eclipse as integration platform and workbench. The TextGrid tools, which cater for a range of requirements, can therefore be plugged seamlessly into the TextGrid Eclipse environment, called TextGridLab, or be re-mixed in other contexts into various frameworks with minimal effort. They include interactive tools such as an XML editor, an image link editor and a workflow editor; streaming tools such as a lemmatiser and a tokeniser; and tools for converting data, for metadata annotation and for user management. A prototype of TextGridLab will be displayed at the conference.

The TextGrid middleware facilitates semantic search over the entire TextGrid data pool. The search module accesses metadata as well as normalized content data. A baseline encoding scheme enables text retrieval across independent projects by mapping different project encoding schemes of varying granularity onto a common TEI baseline encoding. This is a key innovation developed together with other humanities computing projects encoding different varieties of texts, such as dictionaries, drama, letters, critical editions or poetry.

The main objective of TextGrid is to develop a new kind of grid-based research infrastructure for collaborative textual processing in a very general sense. Since collaborative textual processing is not yet widespread in the humanities, TextGrid aims to take an active role in building a grid community in the humanities, trying to convince users that adopting the TextGrid facilities for collaborative textual processing will lay the foundations of a new active e-Humanities community and help add connectivity, precision and scope to their research. In terms of the Aspen paper on the culture of networking technology, TextGrid puts into practice a "'pull' approach, from a hierarchical center-out structure to a network-based decentralized architecture" ("When push comes to Pull...." A Report by David Bollier, Washington 2006, p. V).

4.20. The Treasury of World's Fair Art and Architecture

Patricia Kosco Cossard (University of Maryland)

The digital library Treasury of World's Fair Art and Architecture serves a number of research and teaching needs. Its purpose is to give interested learners direct contact with historical material and to let them experience "world's fair connoisseurship." It grew directly out of the curricular needs of the University of Maryland honors seminar "World's Fair: Social and Architectural History," whose learning outcome is to give undergraduate students with no design training an appreciation for the visual quality and cultural role of architecture. The first seminar offering in 2001 utilized a simple HTML website called "the online exhibit" that consisted of individual but linked HTML pages sampling images from the World’s Fair Collection in the Architecture Library. Originally, the final assignment was three encyclopedic essays for selected images from this virtual exhibit. Essays which met strict editorial guidelines of content and scholarly research standards were subsequently added to the virtual exhibit website with simple HTML links. Since these selected published essays serve as a textbook for the subsequent seminars, with each seminar new essays are added and older essays replaced. Because it was an open-access website, researchers worldwide began consulting the essays and the posted images. A steady flow of inquiries received by the Libraries revealed this resource to have great value; however, its rudimentary architecture made for inelegant navigation and intellectual discovery.

In 2006, the University of Maryland Libraries completely revised, redesigned, and expanded this resource, publishing A Treasury of World’s Fair Art & Architecture. It provides searchable access to art- and architecture-related images, virtual tours and contextual essays covering a broad number of world's fairs. Within a larger digital library framework on a Fedora platform, this resource integrates high-quality digital images, a curated gallery, traditional library catalog searching and a virtual textbook. The Treasury was conceived and developed collaboratively by staff in multiple UM Libraries units, including the Architecture Library, Digital Collections and Research, the Technical Services Division, and the Information Technology Division.

4.21. <oXygen/> 9

James Cummings and Sebastian Rahtz (SynchRO Soft)

Version 9 of the oXygen XML editor has added many new facilities of interest to TEI users. We will be demonstrating the beta release of the software, including: