The title of this essay is borrowed from Lou Burnard's classic "A Gentle Introduction to SGML." This is a must-read for anybody who wants a serious introduction to mark-up languages. It is, however, not quite so gentle as it says it is, and if your acquaintance with a computer is pretty much a matter of reading e-mail, using a browser, and writing with a word processor, some of it may be tough going. This essay assumes that you have done no more, but it tries to persuade you that as a doctoral student in the humanities you should know a little about "structured documents" and the markup language of the Text Encoding Initiative (TEI), which has been the standard protocol for the digital encoding of scholarly texts in the humanities. You may or may not end up using it in your work, but understanding its basic principles will help you make judgments about the reliability of the editions and texts you use. In your scholarly work, as in every other aspect of life, you have to trust that you can safely use many things that you don't understand. You need not, and indeed cannot, know everything, but you need to know enough to figure out whether trust is or is not in order. Knowing something about mark-up languages is an excellent way of developing appropriate trust in the electronic documents on which you will increasingly rely in your scholarly work.
You can find an excellent and more technical introduction to TEI than this document at http://www.tei-c.org/Guidelines/Customization/Lite/.
The full description of the markup language and its elements is found in the official Guidelines for Electronic Text Encoding and Interchange.
Computers still arouse peculiar forms of enthusiasm and frustration. Only fifteen years ago, they were largely the domain of highly technical personnel. Since then they have become a lot easier to use, but they have also become more opaque. The expectations and promises of user-friendliness have raced way ahead of performance. It is therefore always useful to remember that computers do not think, do not talk, and do not understand anything. They are machines that carry out instructions to the letter. They are not as transparent as books or bicycles, and few of us have had as much early training in them as we had with those marvels of human ingenuity, but we will save ourselves much frustration if we banish unreasonable expectations about using a powerful and complex tool without some understanding of how it does what it does.
Fortunately, for the purpose of dealing with mark-up languages you can consider most of what a computer does as a black box to be taken for granted and rely on some humble everyday skills: putting labels on boxes, putting stuff in boxes, putting the boxes into other boxes, and remembering always to put the same stuff in the same box and the same box into the appropriate kind of other box. Markup is all about content and containers, and it is useful to remember the fundamental simple-mindedness of the process, hairy as it may get in its details.
The TEI markup language is a form of SGML or Standard Generalized Markup Language. SGML has been around for almost thirty years. It was developed in the seventies by Charles Goldfarb at IBM, and it has become an international standard (ISO 8879). For most of its life it has lived a vigorous but largely obscure existence in the backrooms of large institutions, where it serves as a way of keeping track of complex technical documentation such as aircraft maintenance manuals or Navy procurement procedures. It became famous when HTML, a pidgin version of it cobbled together by Tim Berners-Lee at the CERN labs in Geneva, took the world by storm and within a few years became the lingua franca of the Internet.
When you use a word processor, hitting Control-i on a keyboard generates a code in your file that gives the command "Start using the italic font." When you hit Control-i again, you issue the command "Stop using the italic font." The control codes in a typesetting program work the same way, as does the following snippet of HTML code:
in Milton's <i>Paradise Lost</i>
where <i> marks a start tag and </i> marks an end tag. (You could also think of this as declarative code, but most people use it as an instruction.)
A computer program consisting of such code is a sequence of precise instructions in a particular order. Writers of computer books sometimes use the analogy of a cookbook and suggest that a computer program is like a recipe. This is actually a very misleading analogy, for the instructions in a cookbook presuppose the hovering presence of an intelligence capable of nuance and discretion ("season according to taste").
Now consider the code fragment:
in Milton's <title>Paradise Lost</title>
This piece of code is not written in the imperative but in the indicative mode. The tags declare that "the words between us are a title." This declaration carries no implication about what, if anything, is to be done about titles. On the other hand, if you want to do something with them (print them bold, extract a list, replace them with numbers), you will know where to find them.
This difference is the crucial characteristic of SGML. To mark up or "tag" a document in SGML or any of its variants means to identify its components according to a set of rules. This is sometimes referred to as "semantic," "logical," or "structural" markup, and it carries no implications about the formatting of a document. On the other hand, the "declarative" markup of the document can serve as the basis for formatting instructions, as in hypothetical code carrying out this instruction:
when you come across words between the tags <title> ... </title> put them in italics
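The instruction above can be sketched in a few lines of Python. This is only an illustration, not how any particular stylesheet processor works: the wrapping <p> element is an assumption added so that the fragment can be parsed, and the function names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the wrapping <p> is an assumption added so that
# the fragment has a single root element and can be parsed.
SAMPLE = "<p>in Milton's <title>Paradise Lost</title></p>"


def italicize_titles(xml_text: str) -> str:
    """Return the text with every <title> element turned into <i>."""
    root = ET.fromstring(xml_text)
    for elem in root.iter("title"):
        # The formatting decision is taken here, not in the document.
        elem.tag = "i"
    return ET.tostring(root, encoding="unicode")


def list_titles(xml_text: str) -> list[str]:
    """Collect the titles instead: a different use of the same markup."""
    root = ET.fromstring(xml_text)
    return ["".join(elem.itertext()) for elem in root.iter("title")]


print(italicize_titles(SAMPLE))
print(list_titles(SAMPLE))
```

The same declarative tag serves both functions: one turns titles into italics, the other extracts a list of them, and the document itself does not change.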
The separation of structural declaration and formatting instructions is at the heart of SGML. While it is a simple distinction to grasp logically, it runs against the grain of our writing habits in which structure and format are deeply intermingled. We are very familiar with writing sentences like
There are very few risqué passages in Paradise Lost
or in HTML
There are very few <i>risqué</i> passages in <i>Paradise Lost</i>
In writing sentences in this way, we depend on the reader to infer structure from formatting, and as writers we have been trained from an early age to use formatting to provide appropriate clues to the reader.
Appropriate clues need not be very precise. Italics could stand for a title, a foreign word, or plain emphasis, and the context is usually good enough to figure it out. Similarly, an indented passage usually means that it is a quotation, and so forth. The reader's tacit knowledge of such conventions sits outside the formatting instructions.
In structured markup that tacit knowledge is made explicit. Consider the following passage:
There are very few <foreign>risqué</foreign> passages in <title>Paradise Lost</title>
If I combine this text snippet of "declarative markup" with a "stylesheet" that gives the instruction to put in italics whatever words occur within <title> or <foreign> tags, I get this result:
There are very few risqué passages in Paradise Lost
We are back where we started, and you might well ask "Why not tell the computer to put this stuff in italics in the first place?" There are good answers to this question, but it is also a good question. Most people don't want to separate content from format: witness the fact that word processors to this day have remained formatting machines with no capabilities for structural markup.
It is not quite accurate to say that SGML is a markup language. More properly, it is a set of rules for making markup languages. An SGML conformant markup language follows certain rules about making and naming its containers and specifying what can go into them. The name for an SGML container is "element," and an SGML conformant language is basically a set of rules about such containers and their "content models."
The rules for a particular SGML markup language are called a "document type definition" or DTD. A DTD is itself a document that follows rules specified by SGML.
The strength and weakness of SGML derive from the same fact: you need a document type definition, which means that you have to think ahead. Writing in SGML or any of its variants involves a willingness to shoulder upfront investments for the sake of downstream benefits. Who wants to do that?
SGML would never have broken out of the limited terrain of technical documentation had it not been for Tim Berners-Lee, a young English physicist who in the late eighties as an employee of the CERN Labs in Geneva tinkered with a tool that would let scientists share their work across the Internet without having to worry about formatting their documents. He developed HTML, the hypertext markup language, which in its original version was an extremely simple and not entirely correct form of SGML.
HTML in its original form consisted of very few "elements" and had a very loose content model. Berners-Lee worked from the way in which writers work with keyboards to segment their writing, and the basic HTML tags, such as <p>, <h1>, <i>, <u>, and <li> for paragraphs, headlines, italics, underlining, and list items, show this clearly enough. The writer of an HTML document need not worry about the document type definition (DTD), because it is the same for every HTML document, and is in fact built into every browser, the software tool designed to display an HTML document to the reader in an easy-to-read format.
From one perspective, HTML observes the SGML dogma of separating declarative markup from formatting. Consider the following
<p>There are very few <i>risqué</i>
passages in <i>Paradise Lost</i>.</p><p>Critical
efforts to show the pervasive humor of the epic have also been relatively
unsuccessful.</p>
There are very few risqué
passages in Paradise Lost.
Critical efforts to show the pervasive humor of the epic have also been relatively unsuccessful.
In the first passage the text is put into paragraph and italics containers or elements. The second passage shows how the browser renders the passage after it identifies its elements.
On the other hand, the passage shows that the distinction between document structure and layout is not clearly observed by the HTML set of elements, which derive from the writerly practice of habitually blurring this distinction.
HTML was successful beyond anyone's wildest dreams for two reasons:
Like every other good tool, HTML was very quickly used for purposes that its inventor did not anticipate and that even worked against his original intention. The original deal of HTML was after all that you could write your document without worrying about formatting: the browser would take care of that. But web designers very quickly turned their ingenuity to controlling the layout on the reader's screen. The fate of the <table> element is particularly interesting and ironic in this regard. It was added early in the life of HTML as an additional element to enable the display of information in rows and columns, but because the loose content model of HTML lets you put just about any container in any other container, the table element became the basic layout tool for Web design: today an HTML page typically consists of a number of tables in which table cells of different dimensions are the basic layout units.
One consequence of the proliferation of layout options in the successive versions of HTML is that, while its markup follows the letter of SGML (sort of), HTML always lives in sin because it constantly violates the cardinal rule of separating information from the mode of its display.
While it may be the case that the SGML dogma of separating information from its display is ultimately untenable, it is for many practical purposes important to do so. In particular, if you want to use the Internet to move stuff in and out of databases, it becomes very useful to have a markup language with clearly defined containers and content models. That is the impetus behind XML, the "Extensible Markup Language," which will supersede HTML wherever complex and precise information is at a premium.
XML is a subset of SGML, and like SGML it is a metalanguage for making markup languages. It does not consist of a fixed set of tags or elements, like HTML, but is a bundle of SGML conformant rules for making up elements and specifying their content models. There are several ways in which XML differs from SGML:
XML is not quite there yet, although it is clearly getting there. In particular, displaying XML on a browser is still problematic. Most XML documents go through a translation routine that turns them into HTML documents before they are displayed, and only the latest browser versions can display XML directly.
Why not write an HTML document in the first place, if the XML document has to be translated anyhow? While there are good answers to that question, it remains a good question. This document, for instance, is written in HTML because I haven't quite learned how to write an XML document and display it with links in a frameset. But for my scholarly work I now compose XML documents using the TEI markup language. By the time my stuff is finished, the display problems will be solved.
This is a good place to point out that transformability is a major virtue of an SGML or XML document. Writing on a typewriter or word processor isn't quite like chiselling letters in stone, but writing is still governed by the idea that form and content will come together in a final version that says everything the right way and looks pretty on the page. The XML writer has no such illusions. S/he creates documents that can morph with ease from one output format into another, and the composers in XML are willing to invest considerable time into the morphing qualities of their documents. So the fact that at the moment an XML document is likely to be transformed into an HTML document for the sake of display is not a temporary condition. It's just a very specific version of the unstable conditions of all documents: they are always going to be displayed differently or partially, and you want to compose documents that are both sturdy and flexible.
We are at last ready to look in some detail at TEI, the SGML conformant markup language of the Text Encoding Initiative. This international project has been around since the eighties and concluded its major work just before the Web came on the scene.
The goal of the TEI has been to create an environment in which documents of scholarly interest could be encoded such that the properties of the documents would be represented in their transcribed form and the resultant transcription would be independent of any particular programming environment and survive technological change.
By sheer accident early manufacturers only knew how to make paper that happened to be very long-lasting under ordinary storage conditions. Eventually, of course, technological progress made it possible to make cheaper paper that would not last as long. If left to their fate, most books published since 1850 will rot in the next generation, while books printed earlier will last indefinitely. Computers are very unlike paper in many ways, and in particular durability has not been their forte. Whether you think of the physical media of data storage (punch cards, tapes, hard drives, etc.) or of the encoding protocols, there is not much hope that particular document instances last for decades, let alone centuries. Unless you hold to the perhaps plausible position that an orgy of oblivion would be a good thing for mankind, preservation is a key problem for an emerging digital culture, and we do not have the luxury of early modern culture where the problem of keeping stuff was solved accidentally.
The Text Encoding Initiative is the most systematic effort so far to create standards for scholarly memory in an evolving digital culture. The TEI did its work in the SGML environment because it has been the firmest and least proprietary standard for encoding information. SGML is also robust in two other ways that are worth pointing out. SGML markup can be read by humans (though it is not pretty), and it presupposes only one other technical standard, ASCII, or the set of rules by which Roman alphanumerical characters are mapped to binary numbers. A computer that can process ASCII characters can process an SGML document. So as long as there are humans who know English and computers that can process ASCII text files, a TEI-encoded version of Paradise Lost is completely decipherable.
If preservation is not a concern, these are negligible issues. If preservation is a concern they are huge. The mere preservation of information is, like plumbing or electricity, something whose importance we tend to ignore until it breaks down.
In the following more practical remarks I restrict myself to the XML version of the TEI.
The TEI markup language includes about 450 different elements to satisfy all manner of scholarly needs in the humanities. That is a lot of elements for a markup language. There is a "lite" version of the TEI, which is largely based on the experience of marking up the wide variety of electronic texts held in the Oxford Text Archive (OTA). Teixlite, as the XML version is known, contains about 150 elements, which is not particularly light. For pedagogical purposes, I have constructed a baby version of the TEI, TeiXBaby, which contains about 60 elements. This may or may not be a sufficiently capacious markup language to encode a variety of texts, but it should be helpful in getting a grasp of the language as a whole. The following discussion and examples are limited to the scope of TeiXBaby, which is a subset of Teixlite, with some simplification of the content models for some of the containers.
The rules for a particular XML markup language are specified in the document type definition or DTD. A document that conforms to its DTD is called a "valid" document, and an XML editor such as XMetal includes a "validator," i.e. a software routine that checks the document against the rules and either declares it valid or generates a list of errors.
A DTD is the kind of thing that is supposed to be neither seen nor heard, and if you use a standard DTD, like the TEI, you can do so without ever looking at it. It is, however, useful to look at it at least once. Here is a link to the TeiXBaby.dtd, in which I have arranged things so as to make the document easier to read.
A DTD is a set of "declarations" about three different kinds of things: elements, attributes, and entities.
We will stick to elements for a while and return to attributes and entities later.
The computer doesn't care about the order in which such declarations are made, but it cares about the form of each separate declaration. A declaration must be enclosed by the symbols <! > and it must declare
Thus the following is the declaration for the "root element" of a TEI document or the container in which everything else is contained:
<!ELEMENT TEI.2 (teiHeader, text) >
This declares that there is an "element" with the name "TEI.2" and that it must contain, in that order, the elements "teiHeader" and "text." We must also declare the elements teiHeader and text:
<!ELEMENT teiHeader (content model) >
<!ELEMENT text (content model) >
N.B. You must not leave any blank space between <! and the thing declared. The computer knows how to process "<!ELEMENT" but will give you an error message for "<! ELEMENT". This is the kind of "literalism" that causes much frustration in writing instructions for computers.
Once we've made our declarations, we can imagine a minimal document structure. This structure consists of the nested elements preceded by a "prolog" which does two things:
In a document, the element acts as container and contains everything that occurs between its start tag <element> and its end tag </element>. Thus the following is a minimal model of the document so far declared:
<?xml version="1.0"?>
<!DOCTYPE TEI.2 SYSTEM "http://faculty-web.at.northwestern.edu/english/mmueller/TeiXBaby/TeiXBaby.dtd">
<TEI.2>
<teiHeader> ... </teiHeader>
<text> ... </text>
</TEI.2>
The first line must occur in every XML document, and its beginning and closing characters <? ....?> identify it as a "processing instruction." It tells the browser that it is an XML document and should be dealt with in a certain way.
The second line varies with the document. It is a declaration as witnessed by its tags "<!...>." It declares the document type (DOCTYPE) as TEI.2, and the combination of the keyword SYSTEM with the appropriate URL states where the DTD for the document is to be found.
Note also that XML, unlike HTML, is case sensitive: an element named "Tei.2" would be different from "TEI.2", and a hypothetical end-tag </Tei.2> would not close the start-tag <TEI.2>.
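Case sensitivity is easy to observe with any XML parser. The sketch below uses the parser in Python's standard library, which checks only that a document is well-formed (it does not validate against a DTD). The function name is made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Report whether the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The end tag matches the start tag exactly.
print(is_well_formed("<TEI.2><text/></TEI.2>"))
# The end tag differs only in case, so it does not close <TEI.2>.
print(is_well_formed("<TEI.2><text/></Tei.2>"))
```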
The two basic containers of a TEI file relate to each other exactly like a catalog card and a book in a library. The teiHeader element is not part of the document you encode but provides information or "metadata" about it. Like a catalog card, it is a very structured thing. We will ignore it for now but return to it later.
For a fuller description, see The Structure of a TEI text in the TEI Lite Guide.
The text element is the container for the document itself, and its content model is as follows:
<!ELEMENT text (front?, body, back?) >
This means that the text element must contain a body, which may be preceded by one front element and followed by one back element. This content model is clearly based on the book, which typically has stuff of various kinds at the beginning (title page, prefatory material) and may have stuff of various kinds at the end (indexes, etc.).
This is a good moment to introduce what are called "occurrence indicators" and "group connectors" in a content model. For any element, the following possibilities exist for the frequency of its occurrence and are marked by the symbol that follows the element name: no symbol means exactly once, "?" means zero or one time, "+" means one or more times, and "*" means zero or more times.
The order of occurrence either matters or it doesn't matter. In the former case the elements are connected by a comma, in the latter case by the vertical bar "|".
Thus the most restrictive content model for the text element would prescribe one front, followed by one body, and one back element:
<!ELEMENT text (front, body, back) >
The least restrictive content model would be the following:
<!ELEMENT text (front | body | back)* >
This includes the possibility that the text element contains nothing at all. The actual TEI rule states that a document must have a body, just as it states that a body must have something in it.
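Content models of this simple kind behave much like regular expressions, and that resemblance can be used to try them out. The following Python sketch is an illustration only and not a DTD validator: it assumes a model built from element names, parentheses, commas, the vertical bar, and the indicators ?, + and *, and the function names are made up.

```python
import re


def compile_content_model(model: str) -> re.Pattern:
    """Translate a simple content model into a regular expression."""
    # Each element name becomes a group matching "name;".
    pattern = re.sub(
        r"[A-Za-z][\w.]*",
        lambda m: "(?:" + re.escape(m.group()) + ";)",
        model,
    )
    # A comma means "followed by", which is plain concatenation in a regex.
    # The symbols ( ) | ? + * already mean the same thing in both notations.
    pattern = re.sub(r"[,\s]+", "", pattern)
    return re.compile(pattern)


def allows(model: str, children: list[str]) -> bool:
    """Check whether a sequence of child elements fits the content model."""
    sequence = "".join(name + ";" for name in children)
    return compile_content_model(model).fullmatch(sequence) is not None


print(allows("(front?, body, back?)", ["body"]))
print(allows("(front?, body, back?)", ["front", "back"]))
print(allows("(front | body | back)*", []))
```

The first check succeeds because front and back are optional, the second fails because the body is missing, and the third succeeds because the least restrictive model accepts even an empty text element.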
We will ignore the front and back elements for now and turn to the body element.
With the body element we are at least getting closer to the stuff that people actually read and write in a document. So far we have mainly dealt with an abstract shell, and an impatient observer might wonder about the point of being tediously explicit about things that a reader just takes for granted. But that is precisely the point. When we handle a book or "navigate" it, as we now say, a lot of tacit knowledge comes into play as we flip pages and move from prefatory materials to the "book itself" and the stuff at the back, and very simple indicators like the relative thickness of the pages on the left or the right of an open book carry lots of structural information for the reader. But there is no tacit knowledge in a computer file. Things are either spelled out, or they do not exist at all. Spelling them out is a tedious business, but it is also a source of reflection on the extraordinary complexity of the tacit knowledge we draw on in such simple and familiar activities as turning the pages of a book.
The TEI content model for the body element is too complex to represent here. Instead we will proceed by encoding the text of "My Novel," which unfortunately never proceeded beyond the first sentence of Chapter One: "It was a dark and stormy night."
<?xml version="1.0"?>
<!DOCTYPE TEI.2 SYSTEM "http://faculty-web.at.northwestern.edu/english/mmueller/TeiXBaby/TeiXBaby.dtd">
<TEI.2>
This is not only a complete TEI document, but its tag set is sufficient to write a simple document of some length. The text elements of HTML do not go much beyond it.
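The nesting of containers in such a document can be made visible with a short Python sketch. The fragment used here is an assumption for illustration: it leaves out the prolog and the header, and the heading text "Chapter One" and the attribute values are guesses, not a transcription of the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical body for "My Novel"; prolog and teiHeader are left out,
# and the heading text and attribute values are assumptions.
BODY = """
<text>
  <body>
    <div type="chapter" n="1">
      <head>Chapter One</head>
      <p>It was a dark and stormy night.</p>
    </div>
  </body>
</text>
"""


def outline(elem, depth=0):
    """Yield one indented line per element, outermost container first."""
    yield "  " * depth + elem.tag
    for child in elem:
        yield from outline(child, depth + 1)


print("\n".join(outline(ET.fromstring(BODY))))
```

The printed outline shows boxes inside boxes: text contains body, body contains div, and div contains head and p, which are the first containers to hold words.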
For "My Novel" we have introduced three new elements: <head>, <div>, and <p>:
With the <head> and <p> elements, we have for the first time encountered containers that can directly contain stuff rather than other containers. The stuff of a document is words or, in more formal markup terminology, parsed character data, abbreviated as #PCDATA. A very simple (but inadequate) content model for the head and paragraph elements would be:
<!ELEMENT p (#PCDATA) >
<!ELEMENT head (#PCDATA) >
This means that these two elements can contain any amount of text but cannot contain any other containers or elements.
In HTML parlance, there is a distinction between block elements and inline elements. Block elements are the kinds of things that begin a new line, whereas inline elements occur within a line. The practical value of this distinction shows how deeply formatting habits shape our thinking. The full version of the TEI has a formal and quite complex system of classifying element groups, but for the purposes of this introduction, I have grouped the sixty elements of the TeiXBaby DTD as follows:
The nesting rules for these elements are fairly complex, but the following crude triage is a pretty good approximation of what happens in most markup operations:
You can mark up a lot of text with this limited set of elements.
Elements at the paragraph level are the work horses of markup: you write in paragraphs (the <p> element), you identify the units of a poem as lines or stanzas (<l> and <lg>), a passage is identified as a quotation (<q>), or a speech (<sp>), and so forth. The <note> and <stage> elements identify text units that stand in an oblique relationship to the main text. You can also identify all the sentences in a work (the <s> tag) or mark arbitrary segments (<seg>). These last two elements are not really at the paragraph level, but are usefully mentioned in that context.
The pivotal tag here is the <p> tag, and important constraints of the TEI markup language turn on what can contain or be contained by a paragraph:
This pivotal position of the <p> tag is recognized by the formal classification of element groups in the full TEI, which distinguishes between
In the My Novel document I introduced "attributes," and this is a good point to explain them. Attributes are used to add specifications to elements in exactly the same way in which adjectives add specifications to nouns. An attribute always exists as a "key-value pair" in which the name of the attribute is followed by an equal sign, which is followed by the value of the attribute in quotation marks. The attribute values are declared in the start tag of an element with no punctuation but a blank space between different attributes, as in the following:
<div type="chapter" n="1"> . . . . </div>
You must not repeat the attribute values in the closing tag. If you have ever looked at a standard Web link, you will recognize the syntax of attributes from the example of an href, which is by far the most common attribute:
<a href="faculty-web.at.nwu.edu/english/mmueller">My website</a>
You cannot just use attributes as you please but must declare them in the DTD in an ATTLIST declaration, which specifies:
In the TEI DTD, every element shares the same four attributes that are known as "global attributes." They are id, n, lang, and rend, and they are declared as follows:
<!ATTLIST element
  id   ID    #IMPLIED
  n    CDATA #IMPLIED
  lang IDREF #IMPLIED
  rend CDATA #IMPLIED >
The id attribute is self-explanatory, but its value must be a unique ID value in the document, and the value must begin with a letter of the alphabet.
The n attribute may take any alphanumerical value, but it is most commonly used to count lines, stanzas, paragraphs or whatever else you want to number.
The lang attribute refers to the language used in an element. Thus
<q lang="FR">
refers to a quotation in French. The value of the lang attribute must be an IDREF or reference to an ID that exists in the document. IDREFs are a generic feature of SGML and are used to enforce consistent references. For example, the <sp> element for tagging dialog has a "who" attribute with IDREF values. The parser will return an error message if the value of the attribute is not matched by a speaker ID.
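The ID and IDREF check that a validating parser performs can be imitated in a few lines of Python. This is a sketch under assumptions, not the parser's actual procedure: the sample fragment, the use of list items to carry the speaker IDs, and the function name are all invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two declared speaker IDs, three speeches.
SCENE = """
<body>
  <list>
    <item id="SirTo">Sir Toby Belch</item>
    <item id="Mar">Maria</item>
  </list>
  <sp who="SirTo"><p>...</p></sp>
  <sp who="Mar"><p>...</p></sp>
  <sp who="SirAnd"><p>...</p></sp>
</body>
"""


def dangling_speakers(xml_text: str) -> list[str]:
    """Return every who value that matches no id in the document."""
    root = ET.fromstring(xml_text)
    declared = {e.get("id") for e in root.iter() if e.get("id") is not None}
    return [sp.get("who") for sp in root.iter("sp")
            if sp.get("who") not in declared]


print(dangling_speakers(SCENE))
```

Only "SirAnd" is reported, because no element in the fragment carries that value as its id.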
The "rend" attribute is used to include information about how a particular element is typographically represented, as in the following example:
<q lang="FR" rend="italics">
A number of TEI tags have the "type" attribute, which provides an easy way of generating new elements. For example, a combination of "type" and "n" attributes lets you specify the structure of the Faerie Queene quite satisfactorily, as is apparent from the following example, which shows the nesting of a particular line:
<div type="book" n="1">
  ...
  <div type="canto" n="2">
    ...
    <div type="stanza" n="5">
      ...
      <l n="7">The eye of reason was with rage yblent</l>
      ...
    </div>
    ...
  </div>
  ...
</div>
Alternatively, the stanza could be tagged as a linegroup element. The TEI also has provision for seven levels of numbered divs. Using the <lg> tag and numbered divs, you get the following, which is perhaps a little clearer but amounts to the same thing:
<div1 n="1">
  ...
  <div2 n="2">
    ...
    <lg type="stanza" n="5">
      ...
      <l n="7">The eye of reason was with rage yblent</l>
      ...
    </lg>
    ...
  </div2>
  ...
</div1>
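Once the structure is declared with "type" and "n" attributes, a line can be looked up by its address. The Python sketch below works on the first of the two encodings (divs with type attributes), with the elided material left out. The function name is made up, and the path syntax is the limited XPath subset of Python's standard library.

```python
import xml.etree.ElementTree as ET

POEM = """
<div type="book" n="1">
  <div type="canto" n="2">
    <div type="stanza" n="5">
      <l n="7">The eye of reason was with rage yblent</l>
    </div>
  </div>
</div>
"""


def find_line(root, book, canto, stanza, line):
    """Return the text of the line at book/canto/stanza/line, or None."""
    if root.get("type") != "book" or root.get("n") != book:
        return None
    path = (f"div[@type='canto'][@n='{canto}']"
            f"/div[@type='stanza'][@n='{stanza}']"
            f"/l[@n='{line}']")
    hit = root.find(path)
    return None if hit is None else hit.text


print(find_line(ET.fromstring(POEM), "1", "2", "5", "7"))
```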
I have grouped these three types of elements together because they typically contain material that is illustrative rather than discursive.
A list consists of a list element and one or more items as in the following example:
<list> <item>bread</item> <item>milk</item> <item>bananas</item> </list>
A list can occur within a paragraph, and the following is a valid piece of markup, although it may not be worth doing:
<p> I went to the store to buy<list><item>bread,</item> <item>milk,</item> <item>and bananas</item></list></p>
On the other hand, the parser will complain about the following markup:
<p> I went to the store to buy<list><item>bread</item>, <item>milk</item>, and<item> bananas</item></list></p>
This seems more logical because in each case the <item> tag encloses only the actual list item. But as a result the raw text data "," and ", and" appear in the list element, which does not allow #PCDATA.
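The parser's complaint can be reproduced by looking for character data that sits directly inside the list element rather than inside an item. The following Python sketch is an illustration of that one rule only, with a made-up function name. Whitespace between items is ignored, as a validating parser would ignore it in a container that holds only elements.

```python
import xml.etree.ElementTree as ET

GOOD = "<list><item>bread,</item> <item>milk,</item> <item>and bananas</item></list>"
BAD = "<list><item>bread</item>, <item>milk</item>, and<item> bananas</item></list>"


def stray_text(list_xml: str) -> list[str]:
    """Return the character data found in <list> but outside any <item>."""
    list_elem = ET.fromstring(list_xml)
    # Text before the first child, then the text that follows each child.
    pieces = [list_elem.text or ""] + [child.tail or "" for child in list_elem]
    return [p.strip() for p in pieces if p.strip()]


print(stray_text(GOOD))
print(stray_text(BAD))
```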
A TEI table is in some ways like an HTML table, although it uses different element names. The table element contains row elements, which contain cell elements. The corresponding element names in HTML are <table>, <tr> (table row), and <td> (table data). On the other hand, the content model for a TEI cell element is very similar to that for a paragraph, which means that it cannot contain paragraphs. Unlike an HTML table, a TEI table cannot be repurposed as a layout tool.
The <figure> and <figDesc> elements are used in TEI to refer to and include visual materials in the text.
There are a number of elements that let you identify words or phrases for various purposes. The most common are
Milestone elements let you insert markers. The linebreak and pagebreak elements <lb> and <pb> are the most obvious examples. The milestone element lets you define any turning point in the text by its "unit" attribute.
Because milestone elements simply mark a point in the text they are "empty" or open and close at the same point in the text. Empty tags are marked by a convention that combines the opening and closing tag in one symbol: <lb/>. In the following example, the linebreak tag is used to represent the lineation of a prose speech in the Riverside Shakespeare:
<sp who="SirTo">
<speaker><hi rend="i">Sir To.</hi></speaker>
<p><lb n="25"/> Fie, that you'll say so! He plays o' th'
<lb n="26"/> viol-de-gamboys, and speaks three or four languages
<lb n="27"/> word for word without book, and hath all the good
<lb n="28"/> gifts of nature.</p>
</sp>
This tagging has the same effect as using <milestone unit="linebreak" n="25"/>, but it is more economical.
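To see what a program can do with such markers, here is a minimal sketch in Python that recovers the print lineation of Sir Toby's speech. It assumes the standard library's xml.etree.ElementTree, and the function name lines_by_number is invented for this illustration. Because an empty element has no content of its own, the parser stores the text that follows each <lb/> as that element's "tail."

```python
import xml.etree.ElementTree as ET

speech = """<sp who="SirTo"> <speaker> <hi rend="i">Sir To.</hi></speaker>
<p><lb n="25"/> Fie, that you'll say so! He plays o' th'
<lb n="26"/> viol-de-gamboys, and speaks three or four languages
<lb n="27"/> word for word without book, and hath all the good
<lb n="28"/> gifts of nature.</p> </sp>"""


def lines_by_number(sp_xml):
    """Map the n attribute of each <lb/> to the text that follows it."""
    paragraph = ET.fromstring(sp_xml).find("p")
    lines = {}
    for lb in paragraph.iter("lb"):
        # The milestone itself is empty; the text after it is its tail.
        lines[lb.get("n")] = (lb.tail or "").strip()
    return lines


for number, text in lines_by_number(speech).items():
    print(number, text)
```

The same loop would work for any other milestone: only the element name and the attribute that carries the number would change.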
References of various kinds are critical in scholarly markup. Specifically bibliographical elements are discussed in the next section. Here I focus on other elements that point to objects inside and outside the encoded document.
The <ref> element is conventionally used for straightforward cross-references within a text.
The <rs> element (referencing string) is a useful tool for relating different expressions to the same referent or for disambiguating phrases with different referents because it has a "key" attribute that lets you set up a relationship between a phrase and a value, as in the following Shakespearean examples:
<rs key="Hamlet">Hamlet</rs> <rs key="HamletSr">Hamlet</rs> <rs key="HenryIV">Hereford</rs> <rs key="HenryIV">Bolingbroke</rs> |
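A short Python sketch shows how the key attribute pays off once a program reads the text. It assumes the standard library's xml.etree.ElementTree. The <p> wrapper around the four examples and the function name names_by_key are assumptions made for this illustration, since a parser needs a single root element.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The four <rs> examples, wrapped in a <p> so the fragment has one root.
fragment = """<p>
<rs key="Hamlet">Hamlet</rs> <rs key="HamletSr">Hamlet</rs>
<rs key="HenryIV">Hereford</rs> <rs key="HenryIV">Bolingbroke</rs>
</p>"""


def names_by_key(xml_string):
    """Group the text of every <rs> element under its key attribute."""
    groups = defaultdict(list)
    for rs in ET.fromstring(xml_string).iter("rs"):
        groups[rs.get("key")].append(rs.text)
    return dict(groups)


print(names_by_key(fragment))
```

The two Hamlets end up under different keys, while Hereford and Bolingbroke are gathered under the same one.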
The TEI includes pointer elements such as <ptr>, <xptr>, and <xref>, which support very complex and precise hyperlinks. The current draft of the TEI was completed before the Web became popular. From a conceptual perspective, the linking powers of the TEI are much greater than those of simple HTML, and the development of link procedures in XML owes much to the TEI. On the other hand, the current version of the TEI does not have a built-in way of doing simple hyperlinks. In TeiXBaby, I have changed the attribute structure of the xref element in the light of current XML specifications so that the xref element can be used to do the simple hyperlinks on which the Web is built.
The basic bibliographical elements are self-explanatory. The <bibl> element can contain raw text or the elements <author>, <editor>, <title>, <date>, <pubPlace>, <publisher>.
In some circumstances, the <cit> tag is a useful container for a combination of a quotation and a bibliographical reference. If, for instance, you work with a text in another language and always include a translation, the <cit> element helps to keep original and translation together, as in the following example:
<cit> <q lang="GRC"> <l >eu gar egô tode oida kata phrena kai kata thumon:</l> <l >essetai êmar hot' an pot' olôlêi Ilios hirê</l> <l >kai Priamos kai laos eümmeliô Priamoio.</l> </q> <q lang="EN"> <l >For I know this thing well in my heart, and my mind knows it:</l> <l >there will come a day when sacred Ilion shall perish,</l> <l >and Priam, and the people of Priam of the strong ash spear.</l> </q> <bibl n="Hom. Il. 6.447-449"> (6.447-9)</bibl> </cit> |
The teiHeader element is best thought of as a catalog card for a document. It is sufficiently complex and rigid to meet the demands of a cataloguer in a research library. When you formally encode a document for archival purposes, its rigor is essential. If you use a limited tag set of the TEI to write a lecture, it feels like overkill. But the minimal and mandatory elements of the teiHeader are useful things to keep track of in any kind of formal writing.
The mandatory elements of the teiHeader can be summarized as follows:
The mandatory elements can be done in a formal bibliographical way or informally. Thus the following is a minimal teiHeader of the current document done in an informal style:
<teiHeader> <fileDesc> <titleStmt> <title>A very gentle introduction to the TEI markup language</title> <author>Martin Mueller</author> </titleStmt> <publicationStmt> <p> Unpublished manuscript put on his website by the author at http://faculty-web.at.nwu.edu/english/mmueller/teixintro </p> </publicationStmt> <sourceDesc> <p>prepared for the Ariadne seminar on information technology and scholarship in the humanities at Northwestern University, January 2001</p> </sourceDesc> </fileDesc> </teiHeader> |
Entities and references to them are a generic feature of SGML. An entity reference is a placeholder, typically an abbreviation that refers to something longer, whether a complete name or an entire file. In your document the name of an entity is enclosed by the ampersand and semicolon signs. Since these symbols do not co-occur in an ordinary word, their presence is an unambiguous marker for an "entity reference," i.e. the character string you use to refer to the entity.
Entities are declared in the DTD, just like elements or attributes, as in the following example:
<!ENTITY mm "Martin Mueller" > |
This means that when the parser encounters "&mm;" it replaces it with "Martin Mueller". It will also replace &Shakespeare; with the complete text of Shakespeare's plays if the entity "Shakespeare" is defined as a reference to a file that contains all the plays.
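You can watch this replacement happen with a few lines of Python. The sketch below assumes the standard library's xml.etree.ElementTree and declares the entity inside the document itself (in what is called the internal DTD subset) rather than in a separate DTD file. The element name "note" is invented for this illustration. It covers only the simple case of a text entity; an entity that points to an external file, like the Shakespeare example, is a different matter that this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# A tiny document that declares the entity in its own internal DTD subset.
document = """<!DOCTYPE note [
  <!ENTITY mm "Martin Mueller">
]>
<note>This essay was written by &mm;.</note>"""

note = ET.fromstring(document)

# The parser has already replaced the reference with its value.
print(note.text)  # This essay was written by Martin Mueller.
```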
The point of entity references is to save space, but the most common entity references actually take up more space than their replacement values. Because computers are still very inconsistent at handling anything outside the standard Roman alphabet, it has become customary to express foreign characters with periphrastic expressions that take the form of entity references, such as
&auml; for "a with an umlaut" or ä; &eacute; for "e with an acute accent" or é |
This is the standard procedure on the Web.
Entity references are also used to "escape" the angle brackets used as tag markers in SGML documents. When an SGML parser encounters the symbol "<", it interprets it as the opening of a start tag, and if you use it in another way, it will become confused. If you want to use "<" when it doesn't refer to a start tag, you must use its entity reference, which is "&lt;" (less than). Ditto for ">", whose entity reference is "&gt;" (greater than).
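In practice you rarely do this escaping by hand, because most programming languages offer a routine for it. As a small illustration, the following Python sketch uses the escape and unescape functions from the standard library module xml.sax.saxutils. The sample sentence is invented. Note that the ampersand itself has to be escaped as well (as "&amp;"), since it is the character that opens every entity reference.

```python
from xml.sax.saxutils import escape, unescape

raw = "if a < b & b > c"

# escape() replaces the three characters that markup reserves for itself.
escaped = escape(raw)
print(escaped)            # if a &lt; b &amp; b &gt; c

# unescape() turns the entity references back into plain characters.
print(unescape(escaped))  # if a < b & b > c
```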
Knowledge of some entity references is built into browsers, but in a general way, the representation of odd characters is one of the peskiest problems you run across when you deal with texts on a computer. The only way to avoid these problems is to stick to modern English or Latin in whatever you write.
Two different encodings of a fragment from Ophelia's mad scene may illustrate the power of the TEI DTD as a way of capturing textual information. The first encoding follows the TEI. The second is an automatically tagged version that uses a very simple ad hoc DTD. It was developed by Jon Bosak to demonstrate basic features of XML.
The TEI version encodes the text of the Riverside Shakespeare and is derived from work done at Northwestern in 1996 for Houghton Mifflin Company. The scene fragment is embedded in a skeleton version of the entire document, including front matter, consisting of the simplified castlist, and a somewhat simplified header. Only TeiXBaby tags are used.
Dramatic texts tend to have a lot of tagging because of rapid speaker changes and stage directions. This particular example, short as it is, is complicated by the fact that it includes prose, blank verse, and rhymed verse. In addition, some of the lines are incomplete, the linebreaks of the print edition are marked, and names are tagged with the <rs> tag and key attribute. So this is a very intensely tagged text that would allow an appropriate search engine to recover textual elements at quite high levels of granularity.
It is worth saying that in any SGML document blank spaces and line breaks outside of tags do not matter: text layout is controlled by the software that processes the document. In this example, I have used color, spacing, and a table layout with labels to make it easier for a reader to see the structure of the code. You can also see how this code fragment looks in a browser, where at this point the default display of an XML document shows the tag structure rather than the formatted document.
Document declaration | <?xml version="1.0"?> <!DOCTYPE TEI.2 SYSTEM "teixbaby.dtd"> |
The root element | <TEI.2> |
The header | <teiHeader> <fileDesc> <titleStmt> <title>Hamlet, Prince of Denmark: an electronic edition</title> <author>Shakespeare,William</author> </titleStmt> <publicationStmt> <publisher>Houghton Mifflin</publisher> <pubPlace>Boston MA</pubPlace> <date>1997</date> </publicationStmt> <sourceDesc> <bibl> <title>The Riverside Shakespeare</title> <author>Shakespeare,William</author> <publisher>Boston: Houghton Mifflin,1974</publisher> </bibl> </sourceDesc> </fileDesc> </teiHeader> |
text | <text> |
front | <front><div type="castlist"> <list><item id="Oph">OPHELIA, daughter to Polonius</item> <item id="King">CLAUDIUS, King of Denmark</item> <item id="Queen">GERTRUDE, Queen of Denmark</item> </list> </div></front> |
body with divs | <body><div type="act" n="4"><div n="4.5" type="scene"> |
Queen speaks | <stage><hi rend="i">Enter</hi>KING.</stage> <sp who="Queen"><speaker> <hi rend="i">Queen.</hi></speaker> <l n="37" part="Y"> Alas, look here, my lord.</l> </sp> |
Ophelia speaks | <sp who="Oph"><speaker> <hi rend="i">Oph.</hi></speaker><stage> <hi rend="i">Song.</hi></stage> <lg part="M" type="song"><l n="38"> "Larded all with sweet flowers,</l> <l n="39"> Which bewept to the ground did not go</l> <l n="40"> With true-love showers."</l></lg> </sp> |
King speaks | <sp who="King"><speaker> <hi rend="i">King.</hi></speaker> <l n="41" part="Y"> How do you, pretty lady?</l> </sp> |
Ophelia speaks | <sp who="Oph"><speaker> <hi rend="i">Oph.</hi></speaker> <p><lb n="42"/> Well, <rs key="God">God</rs> dild you! They say the owl was a <lb n="43"/> baker's daughter. Lord, we know what we are, but <lb n="44"/> know not what we may be. <rs key="God">God</rs> be at your table!</p> </sp> |
King speaks | <sp who="King"><speaker> <hi rend="i">King.</hi></speaker> <l n="45" part="Y"> Conceit upon her father. </l> </sp> |
Ophelia speaks | <sp who="Oph"><speaker><hi rend="i">Oph.</hi></speaker> <p><lb n="46"/> Pray let's have no words of this, but when <lb n="47"/> they ask you what it means, say you this:</p> <stage> <hi rend="i">Song.</hi></stage> <lg part="M" type="song"> <l n="48">"To-morrow is <rs key="StValentine">Saint Valentine's</rs> day,</l> <l n="49">All in the morning betime,</l> <l n="50"> And I a maid at your window, </l> <l n="51"> To be your <rs key="StValentine">Valentine</rs>.</l> <l n="52"> "Then up he rose and donn'd his clo'es,</l> <l n="53"> And dupp'd the chamber-door,</l> <l n="54"> Let in the maid, that out a maid</l> <l n="55"> Never departed more."</l> </lg> </sp> |
body with divs | </div> </div></body> |
text | </text> |
Root element | </TEI.2> |
Jon Bosak's encoding of the same scene fragment is much simpler. It is derived from the Moby Shakespeare, a nineteenth-century edition, and a quick look at the markup shows that the structural divisions are inferred from the typographical layout of the text as captured in the electronic transcript.
The text hierarchy is flatter: the root element PLAY contains a set of disparate children, TITLE, FM, PERSONAE, SCNDESCR, PLAYSUBT, ACT. The text is tagged line by line, but the content of the line element is simply a line of text in the source edition. The line tag does not, as in TEI, refer to a line of verse. So this encoding distinguishes neither between prose and verse, nor between blank verse and stanzaic verse, nor between complete and incomplete lines.
This tagging is better than no tagging: it lets you extract, for instance, a list of all the words spoken by Ophelia. But the comparison with the TEI DTD clearly reveals the difference between, on the one hand, a coding scheme that is heuristically derived from typographical markup and automatically applied and, on the other hand, a coding scheme that is analytically derived and constructed to satisfy the requirements of scholarly inquiry.
Click here to see the code fragment in a browser.
Document type declaration | <?xml version="1.0"?> <!DOCTYPE PLAY SYSTEM "play.dtd"> |
Root element | <PLAY> |
Play title | <TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE> |
Front matter | <FM><P>Text placed in the public domain by Moby Lexical Tools, 1992.</P><P>SGML markup by Jon Bosak, 1992-1994.</P><P>XML version by Jon Bosak, 1996-1998.</P><P>This work may be freely copied and distributed worldwide.</P></FM> |
Cast list | <PERSONAE> <TITLE>Dramatis Personae</TITLE> <PERSONA>CLAUDIUS, king of Denmark. </PERSONA> <PERSONA>GERTRUDE, queen of Denmark, and mother to Hamlet.</PERSONA> <PERSONA>OPHELIA, daughter to Polonius.</PERSONA> </PERSONAE> <SCNDESCR>SCENE Denmark.</SCNDESCR> |
Title | <PLAYSUBT>HAMLET</PLAYSUBT> |
Act | <ACT><TITLE>ACT IV</TITLE> |
Scene | <SCENE><TITLE>SCENE V. Elsinore. A room in the castle.</TITLE> |
Queen speaks | <STAGEDIR>Enter KING CLAUDIUS</STAGEDIR><SPEECH><SPEAKER>QUEEN GERTRUDE</SPEAKER><LINE>Alas, look here, my lord.</LINE></SPEECH> |
Ophelia speaks | <SPEECH><SPEAKER>OPHELIA</SPEAKER> <LINE><STAGEDIR>Sings</STAGEDIR></LINE><LINE>Larded <LINE>Which bewept to the grave did not go</LINE> |
Claudius speaks | <SPEECH><SPEAKER>KING CLAUDIUS</SPEAKER><LINE>How do you, pretty lady?</LINE></SPEECH> |
Ophelia speaks | <SPEECH><SPEAKER>OPHELIA</SPEAKER> <LINE>Well, God 'ild you! They say the owl was a baker's</LINE> <LINE>daughter. Lord, we know what we are, but know not</LINE> <LINE>what we may be. God be at your table!</LINE></SPEECH> |
Claudius speaks | <SPEECH><SPEAKER>KING CLAUDIUS</SPEAKER><LINE>Conceit upon her father.</LINE></SPEECH> |
Ophelia speaks | <SPEECH><SPEAKER>OPHELIA</SPEAKER><LINE>Pray you, let's have no words of this; but when they</LINE> <LINE>ask you what it means, say you this:</LINE> <STAGEDIR>Sings</STAGEDIR><LINE>To-morrow is Saint Valentine's day,</LINE><LINE>All in the morning betime,</LINE><LINE>And I a maid at your window,</LINE><LINE>To be your Valentine.</LINE> <LINE>Then up he rose, and donn'd his clothes,</LINE> <LINE>And dupp'd the chamber-door;</LINE> <LINE>Let in the maid, that out a maid</LINE> <LINE>Never departed more.</LINE></SPEECH> |
</SCENE> | |
</ACT> | |
</PLAY> |
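Even this simple encoding can be queried by a program. The following Python sketch collects the lines spoken by one character. It assumes the standard library's xml.etree.ElementTree, it works on a shortened copy of the fragment (three speeches only, without the front matter), and the function name lines_spoken_by is invented for this illustration.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the fragment: three speeches from the scene.
play = """<PLAY><ACT><SCENE>
<SPEECH><SPEAKER>KING CLAUDIUS</SPEAKER><LINE>How do you, pretty lady?</LINE></SPEECH>
<SPEECH><SPEAKER>OPHELIA</SPEAKER>
<LINE>Well, God 'ild you! They say the owl was a baker's</LINE>
<LINE>daughter. Lord, we know what we are, but know not</LINE>
<LINE>what we may be. God be at your table!</LINE></SPEECH>
<SPEECH><SPEAKER>KING CLAUDIUS</SPEAKER><LINE>Conceit upon her father.</LINE></SPEECH>
</SCENE></ACT></PLAY>"""


def lines_spoken_by(play_xml, speaker):
    """Return the text of every LINE in speeches assigned to one speaker."""
    root = ET.fromstring(play_xml)
    lines = []
    for speech in root.iter("SPEECH"):
        if speech.findtext("SPEAKER") == speaker:
            # itertext() also gathers text nested inside a LINE,
            # for instance in a STAGEDIR.
            lines.extend("".join(line.itertext())
                         for line in speech.findall("LINE"))
    return lines


ophelia = lines_spoken_by(play, "OPHELIA")
words = " ".join(ophelia).split()
print(len(ophelia), "lines,", len(words), "words")
```

The limits of the encoding show up at once: the program can tell you what Ophelia says, but nothing in the markup tells it whether she is speaking prose or singing verse.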
It is one thing to encode a document; it is another to present it to readers in a readable format. It is worth repeating that the separation of document structure and document layout, which is at the heart of XML, runs against deeply engrained writerly habits. Learning how to write is a profoundly "graphic" activity. One of the major difficulties with XML at the moment is that the technologies for formatting encoded documents have not yet stabilized and that the available tools are either too expensive or not user-friendly enough. If you look at the development of HTML tools over the past five years, you can predict that such tools will come. They are not, however, quite there yet.
There are a number of ways of transforming an XML document into a document for display. I mention them briefly before turning in more detail to the one method that is currently within the reach of a non-programmer.
First, it is possible to convert an XML document into an HTML document. This is done with the help of XSLT, which stands for "extensible stylesheet language: transformations." XSLT is a cousin of XML. Think of it as a set of rules for turning an XML document into another XML document or a document of another kind altogether. This is not yet something you can do for yourself unless you are a programmer. But it is useful to know about it.
Second, and by the same logic, you can convert an XML document into a PDF file, a format suitable for printing. This is also done with the help of XSLT, and it is also currently beyond the range of the amateur.
Finally, you can take an XML document and attach a CSS stylesheet to it. If your browser has an XML capability (as Internet Explorer 5.0 and up has), it will read the XML document and apply the stylesheet rules to it. The document looks just like an ordinary Web document, although it isn't HTML.
CSS stands for "cascading style sheet." "Cascading" refers to the way rules from different sources are combined, and in practice a rule once stated for one level usually applies to all levels beneath it. CSS stylesheets were first developed in an HTML environment as an economy measure. In an ordinary Web document every formatting instruction is repeated in tedious detail in every instance of a tag. A stylesheet, by contrast, is a bundle of rules of the kind: "When you come to an instance of X, apply Style Y to it."
CSS style sheets can be used with XML documents. If you are not going to change the order of your document, they will work just fine. You can state a set of rules about how to display particular elements, including not displaying them at all.
See how this works with the TEI-Lite XML version of the last act of A Midsummer Night's Dream, prepared by Craig Berry. This link will take you to the formatted scene. The formatting uses color coding to show what you can do with different elements. Thus the cast list and stage directions appear in blue; speaker prefixes are olive; and names are in bold. Take a look at the underlying XML document by choosing Source from the View menu. You notice that none of the formatting instructions appear in that document. Thus you don't see the typical <b> . . </b> tag around words that appear in bold. No formatting is attached to stage tags. If you look at speaker tags, you see that the speaker prefixes are enclosed in a <hi> . . </hi> tag, and this tag carries the attribute 'rend="i"'. It turns out that speaker prefixes are in fact displayed in italics, but this tag is not the source of the instruction.
All the formatting instructions come from the stylesheet "shakespeare.css," which is referred to in the second line, where it is part of the processing instruction
<?xml-stylesheet type="text/css" href="shakespeare.css"?> |
This tells the browser: apply a stylesheet to this document. The stylesheet is of the type "css," and its address is "shakespeare.css" (which means that the stylesheet is in the same directory as the document).
Now take a look at the stylesheet. Your computer may do things automatically with files ending in the extension ".css," and for that reason I have made a copy of the stylesheet with a "txt" extension. You can get it by going to the URL http://faculty-web.at.nwu.edu/english/mmueller/ariadne/MND/shakespearecss.txt
The file will open up in your text editor. What you see is not very exciting, but it is quite simple to follow. Every line consists of an element name followed by instructions in curly braces. Within curly braces, instructions are separated by a semi-colon. The terms of these instructions are pretty self-explanatory, except for "block" and "inline." This is HTML jargon. A block element is displayed as a block of text with a blank line preceding and following. An inline element occurs inside a block element. Or more simply: a block element forces a line break, but an inline element does not.
In a few cases you see one element name followed by another without any punctuation. Here the second element is the "child" of the first or "parent" element. Thus "div0 head" refers to the head element in a div0, and "div1 head" refers to the head element in a div1.
You may want to save the Shakespeare scene, the stylesheet, and the teilite DTD to your computer, put them all in the same directory, and then experiment with restyling the document. This is a set of skills that applies equally to HTML and XML. Fancy HTML tools, such as Dreamweaver and FrontPage, have built-in support for choosing style settings. This makes it easier to avoid mistakes, but it also mystifies the entire business, which is quite simple. You may or may not want to write your stylesheets by hand, but it helps to know just how simple they are as documents.
To my mind the best discussion of this and related matters is found in the XML Bible by Elliotte Rusty Harold. This is one of many books on XML, but it is special because it approaches everything from the perspective of a writer rather than a programmer. It is also very helpful if you're going to stay within the HTML environment. Many of the underlying issues are the same, and the writerly emphasis, clarity, and general intelligence make this a particularly good book to use. It certainly does an extremely good job of explaining in relatively non-technical language the kinds of technologies and protocols that are the framework for TEI in XML.
Jerry Goldman's archive of Supreme Court materials is built around the audio files of Supreme Court arguments. Making these documents available to the wider public may be thought of as demystifying the sacred space of a written legal opinion. Alternately, it may be thought of as re-embedding written abstraction in a fuller reality.
In this archive, the TEI is used to encode the written transcripts and synchronize them with the audio files so that you can listen and read at the same time. This involves a triple translation:
The third step is of technical interest only, and we skip it here. But you can listen to the stretch from which the following example is taken by going to http://oyez.org/election/00-949.portraits.ram (you will need to install the free Real player to hear the file and watch the transcript). It's useful to look at two different XML realizations of a transcript. An argument in a court of law is structurally quite close to drama, and you see that there are obvious resemblances between the encodings of the Ophelia fragment and a fragment from the second Supreme Court hearing about the last election.
In the following I compare different ways of encoding the transcripts of a Supreme Court argument. The example in both cases is a snippet from the Klock interlude, the single moment of levity in the Supreme Court hearing about the last election. The first transcription uses the DTD of the Transcriber program. In the second version, the same information is encoded in the TeiXBaby DTD (which is compatible with TEI Lite and the full TEI). The comparison focuses on the structural articulation of discursive units. I have simplified the technical elements by which the transcript is synchronized with the audio source file.
In both examples, the fragment from the transcript is embedded in the entire structure of the document. You can listen to
The transcriber DTD uses a quite simple text model that has analogs to drama. You may want to look at a textfile of the DTD without its attributes, which is relatively easy to read.
For a representation of this xml file in a browser, click here.
It is worth pointing out that there are special rules for transforming XML documents into other kinds of documents. These are called XSLT, and the transformability of XML documents is one of their great advantages. "Transformation" may mean any number of things, including formatting for printing, selecting parts of a document, and changing parts of it.
<?xml version="1.0"?> <!DOCTYPE TEI.2 SYSTEM "http://faculty-web.at.nwu.edu/english/mmueller/babyTEI/teixbaby.dtd">
<TEI.2> |