Using msDescription with Other Manuscript Description Standards or: TEI as interchange format for manuscript descriptions
(Torsten Schaßan, i-d-e / Herzog August Bibliothek Wolfenbüttel)
Abstract
The msDescription chapter is a relatively new metadata format within the TEI system, even though its predecessor MASTER has a longer history. It can nevertheless serve as a format for information interchange, as it already does in Germany. The central database for manuscript descriptions in Germany, Manuscripta Mediaevalia, holds manuscript descriptions in a format called HiDA/MIDAS, and TEI works as an interface format to this database. Another crosswalk, from TEI to AMREMM, will soon be designed. The peculiarities of these formats and of their crosswalks to TEI are discussed in this paper.
Contents
Departing slightly from what was proposed in the application for the panel, my paper deals mainly with the encoding standard used by the German project and central manuscript database, Manuscripta Mediaevalia (ManuMed). To understand its structure, contents and behaviour I have to go back into the history of this database as well as the history of the underlying format. Only after dealing with this in greater depth will I compare it to the "rising star" on the manuscript description horizon, AMREMM (=Descriptive Cataloging of Ancient, Medieval, Renaissance, and Early Modern Manuscripts), which is an application of MARC to medieval manuscript cataloguing. Problems and results of crosswalks between these two formats and the TEI will be discussed. Finally, the value of TEI as an interchange format will be evaluated.
The (hi)story behind Manuscripta Mediaevalia
In Germany, modern in-depth cataloguing (Tiefenerschließung) of manuscripts started in the 1960s. The cataloguing rules that still prevail (Guidelines for manuscript cataloguing, 5th ed. 1992) were developed in 1973. They were assembled at the request of the main funding organisation, the German Research Council. Since then, thousands of manuscripts have been described following these highly standardised and structured rules, and hundreds of catalogues have been published, most of which are today, in one way or another, included in ManuMed. Retroconversion of printed catalogues started in the 1990s, but at that time only the indices were dealt with. This resulted in a first database, Handschriften des Mittelalters. Bernd Michael was responsible for the project, which produced the Gesamtindex mittelalterlicher Handschriftenkataloge (1993). The information was stored in a system called DBI-LINK (=Deutsches Bibliotheksinstitut, among others ZDB). This system was meant to be an information system for periodicals.
ManuMed: The contents
To state it explicitly once more: the starting point for the manuscript database was index entries, which form a very flat data structure. In the case of indices of names, places, etc., each database entry was provided with a maximum of three fields to represent the hierarchical levels, plus one field each for the shelfmark and for the page reference to the containing manuscript. In the case of indices of incipits this is reduced to three fields, because only one field is needed for the entry itself. Such flat data could easily be handled by the database systems of that time.
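The flat index record described above can be pictured as a fixed row of five columns. The following is only a minimal sketch: the class name IndexEntry and its attribute names are hypothetical, and the sample values are borrowed from the sample description given later in this paper purely for illustration (treating HSK0078_b204 as a page reference is an assumption).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical model of one flat index record (not the actual database schema)."""
    levels: Tuple[str, ...]  # one to three hierarchical levels of the index entry
    shelfmark: str           # shelfmark of the manuscript referred to
    page_ref: str            # reference to the catalogue page

    def __post_init__(self):
        if not 1 <= len(self.levels) <= 3:
            raise ValueError("an index entry has between one and three levels")

    def as_row(self) -> Tuple[str, ...]:
        """Flatten to a fixed five-column row, padding unused levels with ''."""
        padded = self.levels + ("",) * (3 - len(self.levels))
        return padded + (self.shelfmark, self.page_ref)


entry = IndexEntry(("Provenienz I", "Bobbio"), "64 Weissenburg", "HSK0078_b204")
print(entry.as_row())
```

An incipit entry would use a single level, so that only three of the five columns carry data.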
ManuMed: The structure
When the Staatsbibliothek Berlin - Preussischer Kulturbesitz, the Bayerische Staatsbibliothek München and Bildarchiv Fotomarburg chose to maintain the database, the data were exported into the structure already in use in Marburg. Since the late 1970s, a format for art-historical information had been developed at Fotomarburg and extensively tested with sample data. It had been named MIDAS (=Marburger Inventarisations-, Dokumentations- und Administrationssystem). This format organises the information both hierarchically and according to the entity-relationship model: information on each entity should be stored only once. Yet it was very much data-centric, which means that texts would not be stored in a way that meets the human need to "just write texts"; instead, information had to be broken down into small pieces. This characteristic was useful with respect to the storage of index entries. A software called HiDA (=Hierarchischer Dokument-Administrator) was developed to handle the data. But still, this system was made for the inventory, cataloguing and description of pieces of art and architecture. The following sample description in the initial format contains the main object block and two blocks for index entries, one for the texts contained in the manuscript and the other for the mention of Bobbio monastery. It ends with the reference to the catalogue from which the description has been taken.
blk= obj
5000= 90237684,T
bezsoz= Verwaltung
4564= Wolfenbüttel
4600= Herzog August Bibliothek
4650= 64 Weissenburg
4590= Öffentliche Bibliothek
4604= HAB
5230= Handschrift
[...]
blk= t2
5001= 90237685,KRZ
5230= Registereintrag
[...]
1200= Vetus Latina
1210= Paulus apostolus
1220= Ulfilas
1200gi= Vetus Latina
1210gi= Paulus apostolus
1226gi= Ulfilas
[...]
blk= t2
5001= 90237691,KRZ
bezsoz= Erwähnung
4564= Bobbio
4600= Kloster Bobbio
5230= Registereintrag & Text
8450= Katalogreproduktion
8540= HSK0078_b204
[...]
1200gi= Provenienz I
1210gi= Bobbio
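The listing above can be read mechanically. The following is a minimal sketch only, assuming (from the sample alone, not from any HiDA documentation) that every line has the form "key= value" and that each "blk=" line opens a new block. The function name parse_hida3 is hypothetical.

```python
from typing import Dict, List


def parse_hida3(text: str) -> List[Dict[str, object]]:
    """Split a 'key= value' listing into blocks; each 'blk=' line opens a new block."""
    blocks: List[Dict[str, object]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "[...]":
            continue  # skip blank lines and elision markers
        key, sep, value = line.partition("=")  # split at the first '=' only
        if not sep:
            continue  # not a field line
        key, value = key.strip(), value.strip()
        if key == "blk":
            blocks.append({"type": value, "fields": []})
        elif blocks:
            blocks[-1]["fields"].append((key, value))
    return blocks


SAMPLE = """\
blk= obj
5000= 90237684,T
bezsoz= Verwaltung
4564= Wolfenbüttel
4600= Herzog August Bibliothek
4650= 64 Weissenburg
[...]
blk= t2
5001= 90237685,KRZ
1200gi= Vetus Latina
1210gi= Paulus apostolus
"""

blocks = parse_hida3(SAMPLE)
for block in blocks:
    print(block["type"], block["fields"])
```

The sketch keeps the fields of a block in a flat list. It does not reconstruct the hierarchy below grouping fields such as bezsoz, because the plain listing does not mark where such a group ends.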
Textual hierarchies: The OHCO-thesis
MIDAS and, as its name suggests, HiDA organise and store information in hierarchies. It seemed perfectly reasonable to organise information about art and architecture in a hierarchical way: an altar screen can be described as a whole, after that each wing, the predella, the paintings on each of these parts in greater detail, etc. The same holds true for architecture: a chateau like the one in Fontainebleau forms a whole, might have wings, each wing different floors, each floor several rooms, etc.
Turning to texts, the structure of MIDAS and the way manuscript descriptions are stored in it immediately recall the OHCO thesis (=Ordered Hierarchy of Content Objects). According to this thesis, a text is made up of content objects that are ordered in hierarchies (cf. Coombs et al. 1987). Thus, when manuscript data had to be included in the system, the proposed structure was oriented towards the physical object, understood as a cultural heritage artefact and an object of art-historical research. For a physical object, the parts of the description are first of all the description of the manuscript as a whole, then of the binding, of the book block and of any composite parts. The description of the book block, which is considered to be the text-bearing object, contains the texts, which may be divided into divisions as needed.
Sadly, Renear and others showed as early as 1993 that the OHCO thesis in its simplest form does not explain how to structure a text through markup, let alone in a hierarchical database (cf. Renear 1993).
A further problem is that the object-centredness of MIDAS makes it difficult to build a MIDAS-TEI crosswalk. Within MIDAS, the description of a text that runs from the binding into the book block has to be split up according to the physical structure: what is on the binding has to be described and stored in the descriptive block of the binding, and what is in the book block has to be described and stored in the descriptive block of the book block. The TEI approach runs counter to this, as the text is at the centre of attention, even in the msDescription module.
Third, the manuscript database was meant not only to contain the index information of formerly printed catalogues; cataloguing was also supposed to take place directly in it. It does not take much imagination to realise that the data-centredness of the system and its structure were the main reason why this goal was not met. As it was not possible to enter full texts into the database but merely to fill index fields, interest in adding new descriptions directly to the system has been very low.
The new version: HiDA4
With this in mind, a new generation of the software has been under development since 2004: HiDA4. The background to this development was the insight that a) manuscript catalogues will continue to be published in print and b) full-text searches in the descriptions, combined with a few index-based searches, would be sufficient for most purposes. For these reasons the MIDAS format has been adapted to include some full-text fields that contain text together with its visual properties in RTF format. In addition, it is possible either to link from portions of the full text to an index field or to enter information directly in the index field. The full-text fields serve print preprocessing, while the index fields are useful for searches. Such full-text fields exist for the main content blocks of a manuscript description, which are
- the shelfmark
- the manuscript title
- the Schlagzeile, containing information about material, extent and size, date and place of origin
- the physical description
- the history
- the bibliography
- the contents
- the composite parts, in case the manuscript is a composite one
The flat XML of HiDA4
The distinction between full-text fields, which contain text with layout prepared for printing, and index fields, which contain standardised information for searching, allows for exactly these two applications: print and search. With only these two use cases in mind, one might be perfectly satisfied with the solution provided by the HiDA/MIDAS system. On the other hand, one might criticise the XML that is offered as one possible output format: it is XML only in a technical sense, and its structure does not represent the complexity of the data it contains.
An example: the content of ManuMed shown above in the HiDA3 format would look like this in HiDA4:
<h1:DocumentSet xmlns:h1="http://www.startext.de/HiDA/DefService/XMLSchema">
  <h1:ContentInfo>
    <h1:Format>HIDA-DOC1-XML</h1:Format>
  </h1:ContentInfo>
  <h1:Document DocKey="obj 90237684,T">
    <h1:Block Type="obj">
      <h1:Field Type="5000" Value="90237684,T"/>
      <h1:Field Type="bezsoz" Value="Verwaltung">
        <h1:Field Type="4564" Value="Wolfenbüttel"/>
        <h1:Field Type="4600" Value="Herzog August Bibliothek"/>
        <h1:Field Type="4650" Value="64 Weissenburg"/>
        <h1:Field Type="4590" Value="öffentliche Bibliothek"/>
        <h1:Field Type="4604" Value="HAB"/>
      </h1:Field>
      <h1:Field Type="5230" Value="Handschrift"/>
      [...]
      <h1:Block Type="t2">
        <h1:Field Type="5001" Value="90237685,KRZ"/>
        <h1:Field Type="5230" Value="Registereintrag"/>
        [...]
        <h1:Field Type="1200" Value="Vetus Latina"/>
        <h1:Field Type="1210" Value="Paulus apostolus"/>
        <h1:Field Type="1220" Value="Ulfilas"/>
        <h1:Field Type="1200gi" Value="Vetus Latina"/>
        <h1:Field Type="1210gi" Value="Paulus apostolus"/>
        <h1:Field Type="1226gi" Value="Ulfilas"/>
        [...]
      <h1:Block Type="t2" FieldsCount="15">
        <h1:Field Type="5001" Value="90237688,KRZ"/>
        <h1:Field Type="5230" Value="Registereintrag &amp;Handschrift"/>
        <h1:Field Type="5130" Value="Bobbio"/>
        <h1:Field Type="8450" Value="Katalogreproduktion"/>
        <h1:Field Type="8540" Value="HSK0078_b204"/>
        </h1:Field>
        [...]
        <h1:Field Type="1200gi" Value="Provenienz I"/>
        <h1:Field Type="1210gi" Value="Bobbio"/>
The following, abbreviated DTD shows the underlying structure of the document which is quite simple:
<!ELEMENT h1:DocumentSet (#PCDATA | h1:ContentInfo | h1:Document)*>
<!ELEMENT h1:Document (#PCDATA | h1:Block)*>
<!ELEMENT h1:Block (#PCDATA | h1:Field | h1:Block)*>
<!ELEMENT h1:Field (#PCDATA | h1:Field)*>
<!ATTLIST h1:Field Type CDATA #IMPLIED>
<!ATTLIST h1:Field Value CDATA #IMPLIED>
This DTD does not at all reflect the complexity of the data. The real structure is instead hidden in a proprietary definition file, called Definitionsdatei, which is a kind of schema language of its own. At least the Definitionsdatei can be changed by users to meet their needs, which makes the system quite flexible but still presupposes knowledge of the intrinsic structure of the Definitionsdatei.
A conversion of the Definitionsdatei into e.g. XML Schema would be possible but is not provided by the vendor. Because of that, the validation of data outside the provided software HiDA is not possible, or at least not wanted. Worst of all, because all data are stored in attributes, no substructures inside the data are possible. Including RTF markup is possible only because it is not expressed in XML but uses curly brackets. In any case, the RTF can be exported as such or via RichXML.
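Although validation against the Definitionsdatei is not possible outside HiDA, the generic Block/Field structure can still be read with any XML parser. The following minimal sketch uses Python's standard library on a small, well-formed excerpt modelled on the example above. The helper name walk_fields is hypothetical, and the meaning of the field numbers is deliberately left uninterpreted.

```python
import xml.etree.ElementTree as ET

NS = {"h1": "http://www.startext.de/HiDA/DefService/XMLSchema"}

# Well-formed excerpt modelled on the example in this paper (elisions removed).
SAMPLE = """<h1:DocumentSet xmlns:h1="http://www.startext.de/HiDA/DefService/XMLSchema">
  <h1:Document DocKey="obj 90237684,T">
    <h1:Block Type="obj">
      <h1:Field Type="5000" Value="90237684,T"/>
      <h1:Field Type="bezsoz" Value="Verwaltung">
        <h1:Field Type="4564" Value="Wolfenbüttel"/>
        <h1:Field Type="4600" Value="Herzog August Bibliothek"/>
      </h1:Field>
      <h1:Field Type="5230" Value="Handschrift"/>
    </h1:Block>
  </h1:Document>
</h1:DocumentSet>"""


def walk_fields(parent, depth=0):
    """Yield (depth, type, value) for every h1:Field below parent, in document order."""
    for field in parent.findall("h1:Field", NS):
        yield depth, field.get("Type"), field.get("Value")
        yield from walk_fields(field, depth + 1)


root = ET.fromstring(SAMPLE)
rows = []
for block in root.iterfind(".//h1:Block", NS):
    rows.extend(walk_fields(block))

for depth, field_type, value in rows:
    print("  " * depth + f"{field_type}= {value}")
```

The only structure such a reader can recover is the nesting of Field elements. Which field types may occur where remains knowledge held in the Definitionsdatei.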
The state-of-the-art HiDA-TEI crosswalk
The export of data from HiDA/MIDAS, and thus from ManuMed, in XML is possible. Via a crosswalk implemented in XSLT scripts maintained by the Herzog August Bibliothek Wolfenbüttel, it is possible to convert HiDA data into TEI P5 data and vice versa. Now that this is possible, one still has to look at the assumptions behind the conversion and examine its output. If a manuscript description is stored in HiDA/MIDAS format and exported to TEI, the result will at certain points be a very flat structure, e.g. only a series of paragraphs instead of distinct elements such as <origin>, <provenance> or <acquisition>. Even if many links from a full text into index fields have been set, it will be close to impossible to generate a deeply structured text of the kind that <physDesc>, for example, normally is.
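The flatness of the result can be illustrated with a minimal sketch. It is not the XSLT crosswalk maintained in Wolfenbüttel but a simplified Python stand-in, and it assumes that a full-text field arrives as plain text with blank lines between paragraphs. The function name fulltext_to_tei and the sample text are hypothetical, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def fulltext_to_tei(element_name: str, text: str) -> ET.Element:
    """Wrap each blank-line-separated paragraph of a full-text field in a <p> element."""
    elem = ET.Element(element_name)
    for para in text.split("\n\n"):
        para = " ".join(para.split())  # collapse line breaks and repeated spaces
        if para:
            ET.SubElement(elem, "p").text = para
    return elem


# Illustrative placeholder text, not taken from a real description.
history_text = "Statement on the origin as running text.\n\nStatement on the provenance\nas running text."
history = fulltext_to_tei("history", history_text)
print(ET.tostring(history, encoding="unicode"))
```

Nothing in the running text tells the converter which paragraph concerns origin, provenance or acquisition, so the output can only be a <history> holding a series of <p> elements.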
The conversion from TEI to HiDA/MIDAS, on the other hand, results in much better documents with respect to the target structure. As the target structure HiDA/MIDAS has only plain text plus some filled index fields, one has to make sure that
- entire TEI elements such as <msContents>, <msItem>, <physDesc> or <history> will be delivered with appropriate spacing into the output file
- contents that should be available within index fields for searching will be written in the right places.
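The two requirements above can be sketched as follows. This is a hedged illustration in Python, not the actual XSLT crosswalk: the function names are hypothetical, and the mapping of <settlement>, <repository> and <idno> to the field numbers 4564, 4600 and 4650 is an assumption read off the sample record in this paper.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Assumed mapping from msIdentifier children to MIDAS field numbers.
INDEX_MAP = {"settlement": "4564", "repository": "4600", "idno": "4650"}

SAMPLE = """<msDesc xmlns="http://www.tei-c.org/ns/1.0">
  <msIdentifier>
    <settlement>Wolfenbüttel</settlement>
    <repository>Herzog August Bibliothek</repository>
    <idno>64 Weissenburg</idno>
  </msIdentifier>
  <history>
    <provenance>Provenienz I:
      <placeName>Bobbio</placeName></provenance>
  </history>
</msDesc>"""


def plain_text(elem):
    """Concatenate all text below elem and collapse whitespace runs to single spaces."""
    return " ".join("".join(elem.itertext()).split())


def tei_to_flat(ms_desc):
    """Return (full-text fields, index fields) for one msDesc element."""
    fulltext = {}
    for name in ("msContents", "physDesc", "history"):
        elem = ms_desc.find(TEI + name)
        if elem is not None:
            fulltext[name] = plain_text(elem)
    index = []
    ident = ms_desc.find(TEI + "msIdentifier")
    if ident is not None:
        for child in ident:
            local = child.tag.replace(TEI, "")
            if local in INDEX_MAP:
                index.append((INDEX_MAP[local], plain_text(child)))
    return fulltext, index


fulltext, index = tei_to_flat(ET.fromstring(SAMPLE))
print(fulltext)
print(index)
```

The whitespace handling here is deliberately simple. Adjacent elements with no whitespace between them would run together, so a fuller crosswalk would have to insert separators per element type.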
AMREMM
AMREMM (Descriptive Cataloging of Ancient, Medieval, Renaissance, and Early Modern Manuscripts) is an application of MARC to medieval manuscript cataloguing. Its development goes back to the mid-1990s, when the groundwork was laid in a Mellon-funded project. The description of the standard was published only in 2003 (cf. Pass 2003). As it is a relatively young standard, only a few libraries are known to use it: fewer than a handful of libraries in the U.S. are active users, and the University Library of Basel may become one in the near future.
If we consider the two formats HiDA/MIDAS and TEI as extreme positions, with HiDA/MIDAS being an extremely flat structure with some highlighted, i.e. linked, information on one side and TEI with its extremely deep structures on the other, AMREMM can be placed somewhere in between.
Because AMREMM is, to a certain degree, a data-centric format, I would group it with HiDA/MIDAS at one end of the spectrum of possible encoding methods. Unlike the latter, however, AMREMM offers many repeatable fields whose content is often prefixed by a headword such as collation, layout or origin, which distinguishes the fields from each other. In addition, sub-fields can be established within the fields. This does not make for smoothly running text, but it is helpful for both grouping and separating bits of information. It is this option to realise substructures that leads me to characterise AMREMM as also being a bit like TEI.
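How headword-prefixed fields could be routed into TEI can be sketched as follows. No AMREMM-TEI crosswalk exists yet, so this is an illustration under stated assumptions: the notes are modelled as plain strings of the form "Headword: text", MARC tags and sub-fields are left out, the function names are hypothetical, and the sample notes are placeholders. The target element names are those of the TEI msDescription module. The TEI namespace and the msIdentifier are omitted for brevity, and the output is not validated.

```python
import xml.etree.ElementTree as ET

# Assumed routing from a headword to a path of TEI msDescription elements.
ROUTES = {
    "collation": ("physDesc", "objectDesc", "supportDesc", "collation"),
    "layout": ("physDesc", "objectDesc", "layoutDesc", "layout"),
    "origin": ("history", "origin"),
}


def ensure_path(root, path):
    """Return the element at path below root, creating missing elements on the way."""
    node = root
    for name in path:
        child = node.find(name)
        if child is None:
            child = ET.SubElement(node, name)
        node = child
    return node


def notes_to_tei(notes):
    """Build an msDesc from headword-prefixed notes; return it with the unrouted notes."""
    ms_desc = ET.Element("msDesc")
    leftover = []
    for note in notes:
        headword, sep, body = note.partition(":")
        path = ROUTES.get(headword.strip().lower()) if sep else None
        if path is None:
            leftover.append(note)  # no known headword: left for manual treatment
            continue
        parent = ensure_path(ms_desc, path[:-1])
        # Fields are repeatable, so every note gets its own target element.
        ET.SubElement(parent, path[-1]).text = body.strip()
    return ms_desc, leftover


notes = [
    "Collation: placeholder collation statement",
    "Layout: placeholder layout statement",
    "Origin: placeholder origin statement",
    "Note without a headword",
]
ms_desc, leftover = notes_to_tei(notes)
print(ET.tostring(ms_desc, encoding="unicode"))
print(leftover)
```

The headword already carries the information that a TEI element name would express, which is why a mapping in this direction is comparatively straightforward.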
A crosswalk between AMREMM and TEI has not yet been established. However, as AMREMM might be used in Basel, such a crosswalk might be written within the e-codices project in the near future. One can expect such a crosswalk to work well in both directions.
TEI as interchange format
Having examined HiDA/MIDAS and AMREMM, both prove to be mere bibliographical standards that serve certain needs, with all their pros and cons:
- Their flat structures guarantee a much gentler learning curve.
- The supply of index fields offers possibilities for powerful searches on regularised and normalised data.
- With the possibility to incorporate RTF markup, HiDA/MIDAS allows for print out of the box.
- Neither standard makes full use of the capabilities of XML. Neither uses nested structures to the depth that TEI does.
In a broader sense, TEI is not very useful as an interchange format. Both other formats have shallower structures than TEI. As explained in detail, much of the structural information of a TEI document is lost when it is transformed into HiDA/MIDAS. The same holds true, though to a lesser extent, for TEI documents transformed into AMREMM. In the other direction, TEI provides all the fields needed to store the information, including the structural information, of the other formats. To conclude, TEI is better used as a source format than as a target format. Still, it can be used as such.