The German Text Archive
(Christiane Fritze)
The history of the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) reaches back to 1700, when the Kurfürstlich-Brandenburgische Societät der Wissenschaften was established by Gottfried Wilhelm Leibniz. One of the purposes of the Academy is to promote cultural heritage. It is therefore not surprising that the Academy supports the “Deutsches Textarchiv” (DTA – German Text Archive), a project funded by the German Research Foundation.
The aim of the DTA is to establish a core corpus of the most important works in the German language, dating from the beginning of letterpress printing to the present. The first project phase, from July 2007 to June 2010, focuses on digitising about 750 texts (circa 200,000–250,000 pages) published between 1780 and 1900. For texts from the 20th century we will cooperate with the project group for the Digital Dictionary of the German Language (DWDS), a linguistically annotated TEI P5 XML corpus of one billion tokens.
The selection of the 750 works is based on recommendations from three groups: first, the editors of the famous “Deutsches Wörterbuch” by Jacob Grimm and Wilhelm Grimm; second, the Academy members; and third, the members of the “Arbeitsgemeinschaft Sammlung Deutscher Drucke” consortium (six libraries collaborating to build a comprehensive collection of printed literature published in German-speaking countries).
The corpus therefore covers the most important German texts from a range of genres, such as fiction, science, technology, medicine, philosophy and law. We digitise first editions almost exclusively.
The texts will be richly structured in a customisation of the TEI P5 XML format that allows word-by-word linking between the page images and the full text, in both directions. For most of the corpus texts, this linking will be based on coordinates encoded for each character. The texts will then be run through computational linguistic routines such as tokenisation, morphological analysis, part-of-speech tagging and orthographic mapping.
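The image-to-text linking described above can be illustrated with a minimal sketch. It does not show the DTA customisation itself. It assumes the standard TEI P5 facsimile mechanism (`<zone>` elements with `ulx`, `uly`, `lrx`, `lry` attributes, referenced from the text via `facs`), and it links whole words rather than single characters to keep the example short. The sample document, the file name `page_0001.png` and the function names `build_index` and `word_at` are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
NS = {"tei": TEI_NS}

# Hypothetical sample: one page image with two zones, two linked words.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface xml:id="p1" ulx="0" uly="0" lrx="1200" lry="1800">
      <graphic url="page_0001.png"/>
      <zone xml:id="z1" ulx="100" uly="200" lrx="180" lry="240"/>
      <zone xml:id="z2" ulx="190" uly="200" lrx="300" lry="240"/>
    </surface>
  </facsimile>
  <text><body><p>
    <w facs="#z1">Der</w> <w facs="#z2">Anfang</w>
  </p></body></text>
</TEI>"""


def build_index(tei_xml):
    """Return one dict per <w>: its text, page image and bounding box."""
    root = ET.fromstring(tei_xml)
    zones = {}
    for surface in root.iterfind(".//tei:surface", NS):
        graphic = surface.find("tei:graphic", NS)
        image = graphic.get("url") if graphic is not None else None
        for zone in surface.iterfind("tei:zone", NS):
            box = tuple(int(zone.get(k)) for k in ("ulx", "uly", "lrx", "lry"))
            zones[zone.get(f"{{{XML_NS}}}id")] = (image, box)
    words = []
    for w in root.iterfind(".//tei:w", NS):
        # facs holds a pointer such as "#z1"; strip the leading "#".
        image, box = zones[w.get("facs", "").lstrip("#")]
        words.append({"text": w.text, "image": image, "box": box})
    return words


def word_at(words, image, x, y):
    """Image -> text direction: find the word whose box contains (x, y)."""
    for w in words:
        ulx, uly, lrx, lry = w["box"]
        if w["image"] == image and ulx <= x <= lrx and uly <= y <= lry:
            return w["text"]
    return None


words = build_index(SAMPLE)
print(words[0])                                   # text -> image direction
print(word_at(words, "page_0001.png", 200, 220))  # image -> text direction
```

The same index serves both directions: each entry gives the image region for a word, and `word_at` resolves a point on the page image back to the word printed there.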
In addition, metadata records in formats such as METS or Dublin Core will be derived from the TEI P5 XML files. These records serve two purposes: they allow data interchange with the libraries whose books we digitise, and they are required by the digital scholarly portals in Germany, which are potential future cooperation partners.
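As a rough illustration of such a derivation, the following sketch maps a few TEI header fields to Dublin Core elements. The mapping table, the function name `tei_header_to_dc`, the `metadata` wrapper element and the placeholder header values are assumptions made for this example. They do not represent the project's actual crosswalk, and METS output would need a considerably richer structure.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Placeholder header; the values are invented for illustration only.
HEADER = """<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt>
      <title>Beispieltitel</title>
      <author>Beispiel, Autor</author>
    </titleStmt>
    <publicationStmt><publisher>Beispielverlag</publisher></publicationStmt>
    <sourceDesc><bibl><date when="1800">1800</date></bibl></sourceDesc>
  </fileDesc>
</teiHeader>"""

# Hypothetical crosswalk: Dublin Core element name -> path in the TEI header.
MAPPING = [
    ("title", ".//tei:titleStmt/tei:title"),
    ("creator", ".//tei:titleStmt/tei:author"),
    ("publisher", ".//tei:publicationStmt/tei:publisher"),
    ("date", ".//tei:sourceDesc/tei:bibl/tei:date"),
]


def tei_header_to_dc(header_xml):
    """Build a flat Dublin Core record from a TEI header string."""
    header = ET.fromstring(header_xml)
    record = ET.Element("metadata")
    for dc_name, path in MAPPING:
        for node in header.iterfind(path, TEI):
            value = (node.text or "").strip()
            if value:  # skip empty header fields
                ET.SubElement(record, f"{{{DC_NS}}}{dc_name}").text = value
    return record


record = tei_header_to_dc(HEADER)
print(ET.tostring(record, encoding="unicode"))
```

Keeping the crosswalk in a table rather than in code makes it straightforward to add further fields or to maintain a second table for another target format.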
Another aim of the DTA is to be an “active archive” in which registered third-party users can add their own annotations. We also plan to allow the corpus to be extended with further texts, provided they pass a quality control mechanism.
The texts will be available for download in different formats as long as they do not fall under any copyright restrictions. All the software developed to support the workflow will be published under an open source licence.
Finally, we will give an overview of the project’s workflow, the tools that support pre-tagging and OCR correction, and some technical details of the DTA online framework, including its phonetic and linguistic search facilities.