Deutsches Textarchiv (The German Text Archive)
- Host: Berlin-Brandenburg Academy of Sciences and Humanities
- Other institutions involved: Funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG)
- URL:
- Email:
- Main language: German
- TEI Encoding Guidelines used: XML/TEI-P5 subset DTA “Base Format” (DTABf) (cf. Implementation description below)
General description: The DFG-funded project Deutsches Textarchiv (DTA) started in 2007 and is located at the Berlin-Brandenburg Academy of Sciences and Humanities (Berlin-Brandenburgische Akademie der Wissenschaften, BBAW). Its goal is to digitize a large cross-section of printed works in New High German, ranging from ca. 1600 to 1900. Images and electronic full text are available online; the full text can be downloaded as HTML, XML, TCF or plain text. The DTA presents almost exclusively the first editions of the respective works. As of April 2016, 2422 texts dating from 1600–1900 are online and over 400 more are being prepared for publication, comprising a total of more than 650,000 digitized pages with around 1.1 billion characters and roughly 157 million tokens.
The majority of the DTA’s texts are transcribed by non-native speakers using the double-keying method (vendors guarantee a character accuracy of 99.9% or higher). The DTA provides linguistic applications for its corpus, i.e. tokenization, lemmatization, lemma-based and phonetic search, and rewrite rules for historical spelling.
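In double keying, two typists transcribe the same page independently and the two results are compared; places where they disagree are then checked by hand. The following is a minimal sketch of that comparison step only. The function name and the sample strings are hypothetical, and it does not describe the vendors' actual workflow, which the project description does not specify.

```python
from difflib import SequenceMatcher


def keying_disagreements(first, second):
    """Return the spans where two independent transcriptions differ.

    Each result is a tuple (operation, text_in_first, text_in_second),
    where operation is 'replace', 'delete' or 'insert'.
    """
    # autojunk=False keeps frequent characters from being ignored on long texts.
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (tag, first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented sample: the two keyings disagree on a single character.
print(keying_disagreements("Haus", "Hans"))
```

Only the flagged spans need manual review, which is why the method can reach a high character accuracy even when the typists do not read the language.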
All DTA texts are freely available for download in different formats: the original XML/TEI texts, an HTML-rendered version, two different kinds of TCF versions, and the raw text transcription. Moreover, CMDI metadata comprising TEI header information may be harvested via OAI-PMH.
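As a rough illustration of OAI-PMH harvesting, the sketch below builds request URLs for the protocol's standard ListRecords verb. The base URL is a placeholder and the metadata prefix "cmdi" is an assumption; neither is taken from the DTA documentation, so both would need to be replaced with the values the repository actually advertises (for example via the ListMetadataFormats verb).

```python
from urllib.parse import urlencode

# Placeholder endpoint, NOT the real DTA address.
BASE_URL = "https://example.org/oai"


def list_records_url(base_url, metadata_prefix="cmdi", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    The protocol makes resumptionToken an exclusive argument: when a
    token is given, metadataPrefix must not be sent again.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


print(list_records_url(BASE_URL))
```

A harvester would fetch the first URL, read the resumptionToken element from the response if one is present, and repeat the request with that token until no token is returned.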
Implementation description: Each text in the DTA is encoded using the XML/TEI-P5 format. The markup describes text structures (headlines, paragraphs, speakers, poem lines, index items etc.), as well as the physical layout of the text down to the position of each character on a page. The text annotation follows the DTA “Base Format” (DTABf), a customization of the TEI P5 Guidelines. The DTABf consists of about 80 TEI P5 text elements which are needed for the basic formal and semantic structuring of the DTA reference corpus. The purpose of developing the DTABf was to achieve coherence at the annotation level, given the heterogeneity of the DTA text material over time (1600–1900) and across text types (fiction, functional and scientific texts). Further, frequently updated information on the DTABf is available here: (description), (overview: table elements within text).
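To give an impression of how such structural markup can be processed, the sketch below parses a small TEI fragment with Python's standard library and counts its elements. The fragment is invented for illustration and uses only common TEI P5 elements (head, sp, speaker, l, p); it is not taken from the DTA and has not been validated against the DTABf schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented sample fragment, not an actual DTA text.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <head>Erster Aufzug</head>
        <sp>
          <speaker>Sprecher</speaker>
          <l>Erste Zeile</l>
          <l>Zweite Zeile</l>
        </sp>
        <p>Ein Absatz.</p>
      </div>
    </body>
  </text>
</TEI>"""


def count_elements(xml_text):
    """Count elements by local name, ignoring the TEI namespace prefix."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        local_name = element.tag.split("}", 1)[-1]
        counts[local_name] += 1
    return counts


def verse_lines(xml_text):
    """Return the text of every verse line (TEI element l) in document order."""
    root = ET.fromstring(xml_text)
    return [line.text for line in root.iter(f"{{{TEI_NS}}}l")]


print(count_elements(SAMPLE))
print(verse_lines(SAMPLE))
```

Because every text in a corpus with a restricted element set uses the same tags for the same structures, a query like verse_lines can in principle be run unchanged across texts of different periods and genres.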
Access: Open access / CC BY-NC
