Those who spoke Early New High German lived in a revolutionary era: With the invention of letterpress printing, writing became a mass phenomenon. Lawyers, clerics, book printers, adventurers, natural scientists – more people were writing than ever before, increasingly in German. The number of text types exploded: weekly papers were being published, merchants were putting contracts down in writing, and private people were writing letters and diaries. At the same time, paper supplanted the significantly more expensive parchment. For a number of years, linguists at the University of Potsdam have been compiling a digital database to establish a representative corpus of this period with linguistic annotations. The “Reference Corpus Early New High German” will soon make it much easier to learn more about people of the early modern period and the language they spoke.
Corpus, annotation, treebank – behind these terms lies a major linguistic project. Potsdam linguists Ulrike Demske, Katrin Goldschmidt, and Marianna Patak are compiling a digital database. “We are creating a comprehensive resource for academics who are researching the historical syntax of the German language,” Demske explains. The project addresses a problem that has faced generations of linguists: before a particular aspect of language history could be studied, tedious data collection had to be done. For instance, Demske wrote her doctoral thesis on the history of modal infinitives, for which she evaluated many texts, including Wolfram von Eschenbach’s verse novel “Parzival”. “This long epic contains only eight such infinitive constructions, a low output considering the length of the text.” Before she could even begin to test her thesis, she meticulously combed through several historical texts until she had compiled enough sentences to support it. “The compilation of my data corpus alone took an immense amount of time,” Demske says.
In the future, the reference corpus will significantly reduce such effort. The University of Potsdam and two other universities are involved in this German Research Foundation-funded project. Researchers at Halle and Bochum are working on transcribing and digitalizing manuscripts and printed texts in Early New High German. By March 2017, some 5 million words and their variations – called “word forms” by experts – will be digitalized as true to the original as possible. The researchers in Halle and Bochum are also adding initial linguistic information to the texts, i.e. assigning information about part of speech to each word and referencing the occurrence of each word to a lexical entry. At Potsdam, about half a million word forms will be tagged with syntactic information. This very time-consuming task will make future linguo-historical research much easier.
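The basic annotation described here, a part of speech and a lexical entry (lemma) attached to each word form, can be pictured with a small sketch. This is only an illustration under stated assumptions: the `Token` class, the `forms_by_lemma` helper, the tag names, and the sample spellings are all hypothetical and are not taken from the project's actual data format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated word form (hypothetical layout, not the project's format)."""
    word_form: str  # spelling as transcribed from the source text
    pos: str        # part-of-speech tag (tag names here are assumptions)
    lemma: str      # lexical entry the word form is linked to


def forms_by_lemma(tokens):
    """Group the attested spellings under their shared lexical entry."""
    grouped = defaultdict(set)
    for token in tokens:
        grouped[token.lemma].add(token.word_form)
    return {lemma: sorted(forms) for lemma, forms in grouped.items()}


# Illustrative sample: variant spellings linked to the same lemma.
sample = [
    Token("vnd", "KON", "und"),
    Token("vnnd", "KON", "und"),
    Token("zeit", "NN", "zeit"),
    Token("zeyt", "NN", "zeit"),
]
variants = forms_by_lemma(sample)
```

Linking every spelling to one lexical entry is what would let a later search for a word return all of its historical variants at once.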
The syntactic annotations are being produced by ten student assistants working in pairs: two assistants annotate the same part of a text independently and later compare their results with the help of a computer program. Research assistants Goldschmidt and Patak then assemble the matched parts into full texts. This “double keying” system improves the reliability of the syntactic annotation by minimizing errors in the annotation process. “The student assistants create tree diagrams for every sentence in the text,” Patak explains. Based on the assignment of the part of speech, the tree diagrams display the inner structure of word groups and sentences and determine the syntactic function of each word group. Given that researchers from all over the world will be accessing the syntactically annotated texts, the student assistants have been entrusted with a very important task. “They take this very seriously; one student told me every sentence was like a little puzzle she wanted to solve,” Patak says.
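The comparison step in this kind of double annotation can be sketched in a few lines. The article does not describe the program the team uses, so the following is only a minimal illustration: the function name, the flat list of (word form, label) pairs, and the label names are assumptions, and real tree annotations are considerably richer than one label per word.

```python
from typing import List, Tuple

# Hypothetical representation: one (word form, label) pair per token.
Annotation = List[Tuple[str, str]]


def compare_annotations(first: Annotation, second: Annotation):
    """Return the agreement rate and the positions where two annotators differ."""
    if [word for word, _ in first] != [word for word, _ in second]:
        raise ValueError("both annotators must label the same word forms")
    disagreements = [
        (index, word, label_a, label_b)
        for index, ((word, label_a), (_, label_b)) in enumerate(zip(first, second))
        if label_a != label_b
    ]
    agreement = 1.0 - len(disagreements) / len(first) if first else 1.0
    return agreement, disagreements


# Illustrative data: the two annotators disagree on the second word only.
annotator_a = [("der", "NK"), ("man", "NK"), ("gieng", "HD")]
annotator_b = [("der", "NK"), ("man", "SB"), ("gieng", "HD")]
agreement, disagreements = compare_annotations(annotator_a, annotator_b)
```

Only the flagged positions would then need to be discussed and resolved, which is what makes independent double annotation practical for long texts.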
Before the student assistants can add linguistic information to individual sentence structures, the respective texts need to be segmented, i.e. manually cut up into sentences. “Historical punctuation differs greatly from its current form,” Goldschmidt says. In those days, the virgule – a slash – was used much more often than the period. Its initial function of marking pauses in speech was gradually taken over by the modern comma. Because periods were rare, texts written in Early New High German cannot be segmented automatically the way modern texts can.
The Potsdam linguists use the program @nnotate, which was developed by a computational linguist in Saarbrücken in the late 1990s. “It is a semi-automatic program,” Demske explains. “The more linguistic information available in the form of annotated texts, the better the program’s suggestions, especially with regard to the part of speech and the structure of simple word groups. The often very lengthy compound sentences of Early New High German, however, have to be annotated manually by the assistants.” The transcribed and annotated texts will ultimately be accessible on the online linguistic database ANNIS, a platform developed by computational and corpus linguists from Potsdam and Berlin.
“When looking at our progress, I sometimes get a bit impatient,” Demske says. She would love nothing more than to syntactically annotate even more texts even faster and to quickly make available a syntactically annotated corpus of several million Early New High German word forms. The project staff at Potsdam has come to accept that each sentence demands time, ultimately saving time for those gathering data from the reference corpus in the future. “Finding texts in Early New High German from 1350-1650 to include in the reference corpus is in itself not a problem,” Demske explains. “Since we intend to set up a structured corpus representative of this historical period of the German language, we try to consider texts from all dialect areas within 50-year spans – which does not work equally well for all dialect areas.” For instance, there are relatively few texts from the Moravian-Bohemian language region, due in part to the very few paper mills and printing shops there. Also, not every text is equally suitable for syntactical annotation: texts in bound speech are not considered since the word order often does not correlate to the everyday spoken language. Other text types such as official documents or court records do not qualify either because they contain much formulaic repetition; “for our purposes, their syntax is not diverse enough.”
Texts from the Early Modern Age are undoubtedly worth reading: “I am fascinated by 16th-century travel accounts – one of my favorite registers,” Demske says. In one of the texts included in the reference corpus, natural scientist and doctor Leonhard Rauwolf describes his journey to the Near East and reports on its bathing culture. Surprisingly, it seems to differ little from that found in today’s wellness spas. Goldschmidt remembers the story of adventurer Hans Staden, who was supposedly captured by a cannibal tribe in South America. “He tried to escape the stake by offering European healing methods to cure the tribal chief from an epidemic – and succeeded.” The three linguists agree, though, that all of the texts are interesting. For instance, “The Red Book of the town Ulm,” an early source, describes the process in which the citizens of Ulm laid down basic rules for living in their town: Who was allowed to get married? Who was allowed to exchange money? What was the proper way to bake bread? By researching their language, the linguists are certainly learning more about the people of this period, their thoughts, and how they lived.
The “Reference Corpus Early New High German” is just a part of the linguistic corpora for historical German. Similar projects have been done in Berlin, Bochum, and Bonn for Old and Middle High German, albeit with less syntactic information than the Potsdam researchers are including in their historical texts. Once the digitalization and annotation of Early New High German texts is complete, German philologists and linguists all over the world will, with the help of ANNIS, be able to follow the development of the German language, at least with regard to selected issues in texts written between the 8th and 17th centuries. Early New High German will be the first historical stage of the language for which syntactic patterns can be searched and found through ANNIS.
Researchers led by Prof. Dr. Ulrike Demske of the University of Potsdam, Hans-Joachim Solms of the University of Halle, and Klaus-Peter Wegera and Stefanie Dipper of the University of Bochum are compiling the “Reference Corpus Early New High German”. They are assembling literary monuments of the Early Modern Age (between 1350 and 1650) in High German, transcribing and digitalizing them, and lemmatizing and annotating them morphologically and syntactically. The selection of texts is guided by the categories of region, time, and form of documentation. The goal of the project is to create a comprehensive, reliable database of Early New High German as used in manuscripts and printed books. The project is being funded by the German Research Foundation (DFG) from 2011 to 2017.
Prof. Dr. Ulrike Demske studied German philology and geography at the Universities of Tübingen and Aix-en-Provence. She received her doctorate at the University of Tübingen in 1993 and habilitated at the University of Jena in 1999. Since 2011, she has been Professor of History and Variation of the German Language at the University of Potsdam.
Institut für Germanistik
Am Neuen Palais 10, 14469 Potsdam
Katrin Goldschmidt studied general and German linguistics, journalism and communication science, and editorial studies at Freie Universität Berlin. At the University of Potsdam, she was a research assistant in the DFG project from October 2012 to December 2015.
Marianna Patak studied Slavic languages and literatures and German linguistics at Humboldt-Universität zu Berlin (BA) as well as linguistics (MA). In August 2015, she joined the DFG project at the University of Potsdam as a research assistant.
Text: Jana Scholz
Published online by: Matthias Zimmermann
Contact for the online editorial team: email@example.com