Layer by Layer – Project group develops new method to enable a complex analysis of social scientific texts

Potsdam computational linguists are developing instruments to enable the linguistic analysis of millions of newspaper articles. Picture: F. Betz / pixelio

Modern information and communication technologies have been changing all spheres of society and have long been indispensable at German universities. They have recently become important in projects where humanities scholars and social scientists collaborate with scholars of IT-related subjects to develop new research approaches. This field is called “eHumanities”. Part of this research field is a group project that involves computational linguists from the University of Potsdam. Together with researchers in Stuttgart and Hildesheim, they are developing tools and methods that enable political scientists to search for and analyze identity discourses about wars and humanitarian military interventions between 1990 and 2011 in large text corpora – on a completely new level. 

How do international actors, like NATO, the United Nations, or heads of state, mobilize collective identities in crisis situations? In order to garner support for their own positions, do they resort to pitting ethnic, religious, cultural, European, and transatlantic commitments against each other, or do they refrain from doing so? What are the causes and effects of such identity politics? Science has yet to answer these questions sufficiently. Researchers at the universities in Stuttgart, Potsdam, and Hildesheim have therefore been investigating international debates on war and humanitarian military interventions since the end of the Cold War in the press of several European countries (Germany, Austria, Ireland, France, Great Britain) and the USA. The Federal Ministry of Education and Research is funding the Potsdam team alone with about €140,000 through spring 2015.

The researchers are examining about one million newspaper articles from print media such as the Frankfurter Allgemeine Zeitung and the Washington Post. Political scientists hope this will allow them to study specific issues in their fields more comprehensively and with semi-automated methods. They are currently analyzing whether and how such identity issues are reflected in the selected print media, whether they change over longer periods, and which mechanisms form identities. To cope with the complexity of the relevant indicators and the size of the corpora, however, they need suitable language technology tools.

The project group hopes to develop these very tools. “We are breaking new ground,” says Prof. Manfred Stede, computational linguist at the University of Potsdam. “‘New ground’ because we want to develop a transparent Complex Concept Builder – CCB – that can be used individually to operationalize complex technical terms and apply them to everyday language texts in an interactive procedure. This has not been done before.” The CCB will integrate tools that enable political scientists to analyze relations as well as the assessments of speakers. It will be complemented by the “Exploration Workbench”, which will collate – that is, harmonize – the disparate texts to ultimately make them comparable and machine-readable. From sources with various data formats, it creates consistent articles with clearly distinguishable headlines, teasers, and bodies. From the outset, it has been important to the researchers to develop an analytical tool that everyone in the social science community can use to examine large corpora.
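The harmonization step described above can be illustrated with a minimal Python sketch. It is not the project's Exploration Workbench; the field aliases (`title`, `lead`, `text`, and so on) and the names `Article` and `harmonize` are assumptions made up for this illustration. The sketch only shows the general idea of mapping differently structured source records onto one schema with a headline, a teaser, and a body.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Common schema: every harmonized article has these three parts."""
    headline: str
    teaser: str
    body: str


# Hypothetical mapping from source-specific field names to the common schema.
FIELD_ALIASES = {
    "headline": ("headline", "title", "hl"),
    "teaser": ("teaser", "lead", "abstract"),
    "body": ("body", "text", "content"),
}


def harmonize(record: dict) -> Article:
    """Map one source record onto the common Article schema."""
    values = {}
    for target, aliases in FIELD_ALIASES.items():
        # Take the first non-empty alias; fall back to an empty string.
        raw = next((record[a] for a in aliases if record.get(a)), "")
        # Collapse runs of whitespace and line breaks into single spaces.
        values[target] = " ".join(str(raw).split())
    return Article(**values)


example = harmonize(
    {"title": "  NATO  debates\n intervention ", "lead": "A summary.", "text": "Full text."}
)
```

Once every source is mapped this way, articles from different newspapers and archives can be compared field by field, whatever their original layout was.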

The researchers have made good progress on the Exploration Workbench, and political scientists from Stuttgart are already working manually with many tools from the modules developed so far. All annotations by the political scientists can be made with the CCB. Articles are classified according to topic and genre. Nevertheless, a lot still has to be done. “We have to continue to push ahead with the CCB,” Stede underlines. This is a huge task because the system is expected to react to queries that do not merely contain individual words but describe concepts. It should, for example, provide texts in which “a head of government in a Middle Eastern country announces he will not be taking part in a conflict in the Arab region.” It is then supposed to automatically “spell out” the individual levels of the sentence, including the purely linguistic ones: potential heads of state, topic, conflict, or the concrete type of statement or reluctance by a head of state. For the researchers this means that they have to model a lot of knowledge and also store lexical relations. They are still working intensively on this.
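The difference between a keyword query and a concept query can be sketched in a few lines of Python. This is an illustrative toy, not the CCB: the `Concept` class, the slot names, and the word lists are all hypothetical. A concept is modeled as a set of slots (actor, type of statement, topic), each with stored lexical alternatives, and a sentence matches only if every slot is filled.

```python
import re
from dataclasses import dataclass


@dataclass
class Concept:
    """A concept as named slots, each with a set of lexical alternatives."""
    name: str
    slots: dict  # slot name -> set of lowercase terms (hypothetical lexicon)


def match_concept(concept: Concept, sentence: str):
    """Return the matched terms per slot, or None if any slot stays empty."""
    text = sentence.lower()
    found = {}
    for slot, terms in concept.slots.items():
        # Whole-word (or whole-phrase) matches only.
        hits = sorted(
            t for t in terms if re.search(r"\b" + re.escape(t) + r"\b", text)
        )
        if not hits:
            return None
        found[slot] = hits
    return found


refusal = Concept(
    name="refusal_to_intervene",
    slots={
        "actor": {"prime minister", "president", "king"},
        "statement": {"will not", "refuses", "rules out"},
        "topic": {"conflict", "war", "intervention"},
    },
)

hit = match_concept(
    refusal,
    "The prime minister announced that his country will not join the conflict.",
)
miss = match_concept(refusal, "The president visited Paris.")
```

A real system would need far more than word lists, for example syntactic analysis to check that the actor is actually the one making the statement, which is part of why the article calls this a huge task.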

The success of the project will also depend on how well Stede and his PhD student Jonathan Sonntag master their specific task: they are going to develop a tool that automatically analyzes sentences and sentiments, i.e. a tool that, to return to the previous example, recognizes whether an Arab head of state refuses to participate in a conflict. This work is extremely tedious and open-ended because computational linguistics has yet to satisfactorily resolve how to distinguish sentiments and opinions from objective statements in texts. “Other analytical levels, like syntactic relations and coreference, are used to calculate such sentiment relations and express principles,” says Sonntag, explaining his approach. In his dissertation he analyzes sentences like “The Swiss confederates cannot help rubbing this fact in the faces of foreign countries at times.” Does “rubbing something in someone’s face” always refer to a negative attitude of the author or speaker? Such questions interest him, as does the starting point of sentiment relations. Do they always begin with the subject and refer to the object? “Definitely not,” Sonntag can already say. What does the spectrum of negative expressions look like? How do writers moderate their assessments? The PhD student wants to know all these things in detail.

“The question of subjectivity is indeed extremely important for us at the moment,” Stede underlines. “We have already developed a program that can classify texts accordingly and sort them as news or opinion.” The team looked for adjectives tinged with subjectivity as well as for linguistic means like modal verbs, and found a whole cluster of features that mark the character of texts. Words like “should”, “have to”, and “could” are more likely in commentaries than in other text types; negations like “is not the case” or “did not take place” are rarely used in the news. “We will certainly not be able to give an ultimate answer on how to use the programs to separate objective statements from subjective ones,” Stede says. “Our approach, however, is promising.”
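How surface features such as modal verbs and negations can separate opinion from news can be sketched as follows. This is a deliberately simple rule-based illustration, not the program the Potsdam team developed; the word lists, the function names, and the threshold of 0.05 are assumptions chosen only for the example.

```python
import re

# Hypothetical word lists, loosely based on the examples in the text.
MODALS = {"should", "could", "must", "might", "ought"}
MODAL_PHRASE = re.compile(r"\b(?:have|has) to\b")
NEGATION = re.compile(r"\b(?:not|never|no)\b|n't\b")


def subjectivity_features(text: str) -> dict:
    """Rate of modal expressions and negations per token."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z']+", lowered)
    modal_count = sum(1 for t in tokens if t in MODALS)
    modal_count += len(MODAL_PHRASE.findall(lowered))
    negation_count = len(NEGATION.findall(lowered))
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return {"modal_rate": modal_count / n, "negation_rate": negation_count / n}


def guess_genre(text: str, threshold: float = 0.05) -> str:
    """Label a text 'opinion' if its combined feature rate reaches the threshold."""
    features = subjectivity_features(text)
    score = features["modal_rate"] + features["negation_rate"]
    return "opinion" if score >= threshold else "news"


commentary = "The government should act and must not wait."
report = "The ministers met in Brussels on Monday."
```

In practice such rates would more plausibly serve as input features for a trained classifier, together with the subjective adjectives mentioned above, rather than being compared against a fixed threshold.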

The project as a whole is a real challenge for the researchers from the various disciplines, and it interests Stede in two respects: its technology and its content. It is technically interesting because the team wants to create tools that allow social scientists to search extremely large corpora efficiently. In terms of content it is interesting because the search will be not only semantic but also conceptual. “If we are able to manage both aspects, it will be a major step forward,” the professor says.

The Project

Multiple collective identities in international debates regarding war and peace since the end of the Cold War. Language technology tools and methods for the analysis of multilingual texts in the social sciences.

Project Coordinator: Prof. Cathleen Kantner (University of Stuttgart)
Duration: 2012–2015
Funded by: German Federal Ministry of Education and Research
Website: www.uni-stuttgart.de/soz/ib/forschung/Forschungsprojekte/eIdentity 

The Researcher

Professor Manfred Stede studied computer science and linguistics at the Technische Universität Berlin; in 1996 he earned his PhD in computer science at the University of Toronto. Since 2001 he has been Professor of Applied Computational Linguistics at the University of Potsdam.


Universität Potsdam
Department Linguistik
Karl-Liebknecht-Str. 24–25, 14476 Potsdam 
E-Mail: stede@ling.uni-potsdam.nomorespam.de

Jonathan Sonntag studied computational linguistics at the University of Potsdam and graduated with a diploma in 2012. Since then he has been a research assistant on the eIdentity project.


E-Mail: jonathan.sonntag@yahoo.nomorespam.de 

Text: Petra Görlich
Online-Editing: Agnes Bressa
Contact Us: onlineredaktion@uni-potsdam.nomorespam.de