Week 1: Introduction to Computational Methods in Communication Science
Questions of this course: How can we analyze large amounts of text?
What is computational social science? And why should we care?
Example: surprising sources of information
• In 2009, researchers wanted to study wealth and poverty in Rwanda
• They conducted a survey with a random sample of 1,000 customers of the largest
mobile phone provider
• They collected demographic, social, and economic characteristics (incl. wealth)
• So far, traditional social science, right?
• The authors also had access to complete call records from 1.5 million people
• Combining both data sources, they used the survey data to “train” a machine learning
model to predict a person’s wealth based on their call records
• They also estimated the places of residence based on the geographic information
embedded in call records
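The "train on the survey, predict for everyone" logic above can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the data are simulated, the single predictor (`calls_per_week`) and the linear relationship are hypothetical, and the helper `fit_line` is ours. It is not the model used in the Rwanda study.

```python
import random

random.seed(42)

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x

# Hypothetical population: one call-record feature per person.
# All numbers are synthetic; in practice wealth is unknown for most people.
population = []
for person_id in range(10_000):
    calls_per_week = random.uniform(0, 50)
    wealth = 2.0 * calls_per_week + random.gauss(0, 5)
    population.append({"id": person_id, "calls": calls_per_week, "wealth": wealth})

# Step 1: survey a small random sample; only here is wealth measured.
survey = random.sample(population, 500)

# Step 2: train the model on the surveyed people only.
slope, intercept = fit_line(
    [p["calls"] for p in survey],
    [p["wealth"] for p in survey],
)

# Step 3: predict wealth for everyone from call records alone.
predictions = {p["id"]: slope * p["calls"] + intercept for p in population}

# Because the data are simulated, predictions can be checked against the truth.
mean_abs_error = sum(
    abs(predictions[p["id"]] - p["wealth"]) for p in population
) / len(population)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, MAE={mean_abs_error:.2f}")
```

The key design point is that the expensive measurement (the survey) is needed only for a small sample, while the cheap digital trace is available for the whole population.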
Meaning of computational social science
• Field of social science that uses algorithmic tools and large/unstructured data to
understand human and social behavior
• Complements rather than replaces traditional methodologies: Methods are not the
goal, but contribute to data generation
• Includes methods such as:
▪ Data mining (e.g., scraping and gathering of large data sets)
▪ Software development for social science experiments
▪ Automated text analysis (e.g., sentiment analysis, keyword extraction, dictionary
approaches)
▪ Image classification (e.g., face recognition, visual topic modeling)
▪ Machine learning approaches (e.g., for classification, prediction, topic modeling)
▪ Actor-based modeling (e.g., simulation of social behavior, spreading of
information)
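To make one of the methods listed above concrete, here is a minimal sketch of a dictionary approach to sentiment analysis. The word lists and the function name `sentiment_score` are hypothetical and chosen only for illustration; real studies rely on validated lexicons.

```python
import re

# Hypothetical mini-dictionary, for illustration only.
POSITIVE = {"good", "great", "happy", "love"}
NEGATIVE = {"bad", "terrible", "sad", "hate"}

def sentiment_score(text):
    """Return (#positive - #negative) / #tokens, or 0.0 for empty text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    n_pos = sum(token in POSITIVE for token in tokens)
    n_neg = sum(token in NEGATIVE for token in tokens)
    return (n_pos - n_neg) / len(tokens)

for doc in ["I love this great show", "What a terrible, sad ending", ""]:
    print(repr(doc), sentiment_score(doc))
```

A score above zero means more positive than negative dictionary words were found. Note that such a simple count ignores negation ("not good") and context, which is one reason dictionary results need validation.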
Why is this important now?
• Vast amounts of digitally available data, ranging from social media messages and
other digital traces to web archives and newly digitized newspaper and other
historical archives
• Large-scale records (big data) of persons or businesses are created constantly
• Powerful and comparatively cheap computing power, and easy-to-use infrastructure
for processing these data
• Improved tools to analyze these data, including network analysis methods and
automatic text analysis methods such as supervised text classification, topic
modeling, word embeddings, as well as large language models
10 characteristics of big data
Pros and cons of computational methods
Opportunities
• We can study actual behavior instead of simply self-reports
• We can study human beings in their social context instead of in an artificial lab setting
• We can increase our N (higher power)
• Potential to uncover patterns and insights that we couldn’t investigate before
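The link between a larger N and higher power can be illustrated with a short calculation. This sketch assumes a two-sided two-sample z-test with equal group sizes and a standardized effect size `d`; the function name `approx_power` and the example values are our own choices, not taken from the course material.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample z-test for standardized effect size d."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5
    # Probability that the test statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (50, 500, 5000):
    print(n, round(approx_power(0.2, n), 3))
```

For a small effect such as d = 0.2, power rises steeply as the number of observations per group grows, which is why large digital data sets can detect effects that small samples would miss.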
Pitfalls
• Techniques are often rather complicated
• Data are often proprietary (not shared openly)
• Samples are often biased
• Data often come with insufficient metadata
• Risk of no longer understanding the models we use (black box)
Computational communication science: Why computational methods are important for
communication research
Definition:
“Computational Communication Science (CCS) is the label applied to the emerging subfield
that investigates the use of computational algorithms to gather and analyze big and often
semi- or unstructured data sets to develop and test communication science theories”
Computational communication science studies thus usually involve:
1. large and complex data sets
2. consisting of digital traces and other “naturally occurring” data
3. requiring algorithmic solutions to analyze (e.g., machine learning, LLMs)
4. allowing the study of human communication by applying and testing communication
theory
Typical research areas
• Political Communication
▪ Democratization and Polarization
▪ Hate Speech
• Social Media Use
▪ Tracking of actual social media use
▪ Spreading of behavior, information, or emotions
• Health Communication
▪ Prevalence of health information online
• (Online) Journalism
▪ News coverage across decades
▪ Gender equality
Example 1: Analyzing news coverage
• Jacobi and colleagues (2016) analyzed the coverage of nuclear technology from 1945
to 2014 in the New York Times
• Analysis of 51,528 news stories (headline and lead): Way too much for human coding!
• Used “LDA topic modeling” to extract latent topics and analyzed their occurrence
over time
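To give a feel for what LDA topic modeling does, here is a toy sketch that fits LDA with collapsed Gibbs sampling on a four-document corpus and prints the topic shares per year. Everything in it is an assumption made for illustration: the corpus, the function name `lda_gibbs`, and the hyperparameters `alpha` and `beta`. It does not reproduce the pipeline of Jacobi and colleagues, and real analyses would normally use an established library rather than a hand-written sampler.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]                       # tokens per (doc, topic)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                                     # tokens per topic

    def update(d, w, k, step):
        doc_topic[d][k] += step
        topic_word[k][w] += step
        topic_total[k] += step

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        assign.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, assign[d]):
            update(d, w, k, +1)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assign[d][i], -1)  # take the token out
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + len(vocab) * beta)
                    for k in range(n_topics)
                ]
                assign[d][i] = rng.choices(range(n_topics), weights=weights)[0]
                update(d, w, assign[d][i], +1)  # put it back under the new topic

    # Smoothed topic shares per document and the most frequent words per topic.
    theta = [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [
        sorted(vocab, key=lambda w, k=k: -topic_word[k][w])[:3]
        for k in range(n_topics)
    ]
    return theta, top_words

# Hypothetical toy corpus of (year, text) pairs.
corpus = [
    (1950, "bomb weapon war test bomb"),
    (1955, "weapon war test bomb weapon"),
    (2005, "energy plant power reactor energy"),
    (2010, "plant power reactor energy plant"),
]
theta, top_words = lda_gibbs([text.split() for _, text in corpus], n_topics=2)
for (year, _), shares in zip(corpus, theta):
    print(year, [round(s, 2) for s in shares])
print(top_words)
```

Each document receives a share for every topic, so averaging these shares per year gives the kind of "topic occurrence over time" series described above. The topics themselves are unlabeled word lists that the researcher has to interpret.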
Example 2: Facebook data to predict personality
• Kosinski and colleagues (2013) used a dataset of over 58,000 volunteers who
provided their Facebook Likes, detailed demographic profiles, and the results of
several psychometric tests
• They showed that a variety of personal characteristics and personality traits can
be predicted from simple Facebook Likes
Example 3: Gender representation in TV
• Women on average remained underrepresented on TV, with 6.3 million female faces
out of 16 million total (estimated proportion .39, 95% CI: .37-.42)
• This strong overall bias was mirrored across specific subsamples (news, sports,
advertising…)
Introduction to Automated Text analysis